Integration of behavioral and layout synthesis : a chip synthesis approach by Wu, Allen C.H
UC Irvine
ICS Technical Reports
Title
Integration of behavioral and layout synthesis : a chip synthesis approach
Permalink
https://escholarship.org/uc/item/0sk009vn
Author
Wu, Allen C.H
Publication Date
1992-04-20
 
Peer reviewed
Notice: This Material 
may be protected 
by Copyright Law 
(Title 17 U.S.C.) 
INTEGRATION OF BEHAVIORAL
AND
LAYOUT SYNTHESIS:
A CHIP SYNTHESIS APPROACH
Allen C-H Wu
Dissertation
University of California
Irvine, California 92717
April 20, 1992
Technical Report No. 92-36
Integration of Behavioral and Layout Synthesis: 
A Chip Synthesis Approach 
Allen C-H. Wu 
Technical Report #92-36 
April 20, 1992 
Dept. of Information and Computer Science 
University of California, Irvine 
Irvine, CA 92717 
(714) 856-8059
Chip synthesis deals with the transformation of a behavioral description into a fabricated 
chip. Typically, chip synthesis is carried out in three stages: behavioral, logic/sequential 
and layout synthesis. Since chip synthesis involves a multi-level synthesis task, integration 
and coordination of tasks for all levels of synthesis is the essential issue. 
This dissertation addresses a chip synthesis paradigm and describes the key issues with 
regard to the integration of behavioral and layout synthesis for chip design. In order to 
successfully integrate all tasks in the chip synthesis process, a finite-state machine with a 
datapath (FSMD) design model and a sliced-layout architecture have been developed for 
chip synthesis. Using the sliced-layout architecture, a partitioning-based layout synthesis 
method and system have been developed to synthesize layout from generalized register-
transfer (RT) netlists. In addition, based on the FSMD and the sliced-layout architecture, 
area and timing models are developed for behavioral synthesis. To incorporate layout 
information into behavioral synthesis, a unified representation is developed for behavioral 
synthesis. Using the unified representation and layout model, a layout-driven unit-binding
approach is presented. Several sets of experiments were performed to validate the proposed 
approaches including the layout-synthesis method, the layout model and the layout-driven 
unit-binding task. 
UNIVERSITY OF CALIFORNIA 
IRVINE 
Integration of Behavioral and Layout Synthesis: 
A Chip Synthesis Approach 
DISSERTATION 
submitted in partial satisfaction of the requirements for the degree of 
DOCTOR OF PHILOSOPHY 
in Information and Computer Science 
by
Chung-Hao (Allen) Wu
Dissertation Committee:
Professor Daniel D. Gajski, Chair
Professor Nikil D. Dutt
Professor Fadi J. Kurdahi
1992 
©1992
CHUNG-HAO (ALLEN) WU 
ALL RIGHTS RESERVED 
The dissertation of Chung-Hao (Allen) Wu is approved, 
and is acceptable in quality and form for 
University of California, Irvine
1992 
Dedication 
To my grandparents 
Contents
List of Figures
Acknowledgments
Abstract
Chapter 1 Introduction
1.1 Chip Synthesis Overview
1.2 Problem Description
1.3 Contributions
1.4 Thesis Overview
Chapter 2 Related Work
2.1 BUD
2.2 Chippe
2.3 Fasolt
2.4 LASSIE
2.5 Cathedral
2.6 LAGER
2.7 Summary
Chapter 3 Target Architecture
3.1 Finite-State Machine with a Datapath
3.2 Layout Architecture
3.3 Summary
Chapter 4 Layout Synthesis
4.1 System Overview
4.2 SLAM
4.3 Component Partitioning
4.4 Stack Partitioning
4.5 Glue-Logic Partitioning
4.6 Results
4.7 Conclusions
Chapter 5 Quality Measures
5.1 The Relationship between Structural and Physical Designs
5.2 Area Measures
5.3 Performance Measures
5.4 Results
5.5 Conclusions
Chapter 6 A Unified Model for Behavioral Synthesis
6.1 CDFG: Control/Data Flow Graph
6.2 The Supergraph Model
6.3 Supergraph and Structure: I
6.4 Supergraph and Structure: II
6.5 Extension: A Unified View From System To Module
6.6 Summary
Chapter 7 Binding Using Layout Information
7.1 Unit Binding
7.2 Back Annotation for Clock Estimation
7.3 Experiments
7.4 Conclusions
Chapter 8 Conclusions
8.1 Summary of Contributions
8.2 Future Work
List of Figures
1.1 Chip synthesis.
1.2 A behavioral-synthesis system for chip design.
1.3 A logic-synthesis system for chip design.
1.4 A layout-synthesis system for chip design.
1.5 The essential issues in chip synthesis: (a) the design model and
layout architecture, (b) the area/timing model, (c) a unified design
model, (d) layout-driven behavioral synthesis.
3.1 Generic FSMD block diagram.
3.2 Two datapath layout architectures: (a) bit-slice abutment,
(b) bit-sliced macros with channel routing.
3.3 The sliced unit structure.
3.4 A four-bit ALU stack: (a) instance, (b) layout.
3.5 Switch box insertion for wire alignment: (a) RT netlist,
(b) floorplan, (c) switch box insertion.
3.6 Two sliced-layout architectures: (a) unfolded stack, (b) folded stack.
4.1 The system block diagram.
4.2 Graph representation of the RT netlist: (a) RT netlist, (b) its graph
representation.
4.3 Stack partitioning based on folding: (a) linear placement, (b) unit
folding, (c) overlap checking, (d) height and width constraint checking,
(e) width compression, (f) stack area.
4.4 The adjacency graph formation: (a) a floorplan example, (b) its
corresponding adjacency graph.
4.5 Area dissection and capacity estimation.
4.6 Cut-set adjacency graph.
4.7 The floorplan of the controlled counter example.
4.8 The layout of the controlled counter example.
4.9 The floorplan of the MARK1 simple computer.
4.10 The layout of the MARK1 simple computer.
4.11 The layout of the DSP example.
4.12 The comparisons of our partitioning and floorplanning with a manual
partitioning and floorplanning: (a) total area, (b) the critical
path wire length.
4.13 The layout of a glue-logic partitioning example.
5.1 The relationship between structural and physical designs.
5.2 Two data path layout architectures using: (a) custom cells,
(b) standard cells.
5.3 The layout models: (a) datapath stack, (b) custom cell architecture,
(c) standard cell architecture.
5.4 Control unit description: (a) state table, (b) Boolean equations for
output signals, (c) two-level AND-OR implementation, (d) two-level
NAND-NAND implementation, (e) standard cell layout style.
5.5 Different aspect ratios of the control logic: (a) one-row
implementation, (b) three-row implementation.
5.6 PLA layout model: (a) logic mapping, (b) layout model.
5.7 Constituents of a chip.
5.8 Wire: (a) RT model, (b) equivalent RC delay model.
5.9 Random-logic model: (a) decomposition of a product term, (b) a
multi-level implementation, (c) the layout model.
5.10 FSMD clocking model.
5.11 The register-transfer path.
5.12 The datapath area estimates of the elliptic filter example with mux
implementation.
5.13 The datapath area estimates of the elliptic filter example with bus
implementation.
5.14 The accuracy analysis of the datapath area estimates.
5.15 The fidelity analysis: (a) good fidelity, (b) poor fidelity.
5.16 Comparative study of the elliptic filter example with different design
quality measures: (a) mux implementation, (b) bus implementation.
5.17 The clock period for four designs of the elliptical filter benchmark:
(a) table of data, (b) comparison of different timing estimation
schemes, (c) percentage error of each estimation scheme.
5.18 Delay distribution of constituents of a chip.
5.19 The estimated clock period and total execution time of four different
designs of the elliptical filter benchmark.
6.1 A unified model for behavioral synthesis.
6.2 A hierarchical control/data flow graph representation: (a) a VHDL
program, (b) the corresponding CDFG.
6.3 The graph formation: (a) a VHDL program, (b) the CDFG, (c) the
supernode formation.
6.4 Superedge merging.
6.5 Supergraph formation (cont.): (a) the superedge formation, (b) the
structural netlist.
6.6 The point-to-point datapath with one-phase clock architecture:
(a) control/data paths, (b) one-phase clocking scheme.
6.7 Control-unit formation: (a) the control-section of the supergraph,
(b) the control-state table.
6.8 Chip formation: (a) the supergraph, (b) the chip structure.
6.9 Chip formation with multiple datapaths: (a) the supergraph, (b)
the chip structure.
6.10 The multi-bus datapath with a two-phase clock architecture:
(a) control/data paths with two-pipe stage and three-pipe stage with
latch insertion (dash boxes), (b) two-phase-clock/two-pipe-stage scheme.
6.11 The schedule of the CDFG example in Figure 6.3(b).
6.12 Datapath formation: (a) the supergraph, (b) the structural netlist.
6.13 The final supergraph.
6.14 Chip formation.
6.15 Hypergraph modification: (a) the supergraph, (b) the structure.
6.16 A unified view: (a) the system hierarchy, (b) a system to module
view.
7.1 Superedge merging by node relocation: (a) before, (b) after.
7.2 Node swapping.
7.3 Interchange by considering interdependent relationship between
operation and variable assignments.
7.4 The register-to-register delay path: (a) var node insertion, (b) the
structure.
7.5 The back-annotation example: (a) DFG and schedule,
(b) operation/variable assignments, (c) var node insertion, (d) the graph,
(e) the structure.
7.6 The layout of the back-annotation example: (a) routing track
assignments, (b) the final layout.
7.7 (a) Back-annotation of wire lengths and component delays,
(b) Back-annotation of delay information to the DFG.
7.8 The schedule of the 19-step Elliptic Filter benchmark.
7.9 The structural netlist with 10 registers implementation.
7.10 The structural netlist with 11 registers implementation.
7.11 The structural netlist with 12 registers implementation.
7.12 The structural netlist with 13 registers implementation.
7.13 The data path layouts of a 16-bit elliptic filter example:
(a) architecture I, (b) architecture II.
7.14 The results of the Elliptic Filter example: part 1.
7.15 The results of the Elliptic Filter example: part 2.
7.16 The final layout of a 16-bit elliptic filter example with PLA
implementation.
7.17 The final layout of a 16-bit elliptic filter example with random-logic
implementation.
7.18 The overall area estimation of the elliptic filter example.
7.19 The area-time curve of four different designs of the elliptic filter
benchmark: (a) table, (b) AT-curve.
Acknowledgments 
First and foremost, I would like to thank my advisor, Professor Daniel D.
Gajski, for his guidance, faith and support throughout my study. I would also like
to thank the other members of my doctoral committee, Professor Nikil D. Dutt
and Professor Fadi J. Kurdahi.
I would also like to thank Professor Steve Lin for encouraging me to continue
pursuing my PhD. Otherwise, I might still be staying in Tucson, Arizona and
enjoying my nine-to-five job plus hiking and fishing on the weekend.
I would like to acknowledge my colleagues in CADLAB. They include Roger
Ang, Viraphol Chaiyakul, Gwo Dong Chen, Jim Frankin, Jie Gong, Tedd Hadley,
Pradip Jha, Kenichi Kanehara, Young Kim, Jim Kipps, Joe Lis, Sanjiv Narayan,
Loganath Ramachandran, Elke Rundensteiner, Frank Vahid and Nels Vander Zanden.
I thank you for all your valuable discussion, assistance and friendship. I would also
like to thank Bob Larsen of Rockwell International for his valuable discussions and
suggestions. I would like to extend my gratitude to my partner Viraphol Chaiyakul
for his invaluable contribution to the quality-measure research.
Finally, I would like to thank my family for the support I received from them.
This work was supported by NSF grant #MIP-8922851 and California MICRO
grant U90-046. I am grateful for their support.

Curriculum Vitae 
Chung-Hao (Allen) Wu 
EDUCATION: 
• 1992: Ph.D. in Information and Computer Science at the University of 
California, Irvine. 
(Dissertation Title: "Integration of Behavioral and Layout Synthesis: A Chip 
Synthesis Approach"). 
• 1985: M.S. in Electrical and Computer Engineering at the University of 
Arizona, Tucson. 
(Thesis Title: "A Microprocessor-Based Ultrasonic System for Measuring 
Bladder Volumes"). 
• 1983: B.S. in Electronics Engineering at the National Taiwan Institute of
Technology, Taipei, Taiwan.
• 1977: Diploma in Electronics Engineering at the Min-Hsin Institute of Technology, 
Hsinchu, Taiwan. 
EXPERIENCE:
• 1988-1992: Research Assistant, Information and Computer Science, University 
of California, Irvine. 
• 1986-1988: Electrical Engineer, Physiology Department, University Medical 
Center, University of Arizona. 
• 1983-1985: Research Assistant, Electrical and Computer Engineering, University 
of Arizona, Tucson, Arizona. 
• 1977-1980: Design Engineer, Dahsen Electronic Co., Taipei, Taiwan. 
PUBLICATIONS: 
• Conference and Journal Papers: 
1. N. Vander Zanden, Allen C-H Wu, and D. D. Gajski, "Technology
Mapping with Layout Constraints," Int. Sym. on VLSI Technology,
Systems and Applications, Taiwan, 1989.
2. Allen C-H Wu, N. Vander Zanden, and D. D. Gajski, "An Algorithm for
Transistor Sizing in CMOS Circuits," The European Conf. on Design
Automation, Glasgow, Scotland, 1990.
3. Allen C-H Wu, G. D. Chen, and D. D. Gajski, "Silicon Compilation from
Register-Transfer Schematics," Int. Sym. Circuits and Systems, 1990.
4. Allen C-H Wu and D. D. Gajski, "Partitioning Algorithms for Layout
Synthesis from Register-Transfer Netlists," Int. Conf. Computer-Aided
Design, 1990.
5. Allen C-H Wu and D. D. Gajski, "Glue-Logic Partitioning for Floorplans
with A Rectilinear Datapath," The European Conf. on Design Automation,
1991.
6. N. Vander Zanden, Allen C-H Wu, and D. D. Gajski, "Layout Synthesis
for Custom Layout," The European Conf. on Design Automation, 1991.
7. Allen C-H Wu, G. D. Chen and D. D. Gajski, "Evaluation Driven
Layout Synthesis," Int. Sym. on VLSI Technology, Systems and Applications,
Taiwan, 1991.
8. Allen C-H Wu, V. Chaiyakul, and D. D. Gajski, "Layout Models for
High-Level Synthesis," Int. Conf. Computer-Aided Design, 1991.
9. Lawrence L. Larmore, D. D. Gajski, and Allen C-H Wu, "Layout Placement
for Sliced Architecture," IEEE Trans. on Computer-Aided Design, vol.
11, no. 1, January, 1992.
10. Allen C-H Wu and D. D. Gajski, "Partitioning Algorithms for Layout
Synthesis from Register-Transfer Netlists," IEEE Trans. on Computer-Aided
Design, vol. 11, no. 3, March, 1992.
11. D. D. Gajski, N. Dutt, C-H Wu and Y-L Lin, High-Level Synthesis:
Introduction to Chip and System Design, Kluwer Academic Publishers,
1992.
12. D. D. Gajski, Allen C-H. Wu, and Viraphol Chaiyakul, "Layout Synthesis
and Layout Models for Synthesis," SASIMI'92, 1992.
13. Viraphol Chaiyakul, Allen C-H Wu and D. D. Gajski, "Timing Models
for High Level Synthesis," EURO-DAC'92, 1992.
• Technical Reports:
1. Allen C-H Wu, N. Vander Zanden, and D. D. Gajski, "An Algorithm for
Transistor Sizing in CMOS Circuits," Tech. Report 89-04, ICS Dept.,
U.C. Irvine, 1989.
2. N. Vander Zanden, Allen C-H Wu, and D. D. Gajski, "Performance
Optimization in Layout Driven Synthesis," Tech. Report 89-21, ICS
Dept., U.C. Irvine, 1989.
3. Allen C-H Wu and D. D. Gajski, "SLAM: An Automated Structure to 
Layout Synthesis System," Tech. Report 89-40, ICS Dept., U.C. Irvine, 
1989. 
4. D. D. Gajski, Joseph Lis, N. Vander Zanden, and Allen C-H Wu, 
"Synthesis from VHDL: Rockwell-Counter Case Study," Tech. Report 
90-09, ICS Dept., U.C. Irvine, 1990. 
5. Allen C-H Wu and D. D. Gajski, "A New Partitioning Approach for 
Layout Synthesis from Register-Transfer Netlists," Tech. Report 90-10, 
ICS Dept., U.C. Irvine, 1990.
6. Allen C-H Wu, "Survey of Partitioning Techniques in Silicon Compilation,"
Tech. Report 91-15, ICS Dept., U.C. Irvine, 1991.
7. Allen C-H Wu, Viraphol Chaiyakul and D. D. Gajski, "Back-Annotation
for Interactive Data Path Synthesis," Tech. Report 91-29, ICS Dept.,
U.C. Irvine, 1991.
8. Allen C-H Wu and D. D. Gajski, "Layout-Driven Allocation for High-Level
Synthesis," Tech. Report 91-30, ICS Dept., U.C. Irvine, 1991.
9. Allen C-H Wu, Viraphol Chaiyakul and D. D. Gajski, "Layout Area
Models for High Level Synthesis," Tech. Report 91-31, ICS Dept., U.C.
Irvine, 1991.
10. Allen C-H Wu, Joe Lis and D. D. Gajski, "Partitioning-Based Algorithm
for Pipelined Scheduling and Module Assignment," Tech. Report 91-32,
ICS Dept., U.C. Irvine, 1991.
11. Viraphol Chaiyakul, Allen C-H Wu and D. D. Gajski, "Timing Models
for High Level Synthesis," Tech. Report 91-70, ICS Dept., U.C. Irvine,
1991.
12. Tedd S. Hadley, Allen C-H Wu and D. D. Gajski, "An Efficient Multi-View
Design Model for Real-Time Interactive Synthesis," Tech. Report
92-35, ICS Dept., U.C. Irvine, 1992.
Professional Activities: 
• Reviewer for Design Automation Conference (DAC). 
• Member IEEE, ACM. 

Abstract of the Dissertation 
Integration of Behavioral and Layout Synthesis: 
A Chip Synthesis Approach 
by 
Chung-Hao (Allen) Wu 
Doctor of Philosophy in Information and Computer Science 
University of California, Irvine, 1992 
Professor Daniel D. Gajski, Chair 
Chip synthesis deals with the transformation of a behavioral description into
a fabricated chip. Typically, chip synthesis is carried out in three stages:
behavioral, logic/sequential and layout synthesis. Since chip synthesis involves a
multi-level synthesis task, integration and coordination of tasks for all levels of
synthesis is the essential issue.
This dissertation addresses a chip synthesis paradigm and describes the key
issues with regard to the integration of behavioral and layout synthesis for chip
design. In order to successfully integrate all tasks in the chip synthesis process, a
finite-state machine with a datapath (FSMD) design model and a sliced-layout
architecture have been developed for chip synthesis. Using the sliced-layout
architecture, a partitioning-based layout synthesis method and system have been
developed to synthesize layout from generalized register-transfer (RT) netlists.
In addition, based on the FSMD and the sliced-layout architecture, area and timing
models are developed for behavioral synthesis. To incorporate layout information
into behavioral synthesis, a unified representation is developed for behavioral
synthesis. Using the unified representation and layout model, a layout-driven
unit-binding approach is presented. Several sets of experiments were performed to
validate the proposed approaches including the layout-synthesis method, the layout
model and the layout-driven unit-binding task.
Chapter 1 
Introduction
Microelectronic technology has advanced tremendously in the past decade.
In the early 1990s, VLSI technology reached chip densities of up to one
million transistors. Chips of such complexity are very difficult for human
designers to design. In order to successfully exploit new VLSI technology, design
automation at the chip level is needed to facilitate the chip design process and to
shorten the time-to-market cycle.
1.1 Chip Synthesis Overview 
Chip synthesis deals with the transformation of a behavioral description into 
a fabricated chip. Typically, chip synthesis is carried out in three stages: behavioral 
synthesis, logic/sequential synthesis and layout synthesis, as shown in Figure 1.1. 
[Figure 1.1: Chip synthesis. Stages shown: behavioral synthesis, logic/sequential
synthesis, layout synthesis, fabricated chip.]
Behavioral synthesis transforms a given behavioral description into a
structural description. Figure 1.2 shows a generic behavioral-synthesis system for
chip design [GDWL92]. The system consists of a compiler, a scheduler, a number of
allocators, a module selector, a component database (CDB), a technology mapper
and a microarchitecture optimizer. The compiler compiles the input behavioral
description into a design representation such as a control/data flow graph
(CDFG). The Scheduler assigns operations to control steps. Storage, Functional
and Interconnect unit allocators bind variables to registers and memories, operators
to functional units and data transfers to buses. The Module selector determines
the RT components to be used in the design, whereas the Storage merger groups
registers into register files or merges a set of small memories into a large memory.
The component database (CDB) manages the RT component library and answers
queries for component information during the synthesis process. The Technology
mapper assigns real components from the CDB to the generic components in the
structural description. The Microarchitecture optimizer improves the area and delay
of the structural design.
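The scheduling step described above, which assigns operations to control steps, can be illustrated with a minimal sketch. The code below uses as-soon-as-possible (ASAP) scheduling, a standard textbook technique chosen here for illustration only; it is not necessarily the scheduler used in this work. It assumes that every operation takes exactly one control step and that functional units are unlimited. The function name `asap_schedule` and the operation names are hypothetical.

```python
from collections import defaultdict


def asap_schedule(operations, dependencies):
    """Assign each operation the earliest 1-based control step in which
    all of its predecessors have already executed.

    operations   -- iterable of operation names (hypothetical labels)
    dependencies -- list of (producer, consumer) pairs from the data flow graph
    Assumes one control step per operation and unlimited resources.
    """
    preds = defaultdict(list)
    for producer, consumer in dependencies:
        preds[consumer].append(producer)

    step = {}

    def visit(op, active=()):
        if op in step:
            return step[op]
        if op in active:
            raise ValueError(f"cyclic dependency through {op!r}")
        # An operation runs one step after its latest predecessor.
        latest = max((visit(p, active + (op,)) for p in preds[op]), default=0)
        step[op] = latest + 1
        return step[op]

    for op in operations:
        visit(op)
    return step


# Hypothetical data flow graph for y = (a + b) * (c - d)
ops = ["add1", "sub1", "mul1"]
deps = [("add1", "mul1"), ("sub1", "mul1")]
schedule = asap_schedule(ops, deps)
```

In this example the addition and subtraction share the first control step and the multiplication is placed in the second. A resource-constrained scheduler would additionally limit how many operations of each type may share a step.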
[Figure 1.2: A behavioral-synthesis system for chip design. Blocks shown:
behavioral description, compiler, scheduler, storage allocator, storage merger,
functional unit allocator, interconnection allocator, module selector, technology
mapper, microarchitecture optimizer, CDB; output to logic/sequential synthesis
and physical design.]
Logic/sequential synthesis transforms a structural description generated by
a behavioral-synthesis system into a hardware description for layout synthesis. A
structural description usually includes a control-state table, a datapath RT netlist
(or a set of Boolean expressions), an interface specification with timing or
waveform diagrams, memory sizes and I/O port information. Since each component
in a structural description has different characteristics, a different synthesis
technique must be applied to each of these components. Figure 1.3 shows a
hypothetical logic-synthesis system that is capable of performing control synthesis,
functional synthesis, interface synthesis and memory synthesis.
[Figure 1.3: A logic-synthesis system for chip design. Inputs shown: state table,
Boolean expressions, timing diagrams, memory specifications. Blocks shown: state
encoding, timing graph compiler, interface synthesis, memory synthesis; output to
physical design.]
Control synthesis converts an input control-state table to a control unit. The
control unit can be implemented as two-level logic using a PLA or as multiple-level
logic using standard cells. Control synthesis is usually carried out in four
stages: State minimization, State encoding, Logic minimization and Technology
mapping (Figure 1.3). State minimization reduces the number of states in the
control-state table by eliminating redundant states. State encoding assigns binary
codes to symbolic states. Logic minimization reduces the cost and improves the
delay of the control logic. Technology mapping maps the control unit to a target
implementation such as a PLA or random gates. Similarly, Boolean expressions for
a functional unit can be synthesized using the logic minimization and technology
mapping stages. Interface synthesis converts a given timing diagram into a set of
random gates, latches and flip-flops to satisfy the given communication-protocol
requirement. Memory synthesis generates a memory description to satisfy the
given requirements such as size, access time and access protocol.
[Figure 1.4: A layout-synthesis system for chip design. Blocks shown: RT netlist,
style selection, partitioning (1D, 2D, 1-bit), component instances from CDB,
floorplanning, routing; output to ASIC manufacturing.]
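State encoding, one of the control-synthesis stages described above, can be sketched as follows. This is a deliberately naive illustration: it hands out fixed-width binary codes in counting order, whereas a practical encoder would choose codes to reduce the cost of the resulting control logic. The function name `encode_states` and the state names are hypothetical and do not come from this work.

```python
def encode_states(states):
    """Map each symbolic state to a distinct fixed-width binary code.

    Codes are assigned sequentially in first-seen order using the minimum
    number of bits; no attempt is made to minimize the control logic.
    """
    unique = list(dict.fromkeys(states))  # drop duplicates, keep order
    # Minimum bits needed to distinguish len(unique) states (at least 1).
    width = max(1, (len(unique) - 1).bit_length())
    return {state: format(index, f"0{width}b")
            for index, state in enumerate(unique)}


# Hypothetical five-state controller; five states need three bits.
codes = encode_states(["IDLE", "FETCH", "EXECUTE", "WRITEBACK", "HALT"])
```

With five states the sketch produces three-bit codes, leaving three of the eight possible codes unused; such unused codes are typically treated as don't-care conditions during logic minimization.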
Layout synthesis transforms an RT netlist generated by a behavioral-synthesis
system into a fabrication description of the chip. Figure 1.4 shows a representative
layout-synthesis system for chip design. This system first partitions RT components
into two groups, Stack and Glue logic, according to their style, dimension
and connectivity. For instance, two-dimensional components, such as multipliers,
memories and register files, are implemented as macrocells. One-dimensional
components, such as a 16-bit register and a 16-bit ALU, can be grouped together and
implemented as a bit-sliced stack. The system then performs a floorplanning
procedure to determine the size and orientation of layout blocks such as bit-sliced
stacks, macrocells and standard-cell blocks. Finally, each block is laid out
independently and a global router is used to complete the connections between blocks.
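The first step of the layout-synthesis flow described above, sorting RT components by their layout dimension, can be sketched roughly as below. This is a simplified illustration under stated assumptions: it classifies by component kind and bit width only, and ignores the style and connectivity criteria that the text also mentions. The class `RTComponent`, the kind tables and the group names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RTComponent:
    name: str
    kind: str       # e.g. "register", "alu", "multiplier", "memory", "gate"
    bit_width: int


# Hypothetical classification tables, loosely following the examples in the text.
TWO_DIMENSIONAL = {"multiplier", "memory", "register_file"}
ONE_DIMENSIONAL = {"register", "alu", "mux", "adder"}


def partition_components(components):
    """Group components into macrocells, a bit-sliced stack and glue logic."""
    groups = {"macrocell": [], "stack": [], "glue_logic": []}
    for comp in components:
        if comp.kind in TWO_DIMENSIONAL:
            groups["macrocell"].append(comp.name)
        elif comp.kind in ONE_DIMENSIONAL and comp.bit_width > 1:
            # Multi-bit regular units can be laid out as bit slices.
            groups["stack"].append(comp.name)
        else:
            # Single-bit and irregular logic is left as glue logic.
            groups["glue_logic"].append(comp.name)
    return groups


netlist = [
    RTComponent("mult0", "multiplier", 16),
    RTComponent("reg0", "register", 16),
    RTComponent("alu0", "alu", 16),
    RTComponent("ctl_and", "gate", 1),
]
groups = partition_components(netlist)
```

Here the multiplier becomes a macrocell, the 16-bit register and ALU go to the bit-sliced stack, and the single gate remains glue logic. A fuller treatment would also weigh connectivity between components before fixing the groups.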
1.2 Problem Description 
Since chip synthesis is a multi-level synthesis task, integration and
coordination of tasks for all levels of synthesis is the essential issue in chip
synthesis. Successfully integrating all tasks in chip synthesis has the following
primary requirements: a consistent and well-defined target architecture for all
synthesis levels, an RT-based layout-synthesis method, a unified design model for
all synthesis levels and a technique for incorporating layout synthesis into
behavioral synthesis.
It is difficult to develop a general-purpose synthesis system that will produce
high-quality results for a variety of target applications. Thus, existing synthesis
systems have focused their capabilities on a selected target architecture or
application to reduce the complexity of the design process. A consistent and
well-defined target architecture is required in chip synthesis. This allows the
tools at different synthesis levels to use a consistent architecture, so that
coordination between different synthesis tasks can be established. Furthermore, to
carry out layout generation in chip synthesis, a layout-synthesis method is needed
to transform an RT netlist into a chip layout.
In order to integrate the tools at all synthesis levels, a unified design model
is needed. Using such a model, synthesis tools can retrieve layout information at
any design stage and use this information to guide the design process. Typically,
there are two approaches to incorporating layout information into behavioral-level
synthesis. The first approach uses some simple layout model to estimate design
quality. The drawback of using simplified layout models is the low accuracy of the
estimates obtained. The second approach feeds back the actual layout information
by performing layout synthesis. This approach does provide accurate estimates;
however, it is too slow for nontrivial designs. To provide a fast and accurate
quality measure in chip synthesis, an accurate layout model that includes area and
timing models is required. Furthermore, a technique that combines all models,
including layout and design models, for chip synthesis is needed.
1.3 Contributions 
[Figure 1.5: The essential issues in chip synthesis: (a) the design model and layout
architecture, (b) the area/timing model, (c) a unified design model, (d) layout-driven
behavioral synthesis.]
This dissertation describes a number of important issues involved in chip
synthesis (Figure 1.5). A design model called finite-state machine with a datapath
(FSMD) and a layout architecture called sliced-layout architecture are presented.
Using the sliced-layout architecture, a partitioning-based layout-synthesis method
and system will be discussed (Figure 1.5(a)). Based on the FSMD and the
sliced-layout architecture, area and timing models are developed for behavioral
synthesis (Figure 1.5(b)). To incorporate layout information into behavioral
synthesis, a unified design model is developed for behavioral synthesis
(Figure 1.5(c)). Two different layout-driven unit-binding approaches, using feedback
layout information (Path2) and using the layout model (Path1), are presented
(Figure 1.5(d)). The novel contributions of this work are described below.
• The sliced-layout architecture. We have modelled chips using the FSMD
model. To realize the FSMD model, we developed a sliced-layout architecture
for compilation of generalized RT netlists into layout for CMOS technology.
This sliced-layout architecture combines over-the-cell routing and bit-sliced
stack folding to produce flexible and high-density layout.
• Partitioning-based layout synthesis for RT netlists. Using the
sliced-layout architecture, we developed a partitioning approach for layout synthesis
from RT netlists that exploits the regularity of RT components. Three
partitioning algorithms, component partitioning, stack partitioning and glue-logic
partitioning, were developed to generate the chip floorplan. Using these
algorithms, a layout-synthesis system (SLAM) was developed to bridge the gap
between behavioral and layout synthesis tools.
• Area and timing models for behavioral synthesis. To obtain more
realistic quality measures for behavioral synthesis, a layout model, including
area and timing models, is developed. This layout model takes into account
most technology factors such as layout architectures and technology mapping,
and thus provides more accurate estimates than previously proposed models.
• A unified design representation. A unified model was developed to
bridge the gap between behavioral and layout synthesis. This model provides
three important features for incorporating layout information into behavioral
synthesis. First, this model encapsulates both the structure and the
behavior of the design. Second, this model provides a structural hierarchy
of chips. Third, this model provides a unified behavior/structure view of
the design that is well suited for interactive synthesis. Using this proposed
unified model, synthesis tools can retrieve layout information at any design
level.
• Incorporation of layout information into behavioral synthesis. In
order to incorporate layout information into behavioral synthesis, two methods,
one using the layout model and one using feedback layout information, were
developed. To provide fast and accurate area measures in the datapath design
process, we developed a new approach that uses our proposed layout model for
datapath optimization. In order to use the area model, we model the datapath
as a graph that explicitly reflects the datapath floorplan. To take advantage
of layout information, we formulate datapath binding as a graph partitioning
problem. Unlike other datapath optimization algorithms, which minimize the
number and size of registers and muxes, our algorithm evaluates layout-area
quality during datapath optimization. Our approach provides faster and more
accurate area quality measures for datapath optimization than previous
approaches.
1.4 Thesis Overview 
This dissertation is organized as follows. Chapter 2 surveys related work 
in behavioral and layout synthesis and their applications. Chapter 3 presents 
the target architecture. Chapter 4 describes a layout-synthesis system for layout
generation from generalized RT netlists. Chapter 5 presents the area and timing
quality measures in behavioral synthesis. Chapter 6 describes a unified model for
behavioral synthesis. Chapter 7 presents the layout-driven unit-binding approach. 
Chapter 8 summarizes the accomplishments of this research and outlines future 
work. 

Chapter 2 
Related Work 
This chapter surveys related work in the area of behavioral and layout synthesis,
in particular the methods of incorporating layout information into behavioral
synthesis and the linkage between layout and synthesis tools.
In this chapter, Sections 2.1, 2.2, and 2.3 describe three different approaches
to incorporating layout information into behavioral synthesis. BUD [McFa86, McKo90]
uses a layout model and a floorplanner to provide physical information for module
selection in behavioral synthesis. Chippe [BrGa90] and Fasolt [Knap89] use a
feedback-driven technique to guide design tasks in behavioral synthesis. Section
2.4 describes LASSIE [TrDi89], a layout-synthesis tool used to generate
layout for behavioral-synthesis tools. Sections 2.5 and 2.6 describe two complete
CAD systems for chip design: Cathedral [DRSC86, NGCD91] focuses on DSP chip
design and LAGER [RaPB85, Shun91] focuses on algorithm-specific chip design.
Finally, Section 2.7 summarizes the previous approaches.
2.1 BUD 
BUD [McFa86, McKo90] introduces physical-level information to guide
behavioral-synthesis tasks, in particular module selection. It uses a hierarchical
clustering technique to partition operations in the control/data flow graph
into clusters based on their similarity and concurrency. BUD first forms a
hierarchical clustering tree. In such a tree, the leaves represent the operations to be
clustered, and the number of non-leaf nodes between two leaf nodes defines the 
relative similarity between these two operations. 
The cluster tree guides the search of the design space. A series of different
module (i.e., functional unit) sets is generated by starting with a cut-line at the root
and by moving the cut-line toward the leaves. Any cut across the tree partitions
the tree into a set of subtrees, each of which represents a cluster. For each cut-line
a different module set is formed and the resulting design is evaluated. Thus, the
first design has all operations in the same cluster, sharing the same tightly coupled 
datapath. The second is made up of two clusters that have the greatest distance 
between them, and so on. This process continues until a stopping criterion, such 
as the lack of improvement in the design for a predetermined number of module 
sets, is reached. 
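The cut-line sweep described above can be illustrated with a small sketch. This is
not BUD's implementation: the Node class, the integer height used as the distance
measure, and the operation names are assumptions made for illustration, and design
evaluation and the stopping criterion are left out.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """A node of a (hypothetical) binary cluster tree."""
    height: int                      # distance at which the children merge; 0 for leaves
    op: Optional[str] = None         # operation name, set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf_ops(node: Node) -> List[str]:
    """Collect the operations below a node, left to right."""
    if node.op is not None:
        return [node.op]
    return leaf_ops(node.left) + leaf_ops(node.right)


def cut_line_sweep(root: Node) -> Iterator[List[List[str]]]:
    """Yield one clustering per cut-line position, moving from root to leaves."""
    frontier = [root]
    yield [leaf_ops(n) for n in frontier]
    while any(n.op is None for n in frontier):
        # Lower the cut-line past the highest internal node still on the frontier.
        top = max((n for n in frontier if n.op is None), key=lambda n: n.height)
        frontier = [n for n in frontier if n is not top] + [top.left, top.right]
        yield [leaf_ops(n) for n in frontier]


adds = Node(height=1, left=Node(0, "add1"), right=Node(0, "add2"))
root = Node(height=2, left=adds, right=Node(0, "mul1"))
module_sets = list(cut_line_sweep(root))
```

In this toy tree the first clustering puts all three operations in one cluster, the
second separates the multiplication from the two additions, and the last gives each
operation its own cluster.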
Each time a new unit set is selected, the design is evaluated. First, the 
functional units required to execute all operations in each cluster are assigned to 
the design. Next, all operations are scheduled into control steps. After scheduling 
has been completed, the lifetimes of all values are computed and the maximum 
number of bits required to store those values between any two control steps is 
determined. The number of interconnections between each pair of clusters is also 
determined. With this information, the number and size of registers, multiplexers
and wires within each cluster are known, so that the length, width and area of each
cluster can be estimated using an area model. An approximate floorplanner based
on a min-cut partitioning algorithm is used to estimate the total chip area. I/O
pads are placed around the boundary of the chip for the global variables given
in the behavioral description and for any off-chip memory. Finally, the datapath
delays and the clock period are estimated using a timing model based on
register-transfer delay. The total area and clock-cycle time are used to compare the design
with others that have been encountered in the search.
2.2 Chippe
Chippe [BrGa90] is an expert system that uses a closed-loop design iteration
technique to perform behavioral synthesis. Chippe uses a "knobs and gauges"
approach to simplify the communication between experts and tools. Knobs specify
the design goals, such as area, performance and power. Gauges indicate a set of
scalar quality measures of the present state of the design. The design process
begins with an initial design. An area and timing model is used to evaluate the
design quality. The quality measures are then fed back to the tools to guide the
design decisions. The process is repeated until the given constraints are satisfied.
Chippe uses a simple datapath and control-unit (PLA) layout model for area and
delay measures, but it does not consider the effects of wiring delay and area.
2.3 Fasolt
Fasolt [Knap89] is an interactive feedback-driven datapath synthesis tool that
uses layout information to drive decision making in the scheduling and allocation
tasks. Fasolt first constructs a schedule and an RT-level datapath structure. Then,
it uses a layout estimator to perform macrocell-style placement and routing. The
resulting layout is back-annotated with geometric information. Fasolt uses this
information to merge buses and to resolve scheduling conflicts caused by the bus
merging. The process is repeated until there is no more improvement.
2.4 LASSIE
LASSIE [TrDi89] is a layout-synthesis system that synthesizes layout from
structural netlists generated by behavioral synthesis tools [TLWN90]. In
LASSIE, the design process is divided into four steps: structural binding,
partitioning, placement and routing. Structural binding maps the generic RT components
onto a set of module generators that are available in the given technology, such as
bit-sliced stacks or standard cells. When a stack contains bit-sliced units with a
wide range of bit-widths, a large amount of empty space exists within the datapath
bounding box. Partitioning is used to cluster bit-sliced units of similar size together to
form a stack and thus to reduce the empty space in the bounding box of a large
stack. Partitioning is also used to divide a large datapath into multiple modules
when the datapath is too large to fit into a chip. Finally, LASSIE performs
placement and routing procedures to produce datapath layouts. In essence, LASSIE
provides a framework and implements a structural binding (mapping) step that
adapts the output of behavioral synthesis tools to different datapath-layout styles.
2.5 Cathedral 
Cathedral is a CAD system for DSP chip design. In the early version,
Cathedral II [DRSC86] uses a microcoded processor architecture that consists of
a number of customized, highly programmable datapaths called execution units.
The execution units are connected by dedicated buses and controlled by a
microcoded controller. This architecture has very high programmability; however, it
is only suited for low-speed DSP chips. In order to achieve high-speed DSP design,
Cathedral III [NGCD91] uses a hardwired, lowly multiplexed datapath architecture
that can accommodate different styles of functional units to satisfy different
performance requirements.
Cathedral takes a DSP algorithm as input, performs memory allocation, datapath
scheduling and allocation, and then generates a structural output. Finally,
the datapaths are generated using a datapath layout assembler [CNSD90] and the
control unit is implemented using a PLA or standard cells.
2.6 LAGER 
LAGER [RaPB85, Shun91] is a CAD system for algorithm-specific IC design.
LAGER consists of two main subsystems: a behavioral mapper and a silicon
assembler. The behavioral mapper contains three components: a high-level compiler
(Silage [Hilf85]), a control compiler (RL [RiHi88]) and a control mapper (Sass).
Silage maps a signal-processing-like algorithm onto a predefined architecture
that contains one or more datapaths and one control unit. RL and Sass generate
a control sequence for the target control architecture. The silicon assembler
transforms the output generated by the behavioral mapper into the final layout. It
first generates a netlist for each functional unit, then simulates the
netlist, followed by layout generation, verification and performance estimation,
and finally generates the mask file. In their implementation, the datapath is
implemented using a bit-sliced stack while the control unit is implemented using a
PLA or standard cells.
2.7 Summary 
In summarizing the existing approaches, the following observations can be 
made: 
• There are two approaches to incorporating layout information into behavioral
synthesis. The first approach uses some simple layout model to estimate
design area quality. The drawback of using simplified layout models is the low
accuracy of the estimates obtained. The second approach feeds back the actual
layout information by performing layout synthesis. This approach does
provide accurate estimates; however, it is too slow for nontrivial designs.
• There is no complete and well-defined layout-synthesis tool for layout
generation from generalized RT netlists.
• The existing chip-synthesis systems use a top-down design methodology
targeted toward some application-specific architecture. One drawback of using the
top-down design methodology is that the layout can be obtained only after
the completion of the higher-level tasks, including behavioral synthesis and
logic synthesis. In other words, higher-level synthesis tools have to make
design decisions without layout information or using some simplified layout
model.
From these observed shortcomings of current efforts in chip synthesis, the
following goals of the research presented in this dissertation can be stated:
• Definition of a general target architecture for various applications.
• Development of a layout-synthesis technique and system for layout generation
from generalized RT netlists.
• Definition and development of an accurate layout model that provides fast
and accurate estimates for behavioral synthesis.
• Development of various layout-driven techniques for behavioral synthesis.
The remainder of this dissertation will present the details of the approach
that has been taken to achieve these goals.
Chapter 3 
Target Architecture 
This chapter presents the design model, called the finite-state machine with a
datapath (FSMD), and the layout architecture, called the sliced-layout architecture.
The remainder of this chapter is organized in the following manner. Section
3.1 describes the FSMD design model. Section 3.2 presents the layout architecture:
Section 3.2.1 describes our motivation by introducing several commonly
used layout architectures, including standard cells, bit-sliced macros with abutment
and bit-sliced macros with channel routing, and Section 3.2.2 presents the sliced-layout
architecture. Finally, Section 3.3 summarizes our target architecture.
3.1 Finite-State Machine with a Datapath 
In computer science and engineering, the finite-state machine (FSM) is one of the
most popular design models. The FSM model usually works well for a small set
of states (a few to several hundred states). However, for a complex design (up to
several thousand states), the FSM model becomes too complicated for human
designers to comprehend. In order to make the FSM model usable for more complex
designs, Gajski et al. [GDWL92] extend the FSM model to a finite-state machine
with a datapath (FSMD) model.
Figure 3.1: Generic FSMD block diagram
The FSMD model is particularly useful for describing digital systems at the
register-transfer level. Figure 3.1 shows a generic FSMD implementation that consists of
a control unit and a datapath. Typically, a datapath contains a set of
register-transfer components that perform data processing tasks. The control unit consists
of a state register, next-state logic and control output logic. Together they specify
the control signals that make the datapath execute data computations in each state,
and they determine the next state in the sequence.
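As an illustration of this split between control unit and datapath, the following
minimal sketch models a small FSMD that computes a greatest common divisor by
repeated subtraction. The GCD example, the state names and the signal names are
assumptions chosen for illustration; they do not come from [GDWL92] or from this
dissertation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Datapath:
    """Two registers plus a subtractor; inputs must be positive integers."""
    a: int
    b: int

    def status(self) -> Dict[str, bool]:
        # Status signals sent from the datapath to the control unit.
        return {"neq": self.a != self.b, "gt": self.a > self.b}

    def execute(self, control: Dict[str, bool]) -> None:
        # Register transfers selected by the control signals.
        if control.get("sub_a"):
            self.a -= self.b
        if control.get("sub_b"):
            self.b -= self.a


def output_logic(state: str, status: Dict[str, bool]) -> Dict[str, bool]:
    """Control output logic: datapath control signals for the current state."""
    if state == "LOOP" and status["neq"]:
        return {"sub_a": status["gt"], "sub_b": not status["gt"]}
    return {}


def next_state_logic(state: str, status: Dict[str, bool]) -> str:
    """Next-state logic: leave the loop once both registers are equal."""
    if state == "LOOP" and not status["neq"]:
        return "DONE"
    return state


def run_gcd(a: int, b: int) -> int:
    datapath = Datapath(a, b)
    state = "LOOP"                      # contents of the state register
    while state != "DONE":
        status = datapath.status()
        datapath.execute(output_logic(state, status))
        state = next_state_logic(state, status)
    return datapath.a


result = run_gcd(12, 18)
```

Each loop iteration corresponds to one clock cycle: the control unit reads the status
signals, drives the control signals, and loads the next state.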
3.2 Layout Architecture 
To realize the FSMD design model, this section presents the sliced-layout
architecture for physical design. Section 3.2.1 discusses the motivation and gives an
overview of other layout architectures. Section 3.2.2 describes the sliced-layout
architecture.
3.2.1 Motivation 
Surveys of VLSI products reveal that most fabricated chips can be described
by register-transfer (RT) schematics or netlists. In addition to gates, latches and
flip-flops, schematics include RT components such as registers, counters, adders,
ALUs, shifters, multiplexers, bus drivers and register files. The products in this
category, including DMA controllers, bus controllers, disk controllers and
programmable I/O interfaces, fit well into the FSMD model.
The commonly used layout architecture for such designs is standard cells.
Using the standard-cell architecture, the chip microarchitecture is decomposed
into gates, such as NAND, NOR, AND and OR gates, and storage components,
such as latches and flip-flops. Each of these elements is laid out manually or using
a cell generator. All cells have the same height but different widths. The cells are
placed in rows and every two rows are separated by a routing area called a routing
channel. The standard-cell architecture usually uses excessive routing area and does
not take advantage of the replicability of RT components that consist of many
identical bit slices. Each bit slice can be laid out as one cell instead of several
standard cells. Area is reduced for several reasons: (1) the logic
of the bit slice can be better optimized by use of complex gates, (2) transistors
can be better packed and connected, and (3) connections can be achieved through
abutment instead of through the wiring in the channel. Performance improvement
is obtained through reduced wire length and proper transistor sizing.
Figure 3.2: Two datapath layout architectures: (a) bit-slice abutment, (b) bit-sliced
macros with channel routing.
In order to exploit the regularity of RT components, two commonly used
datapath layout architectures have been developed. The first layout architecture
uses abutment to connect different bit slices and over-the-cell routing for connecting
different units inside one bit slice. This layout architecture is widely used in
microprocessors [JaJe85, Joha79, LuDe89, Sout83] and DSP chips [PWSE86, RDVG88].
For such datapath-oriented designs, a chip will contain one or more datapath stacks
and one control unit. This architecture, however, wastes area if units with varying
bit-widths are in the same stack, as commonly found in designs like DMA and
bus controllers. Furthermore, the connection of bit slices with different indices is
difficult. For example (Figure 3.2(a)), when 4-bit and 8-bit units are laid out in
this bit-sliced style, four bit slices are wasted for each 4-bit unit. In addition, if bits 7
and 8 of unit1 must be connected to bits 1 and 2 of unit3 (Figure 3.2(a)), routing
across bit slices must be introduced.
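The wasted-area argument can be made concrete with a short calculation. The sketch
below assumes that every bit slice occupies the same area, which is a simplification,
and the list of unit widths is a made-up example rather than the exact stack of
Figure 3.2(a).

```python
from typing import List


def wasted_slices(bit_widths: List[int]) -> List[int]:
    """Empty bit slices per unit when all units share one abutted stack."""
    stack_width = max(bit_widths)        # the widest unit sets the stack width
    return [stack_width - w for w in bit_widths]


def utilization(bit_widths: List[int]) -> float:
    """Fraction of slice positions in the bounding box that hold a real slice."""
    stack_width = max(bit_widths)
    return sum(bit_widths) / (stack_width * len(bit_widths))


widths = [8, 4, 8, 4]                    # hypothetical stack of 8-bit and 4-bit units
wasted = wasted_slices(widths)
used = utilization(widths)
```

For this stack, each 4-bit unit leaves four slice positions empty and only three
quarters of the bounding box is occupied.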
The second layout architecture stacks bit-sliced macros vertically with routing
channels between units, as shown in Figure 3.2(b). Using this layout architecture,
several units with smaller bit-widths can be placed in the same row in order
to reduce wasted space. However, routing channels for wire connections between
the units contribute to low area utilization.
To alleviate the problems of the architectures mentioned above, in the next section
we describe a new layout architecture called the sliced-layout architecture.
3.2.2 Sliced-Layout Architecture 
The sliced-layout architecture uses a stack of bit-sliced RT units. Figure 3.3
shows the structure of a bit-sliced unit. All units have the same width and a fixed
number of second-metal routing tracks over each unit (13 tracks in our implementation).
The unit's height, on the other hand, varies with the unit's functionality. Using the
over-the-cell routing strategy, the data signals run vertically in the second-metal
layer over the bit slices. Power, ground, carry and control lines are routed horizontally
in the first-metal or polysilicon layer between the bit slices. The layout of each unit is
designed manually. The stack layout is produced by a parameterizable generator
[WuCG90] according to the given bit-width and I/O pin positions. The stack grows
horizontally when the bit-width increases and grows vertically when the number
of units increases. A four-bit ALU stack is shown in Figure 3.4.
Figure 3.3: The sliced unit structure.
Figure 3.4: A four-bit ALU stack: (a) instance, (b) layout.
When several units of different bit-widths are stacked, units are ordered by
bit-width from the top to the bottom and aligned by the least significant bit.
Figure 3.5(a) shows an example consisting of three registers Reg1, Reg2 and Reg3.
The bit width of Reg1 is eight, while the bit widths of Reg2 and Reg3 are five and
four, respectively. The five least significant bits of Reg1 are connected to Reg2
while the four most significant bits of Reg1 are connected to Reg3. Figure 3.5(b)
shows the floorplan that contains a step-shaped triangular area in the stack
bounding box. This area can be used for stack folding or for placement of the non-sliceable
glue-logic components (Figure 3.5(c)). Furthermore, a routing channel called a
switch box will be inserted in the stack to connect bit slices with different indices.
In our example, the interconnections between Reg1 (bits 4-7) and Reg3 (bits 0-3)
are not routable without a switch box. Therefore, a switch box is inserted, as
shown in Figure 3.5(c). Using the same switch box, signals may enter or exit the
sliced stack from the left or right.
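The ordering rule and the switch-box condition for the Reg1/Reg2/Reg3 example can
be sketched as follows. The function names are hypothetical, and the test for a
switch box (any connection whose source and destination bit indices differ) is an
assumed simplification of the routing decision, not the generator's actual check.

```python
from typing import Dict, List, Sequence


def order_stack(bit_widths: Dict[str, int]) -> List[str]:
    """Order units from top to bottom by decreasing bit-width."""
    return sorted(bit_widths, key=lambda name: bit_widths[name], reverse=True)


def empty_slices(bit_widths: Dict[str, int]) -> Dict[str, int]:
    """Free slice positions beside each unit in an LSB-aligned stack."""
    stack_width = max(bit_widths.values())
    return {name: stack_width - w for name, w in bit_widths.items()}


def needs_switch_box(src_bits: Sequence[int], dst_bits: Sequence[int]) -> bool:
    """With LSB alignment, bit i of every unit sits in column i, so a vertical
    over-the-cell wire only works when source and destination indices match."""
    return any(s != d for s, d in zip(src_bits, dst_bits))


widths = {"Reg2": 5, "Reg1": 8, "Reg3": 4}
order = order_stack(widths)
gaps = empty_slices(widths)
reg1_to_reg2 = needs_switch_box(range(0, 5), range(0, 5))   # Reg1[0-4] to Reg2[0-4]
reg1_to_reg3 = needs_switch_box(range(4, 8), range(0, 4))   # Reg1[4-7] to Reg3[0-3]
```

The free slice positions beside Reg2 and Reg3 make up the step-shaped empty area of
Figure 3.5(b), and only the Reg1 to Reg3 connection calls for a switch box.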
In the sliced-layout architecture, the stack can be laid out in two different
styles. If the netlist contains only a few sliceable components, then we use an
unfolded stack structure as shown in Figure 3.6(a). In this case, the glue-logic
components are placed into the empty space in the stack bounding box. When
more space is needed for glue-logic components, they are placed on the left, right,
top, and bottom sides of the stack bounding box. On the other hand, if the stack
contains a large number of sliceable units with highly varying bit widths, the stack
structure can be folded as shown in Figure 3.6(b). The glue-logic components are
then placed at the sides of the stack bounding box. Furthermore, if the height of a
folded stack exceeds a given height constraint, the stack will be partitioned
into several stacks that can be either folded or unfolded.
Figure 3.5: Switch box insertion for wire alignment: (a) RT netlist, (b) floorplan,
(c) switch box insertion.
Figure 3.6: Two sliced-layout architectures: (a) unfolded stack, (b) folded stack.
3.3 Summary 
This chapter presented the FSMD model that is capable of describing most 
digital systems on the register-transfer level. To realize the FSMD design model 
at the physical level, we proposed the sliced-layout architecture that combines 
over-the-cell routing, switch-box wire alignment and the stack folding method to 
produce flexible and high-density layout. The details of layout synthesis using the 
sliced-layout architecture are described in Chapter 4. Furthermore, the theoretical 
studies of different folding techniques, including linear folding, interleaved folding 
and unrestricted folding, are addressed in [LaGW91]. 
Chapter 4 
Layout Synthesis 
This chapter describes a layout synthesis system for layout generation from
generalized register-transfer (RT) netlists. The system uses a new partitioning
approach and the sliced-layout architecture discussed in Chapter 3 to generate the
layout by considering the component layout-style, floorplan, and critical paths
simultaneously. This improves the overall area utilization and minimizes the critical
wire lengths, which in turn yields better performance.
The remainder of this chapter is organized in the following manner. Section
4.1 describes the overview of the system. Section 4.2 presents the partitioner
and floorplanner (SLAM). Sections 4.3, 4.4, and 4.5 describe three partitioning
algorithms, component partitioning, stack partitioning and glue-logic partitioning,
used in SLAM. Section 4.6 presents the experimental results. Finally, Section 4.7
concludes our approach.
Figure 4.1: The system block diagram.
4.1 System Overview
Our system [WuCG90] operates in a top-down fashion through three
partitioning phases: (1) partitioning of the RT schematic into bit-slice and glue-logic
components, (2) partitioning of bit-sliced components into several bit-sliced
stacks, and (3) partitioning of glue-logic components into groups to fit area blocks
around the stacks. The system consists of three parts: SLAM [WuGa89], ICDB
[ChGa90] and the layout system (Figure 4.1). SLAM partitions the netlist into
bit-sliced and glue-logic component sets. Each component set is partitioned further
into clusters to form the final floorplan. ICDB provides the partitioner and the
floorplanner with component information (e.g., delay and area) and layout information
(e.g., aspect ratio and I/O positions) to select the best suited layout style for each
component. The layout system consists of a bit-sliced stack generator, a glue-logic
generator and a global router to generate the final layout.
4.2 SLAM 
SLAM is a partitioner and floorplanner that takes RT netlists (VHDL netlists)
as inputs, partitions netlists into bit-slice and glue-logic groups, performs bit-sliced
stack partitioning, dissects empty space into area blocks, assigns glue-logic
components to area blocks and forms the floorplan. Using the sliced-layout architecture
described in Chapter 3, SLAM performs three partitioning phases: (1) component
partitioning, (2) stack partitioning and (3) glue-logic partitioning to generate the
final floorplan [WuGa90].
In the first phase, the component partitioner performs the component
partitioning to separate component instances into sliceable and non-sliceable sets. All
of the necessary information for each component, including type, area and
delay, is provided by the database ICDB. In the component partitioning, the
algorithm tries to minimize the total area by exploiting the regularity of components
and selecting the best suited layout style for each component. For example, a
regularly-structured component, such as an 8-bit adder, a register or a
comparator, is preferably laid out as a bit slice. On the other hand, consider a
netlist that contains a 4-bit comparator performing an equal-to function, while
the database contains a comparator bit-slice performing larger-than, less-than and
equal-to functions. This comparator bit-slice consists of 160 transistors. However,
in this design we only need a comparator performing an equal-to function, which
can be implemented using glue-logic components with 60 transistors. Thus, the
glue-logic implementation for this comparator is less costly in terms of area and
power consumption.
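The comparator trade-off amounts to picking the cheapest available implementation.
The sketch below uses transistor count as the only cost, which is an assumption (the
text also mentions power), and the function name is hypothetical; a style that is
missing from the library is marked with None.

```python
from typing import Dict, Optional


def cheapest_style(transistors: Dict[str, Optional[int]]) -> str:
    """Return the layout style with the fewest transistors among those available."""
    available = {style: n for style, n in transistors.items() if n is not None}
    return min(available, key=lambda style: available[style])


# Transistor counts for the 4-bit equal-to comparator quoted in the text.
comparator = {"Bit_Slice": 160, "Glue_Logic": 60}
choice = cheapest_style(comparator)
```

Here the glue-logic implementation wins because the library bit slice carries the
unused larger-than and less-than logic.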
In the second phase, the stack partitioner performs stack folding to minimize
the layout area of bit-sliced components. Since bit-sliced units often have varying
bit-widths, the sliced-layout architecture leaves empty space within the stack
bounding box. A folding method is used to place small units into the empty space
and thus reduce the stack height. Stack folding is a two-dimensional area-filling
process that considers both the bit-widths and the heights of the bit-sliced
units.
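To give a feel for folding as a two-dimensional filling problem, the following sketch
packs units into rows with a first-fit rule. This is a deliberately simple stand-in,
not the stack partitioning algorithm of SLAM: it ignores switch boxes, connectivity
and LSB alignment, and the unit names, widths and heights are invented.

```python
from typing import List, Tuple

Unit = Tuple[str, int, int]   # (name, bit_width, height)


def unfolded_height(units: List[Unit]) -> int:
    """Stack height when every unit occupies its own row."""
    return sum(height for _, _, height in units)


def fold_first_fit(units: List[Unit]) -> List[List[Unit]]:
    """Place units, widest first, into the first row with enough free columns."""
    stack_width = max(width for _, width, _ in units)
    rows: List[List[Unit]] = []
    free: List[int] = []                      # free bit-slice columns per row
    for unit in sorted(units, key=lambda u: u[1], reverse=True):
        width = unit[1]
        for i, room in enumerate(free):
            if width <= room:
                rows[i].append(unit)
                free[i] -= width
                break
        else:
            rows.append([unit])
            free.append(stack_width - width)
    return rows


def folded_height(rows: List[List[Unit]]) -> int:
    """Each row is as tall as its tallest unit."""
    return sum(max(height for _, _, height in row) for row in rows)


units = [("Reg1", 8, 10), ("Reg2", 5, 6), ("Reg3", 4, 6), ("Cnt", 3, 5)]
rows = fold_first_fit(units)
before, after = unfolded_height(units), folded_height(rows)
```

In this example the 3-bit counter is folded into the free columns beside Reg2, which
removes one row from the stack.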
In the third phase, the floorplanner dissects the empty space around the
stack into area blocks by considering the given aspect ratio and the size of the
glue-logic unit. In addition, the floorplanner estimates the transistor sizes of each
area block. The area estimation is provided by ICDB. Then, the glue-logic
partitioner uses a seed-based multiway partitioning to assign glue-logic components to
area blocks. The seed-based multiway partitioning algorithm is an extension of the
KLFM mincut partitioning algorithm [KeLi70, FiMa82]. In order to minimize the
wire length on the critical paths, the algorithm performs two clusterings: (1)
component clustering and (2) net clustering. In component clustering, the algorithm
groups the components on the critical path and assigns them to the area block
with the highest interconnect closeness. The net clustering is an extension of the
terminal propagation strategy [DuKe85] to take into account the external
connections between blocks. During the partitioning process, the algorithm clusters a set
of nets (i.e., seed nets) from each partitioning set. The algorithm assigns a higher
weight to the seed nets so that the components connected to the seed nets will be
attracted to their connecting nets.
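The effect of weighting seed nets can be seen in the cut cost that a min-cut
partitioner tries to reduce. The sketch below only evaluates that cost for a given
assignment; it does not implement the KLFM moves or the seed selection, and the net
names, block names and the weight value of 4 are assumptions.

```python
from typing import Dict, List, Set


def weighted_cut(nets: Dict[str, List[str]],
                 block_of: Dict[str, str],
                 seed_nets: Set[str],
                 seed_weight: float = 4.0) -> float:
    """Sum the weights of all nets whose components lie in more than one block."""
    cost = 0.0
    for net, components in nets.items():
        blocks = {block_of[c] for c in components}
        if len(blocks) > 1:                   # the net crosses a block boundary
            cost += seed_weight if net in seed_nets else 1.0
    return cost


nets = {"n1": ["g1", "g2"], "n2": ["g2", "g3"], "n3": ["g1", "g3"]}
split_a = {"g1": "A", "g2": "A", "g3": "B"}   # cuts n2 and n3
split_b = {"g1": "A", "g2": "B", "g3": "B"}   # cuts n1 and n3

plain_a = weighted_cut(nets, split_a, seed_nets=set())
plain_b = weighted_cut(nets, split_b, seed_nets=set())
seeded_a = weighted_cut(nets, split_a, seed_nets={"n2"})
seeded_b = weighted_cut(nets, split_b, seed_nets={"n2"})
```

Without seed nets the two assignments cost the same; once n2 is a seed net, the
assignment that keeps g2 and g3 together becomes clearly cheaper.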
Finally, each glue-logic block is generated using a layout generator [LiGa87]
and the stack module is generated by generators using Mentor Graphics GDT tools.
A global router then finishes the detailed routing between modules to generate the
final layout.
4.3 Component Partitioning 
The purpose of component partitioning is to determine the layout style (e.g.,
bit-slice or glue-logic) for each RT component. The component layout style
depends on the component type, the component connectivity and the overall
floorplan. First, components must be partitioned by type since some components, such
as counters, registers and ALUs, are sliceable while others, such as decoders and
encoders, are not. Second, small components can be implemented in two ways.
For example, a 2-bit ALU can be implemented using NAND and NOR gates or as
a bit-sliced unit. The implementation decision for such a component depends on
its connectivity. For instance, if a component in question is strongly connected to
other glue-logic components, then a glue-logic layout style may be more suitable
for this component in order to reduce the wiring area between the bit-sliced stack and
the glue-logic module. Third, the component layout style also depends on the final
floorplan. For example, using the folded-stack architecture, small bit-sliced units
are folded into empty space in the stack bounding box, as described in the
next section. If a small unit does not fit into the stack, this unit might better be
laid out using a glue-logic component in order to reduce the overall layout area. By
exploiting the bit-slice property of RT components and selecting the best suited
layout style for each component, the area utilization can be improved.
Figure 4.2: Graph representation of the RT netlist: (a) RT netlist, (b) its graph
representation. In panel (a), ports a, c, d and f are data ports; ports b and e are
control ports.
A weighted and labeled undirected hypergraph $G = \langle V, E \rangle$ is formed from
the schematic. $V = \{ v_i \mid i = 1..n \}$ denotes the set of components in the schematic,
and $comp\_type(v_i)$ denotes the component type of $v_i$, Glue_Logic or Bit_Slice.
$u_j(v_i)$ denotes port $j$ of $v_i$, while $port\_type(u_j(v_i))$ denotes the port type of
$u_j(v_i)$, a control port or a data port. $E = \{ e_{ij,kl} \}$ denotes the set of edges, where
$e_{ij,kl}$ is the edge between $u_j(v_i)$ and $u_l(v_k)$. In addition, $w(e_{ij,kl})$ denotes the
weight of $e_{ij,kl}$, which represents the number of wires between $u_j(v_i)$ and $u_l(v_k)$.
A graph example generated from the schematic in Figure 4.2(a) is shown
in Figure 4.2(b). There are two components in the schematic, a 4-bit Reg and a
4-bit Mux. In the graph, Reg and Mux form two nodes $v_1$ and $v_2$. Each node has
two data ports, $\{u_a(v_1), u_c(v_1)\}$ and $\{u_d(v_2), u_f(v_2)\}$, and one control port,
$u_b(v_1)$ and $u_e(v_2)$, respectively. In Figure 4.2(b), $e_{1c,2d}$ is the edge between
$u_c(v_1)$ and $u_d(v_2)$, where $w(e_{1c,2d}) = 4$.
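One possible in-memory form of this graph is sketched below. The class and method
names are hypothetical, and labeling both nodes as Bit_Slice is an assumption made
only to fill in the component-type field for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

PortId = Tuple[str, str]          # (node name, port name), e.g. ("v1", "c")


@dataclass
class NetlistGraph:
    comp_type: Dict[str, str] = field(default_factory=dict)      # node -> type label
    port_type: Dict[PortId, str] = field(default_factory=dict)   # port -> "data"/"control"
    weight: Dict[FrozenSet[PortId], int] = field(default_factory=dict)

    def add_node(self, node: str, ctype: str, ports: Dict[str, str]) -> None:
        self.comp_type[node] = ctype
        for port, kind in ports.items():
            self.port_type[(node, port)] = kind

    def add_edge(self, a: PortId, b: PortId, wires: int) -> None:
        # Edges are undirected, so the two endpoints are stored as a set.
        self.weight[frozenset((a, b))] = wires

    def w(self, a: PortId, b: PortId) -> int:
        """Number of wires between two ports (0 if they are not connected)."""
        return self.weight.get(frozenset((a, b)), 0)


g = NetlistGraph()
g.add_node("v1", "Bit_Slice", {"a": "data", "b": "control", "c": "data"})   # Reg
g.add_node("v2", "Bit_Slice", {"d": "data", "e": "control", "f": "data"})   # Mux
g.add_edge(("v1", "c"), ("v2", "d"), wires=4)                               # edge e_{1c,2d}
```

Looking up the edge from either endpoint returns the same four-wire weight, which
matches the undirected definition of the graph.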
Two linking costs, C_control(v_i) and C_data(v_i), are used to evaluate the
connectivities between components. For a node v_i with m ports, the linking costs
are

\[ C_{control}(v_i) = \sum_{j=1}^{m} w(e_{ij,kl}) \tag{4.1} \]

where port_type(u_l(v_k)) = control, and

\[ C_{data}(v_i) = \sum_{j=1}^{m} w(e_{ij,kl}) \tag{4.2} \]

where comp_type(v_k) = Bit_Slice and port_type(u_l(v_k)) = data.
The linking cost C_control(v_i) is the number of wires connected to v_i from other
Glue_Logic nodes or from control ports of other Bit_Slice nodes, while C_data(v_i)
is the number of wires connected to v_i from data ports of other Bit_Slice nodes.
For example, if v_1 in Figure 4.2(b) is a Bit_Slice and u_f(v_2) connects to another
Bit_Slice, then C_control(v_2) = 2 and C_data(v_2) = 8.
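The two linking costs can be illustrated with a minimal Python sketch. It follows the prose definition (wires from Glue_Logic nodes or from control ports count toward C_control, wires from data ports of Bit_Slice nodes count toward C_data). The edge tuple layout, the dictionaries, and the two extra nodes "ctrl" (a glue-logic block assumed to drive Sel) and "v3" (a bit-sliced unit assumed to read port f) are illustrative assumptions, not part of SLAM.

```python
def link_cost(node, edges, comp_type, port_type):
    """Return (C_control, C_data) for `node`.

    edges     : list of (node_a, port_a, node_b, port_b, wires)
    comp_type : dict node -> "Glue_Logic" or "Bit_Slice"
    port_type : dict (node, port) -> "control" or "data"
    """
    c_control = 0
    c_data = 0
    for node_a, port_a, node_b, port_b, wires in edges:
        # Find the far end of every edge that touches `node`.
        if node_a == node:
            other, other_port = node_b, port_b
        elif node_b == node:
            other, other_port = node_a, port_a
        else:
            continue
        far_end_is_control = (
            comp_type[other] == "Glue_Logic"
            or port_type[(other, other_port)] == "control"
        )
        if far_end_is_control:
            c_control += wires
        else:
            # Remaining case: a data port of another Bit_Slice node.
            c_data += wires
    return c_control, c_data


# Figure 4.2(b) plus two hypothetical neighbours ("ctrl" and "v3").
comp_type = {"v1": "Bit_Slice", "v2": "Bit_Slice",
             "v3": "Bit_Slice", "ctrl": "Glue_Logic"}
port_type = {("v1", "c"): "data", ("v2", "d"): "data",
             ("v2", "e"): "control", ("v2", "f"): "data",
             ("v3", "in"): "data", ("ctrl", "sel"): "control"}
edges = [("v1", "c", "v2", "d", 4),      # Reg output to Mux input, 4 wires
         ("ctrl", "sel", "v2", "e", 2),  # Sel, 2 wires
         ("v2", "f", "v3", "in", 4)]     # Mux output to another bit-sliced unit

print(link_cost("v2", edges, comp_type, port_type))  # (2, 8)
```

Under these assumptions the sketch reproduces the values quoted for v_2 in the example above.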
Algorithm 4.1 describes the component partitioning procedure. The input
to the algorithm is an RT netlist (S) with n components. The procedure
build_graph(S) builds the RT-netlist graph. The function area_estimation(v_i)
returns the cheapest implementation of component v_i (i.e., Glue_Logic or Bit_Slice).
The procedure link_cost(v_i) returns the linking costs of component v_i. The
component partitioning is divided into four steps:
1. Initial component type assignments. If bit-slice implementations are not
available or the glue-logic implementation is cheaper for the nodes, the
algorithm first labels such nodes as Glue_Logic. Otherwise, the nodes that meet
the following two conditions are labeled as Bit_Slice: (1) the component can
be laid out as a bit-slice and (2) the component's bit-width is larger than a
user-specified threshold. If these conditions are not met, the algorithm will
label the nodes as "undecided".
2. Type assignments for the undecided nodes. Since the undecided nodes can
be laid out as a bit-slice or a glue-logic component, the algorithm takes into
account the connectivity by calculating the linking cost of undecided nodes.
For an undecided node v_i, if C_control(v_i) > C_data(v_i) then node v_i is labeled
as Glue_Logic, otherwise node v_i is labeled as Bit_Slice.
3. Final assignment. In this step, the algorithm re-evaluates the connectivities
among nodes to finalize the component type assignments. The algorithm first
calculates the linking cost for all of the nodes. Then the algorithm evaluates
the connectivities of nodes that can be laid out as both Glue_Logic and
Bit_Slice. For evaluating a node v_i, there are three possible cases: (1) if
C_control(v_i) > C_data(v_i) and node v_i is a Bit_Slice then node v_i is re-labeled
as Glue_Logic, (2) if C_data(v_i) > C_control(v_i) and node v_i is a Glue_Logic then
node v_i is re-labeled as Bit_Slice and (3) if neither condition (1) nor (2)
applies, node v_i keeps its original component type.
4. Reassignment during folding. During the stack folding stage, the algorithm
may re-label nodes as Glue_Logic if the components cannot fit into the
bit-sliced stack. Details of this phase will be described in the next section.
Algorithm 4.1. Component Partitioning.
Let
    S be the register-transfer netlist;
    BW be the minimum bit-width for a bit-slice implementation;
Component_Partitioning(S) {
    build_graph(S);
    /* Initial component type assignment */
    for i = 1 to n do {
        if (v_i is not sliceable) then
            comp_type(v_i) = Glue_Logic;
        else if (v_i is sliceable) then {
            type = area_estimation(v_i);
            if (type = Glue_Logic) then
                comp_type(v_i) = Glue_Logic;
            else if (type = Bit_Slice) then {
                if (bitwidth(v_i) > BW) then
                    comp_type(v_i) = Bit_Slice;
                else
                    comp_type(v_i) = undecided;
            }
        }
    };
    /* Assign component type to the undecided nodes */
    for i = 1 to n do {
        if (comp_type(v_i) = undecided) then {
            {C_control(v_i), C_data(v_i)} = link_cost(v_i);
            if (C_control(v_i) > C_data(v_i)) then
                comp_type(v_i) = Glue_Logic;
            else
                comp_type(v_i) = Bit_Slice;
        }
    }
    /* Final assignment */
    for i = 1 to n do {
        {C_control(v_i), C_data(v_i)} = link_cost(v_i);
        if (v_i can be laid out as both Bit_Slice and Glue_Logic) then {
            if (C_control(v_i) > C_data(v_i) AND comp_type(v_i) = Bit_Slice) then
                comp_type(v_i) = Glue_Logic;
            else if (C_data(v_i) > C_control(v_i) AND
                     comp_type(v_i) = Glue_Logic) then
                comp_type(v_i) = Bit_Slice;
        }
    }
    Reassignments during folding (see Algorithm 4.2);
}
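The first three steps of Algorithm 4.1 can be sketched in Python as follows. This is a simplified illustration and not the SLAM implementation: the node attributes ("sliceable", "cheapest", "bitwidth") are hypothetical names, "sliceable" is taken to mean that the node can be laid out in both styles, and link_cost is passed in as a callback so that any cost model can be plugged in. The example uses fixed, made-up linking costs; in a real netlist they would depend on the current labels of the neighbouring nodes.

```python
def partition_components(nodes, link_cost, min_bitwidth):
    """Label every node as "Glue_Logic" or "Bit_Slice".

    nodes        : dict name -> {"sliceable": bool,
                                 "cheapest": "Glue_Logic" | "Bit_Slice",
                                 "bitwidth": int}
    link_cost    : callable (name, comp_type) -> (c_control, c_data)
    min_bitwidth : the threshold BW of Algorithm 4.1
    """
    comp_type = {}

    # Step 1: initial component type assignment.
    for name, info in nodes.items():
        if not info["sliceable"] or info["cheapest"] == "Glue_Logic":
            comp_type[name] = "Glue_Logic"
        elif info["bitwidth"] > min_bitwidth:
            comp_type[name] = "Bit_Slice"
        else:
            comp_type[name] = "undecided"

    # Step 2: resolve the undecided nodes by connectivity.
    for name in nodes:
        if comp_type[name] == "undecided":
            c_control, c_data = link_cost(name, comp_type)
            comp_type[name] = "Glue_Logic" if c_control > c_data else "Bit_Slice"

    # Step 3: final assignment for nodes that admit both layout styles.
    for name, info in nodes.items():
        if not info["sliceable"]:
            continue
        c_control, c_data = link_cost(name, comp_type)
        if c_control > c_data and comp_type[name] == "Bit_Slice":
            comp_type[name] = "Glue_Logic"
        elif c_data > c_control and comp_type[name] == "Glue_Logic":
            comp_type[name] = "Bit_Slice"
    return comp_type


# Made-up netlist with fixed (C_control, C_data) values per node.
nodes = {
    "alu":     {"sliceable": True,  "cheapest": "Bit_Slice",  "bitwidth": 16},
    "flag":    {"sliceable": True,  "cheapest": "Bit_Slice",  "bitwidth": 2},
    "mux":     {"sliceable": True,  "cheapest": "Glue_Logic", "bitwidth": 8},
    "decoder": {"sliceable": False, "cheapest": "Glue_Logic", "bitwidth": 1},
}
costs = {"alu": (3, 32), "flag": (5, 2), "mux": (1, 16), "decoder": (8, 0)}
labels = partition_components(nodes, lambda name, types: costs[name], min_bitwidth=4)
```

In this toy run "flag" is resolved in step 2 and "mux" is re-labeled from Glue_Logic to Bit_Slice in step 3 because its data connectivity dominates.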
Complexity analysis: It takes O(n + m) time to build the graph, where n is the
number of components and m is the number of edges in the netlist. In addition,
it takes O(n) time each for the initial type assignment, the type assignment for
undecided nodes and the final assignment. Therefore, the complexity of the
component partitioning algorithm is O(n + m).
4.4 Stack Partitioning 
The goal of stack partitioning is to minimize the layout area of the bit-sliced 
components. Since bit-sliced units often have varying bit-widths, the sliced-layout 
architecture generates an empty space within the stack bounding box. A folding 
method is used to place small units into the empty space and thus reduce the stack 
height. 
Using the folding method, we describe a stack-partitioning algorithm for
minimizing the area of the bit-sliced units as follows. The algorithm first
calculates the routing-area cost between unfolded and folded units. The routing area
is proportional to the number of wires between unfolded and folded units. For
instance, the number of wires crossing the cutline between CompA and CompB is
four (Figure 4.3(a)). By folding at the cutline between CompA and CompB, the
routing area will contain four wires (Figure 4.3(e)). The number of wires crossing
any cutline can be determined after the unit placement and routing. The unit
placement and routing consists of three steps: (1) sort the units by width, (2)
permute the order of the units with the same width to minimize the track density
and (3) assign connections to the routing tracks. Thereafter, the number of wires
between units can be computed.
[Figure 4.3: Stack partitioning based on folding: (a) linear placement, (b) unit
folding, (c) overlap checking, (d) height and width constraint checking, (e) width
compression, (f) stack area (area versus number of folded units).]
The algorithm uses a folding method to minimize the area of the bit-sliced
units subject to the given stack height and width requirements. The folding process
includes two phases: unit folding, and overlap checking and avoidance. The main
constraint of stack folding is that bit-sliced units must not overlap. The algorithm
folds one unit at a time. Unit folding itself includes two steps: (1) move the
unit to the right edge of the stack's bounding box and rotate it around the center
(CompB in Figure 4.3(b)) and (2) push all of the folded units above the base-line
(Figure 4.3(c)).
After unit folding, an overlap check is performed to determine whether the
units in the folded part overlap with the unfolded part. The bounding box of unit
u_i is defined by the upper-left point (x_{ul.i}, y_{ul.i}) and the lower-right point
(x_{lr.i}, y_{lr.i}) of unit u_i (Figure 4.3(a)). An overlap occurs if there exist
u_i ∈ {unfolded bit-sliced units} and u_j ∈ {folded bit-sliced units} such that
x_{ul.j} < x_{lr.i} and y_{ul.j} < y_{lr.i}.
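The overlap condition can be written directly as a small Python predicate. This is only a sketch of the two-inequality test stated above; the Box type and its field names are illustrative, and the test is not a general rectangle-intersection check, since it relies on the placement produced by the folding step (folded units sit to the right of the unfolded ones).

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    """Bounding box of a unit: upper-left and lower-right corner."""
    x_ul: float
    y_ul: float
    x_lr: float
    y_lr: float


def overlap_check(unfolded: Iterable[Box], folded: Iterable[Box]) -> bool:
    """True if some folded unit u_j and unfolded unit u_i satisfy
    x_ul.j < x_lr.i and y_ul.j < y_lr.i."""
    folded = list(folded)
    return any(f.x_ul < u.x_lr and f.y_ul < u.y_lr
               for u in unfolded for f in folded)


unfolded_units = [Box(0, 0, 8, 4)]
print(overlap_check(unfolded_units, [Box(6, 2, 10, 6)]))  # True
print(overlap_check(unfolded_units, [Box(8, 2, 12, 6)]))  # False
```

Shifting the folded unit to the right until its left edge reaches the right edge of the unfolded unit clears the overlap, which is the adjustment the algorithm attempts first.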
If an overlap occurs during the folding process, the folded units will be shifted
to the right to avoid the overlap until the given width constraint is reached. If the
overlap still exists, the overlapping folded units will be removed to form a new stack.
In Figure 4.3(d), CompC overlaps with unfolded units. Thus, it will be removed
to form a separate stack. Moreover, if the stack height is taller than the given
height constraint, then the units exceeding the given height are moved to form a
new stack. For example, CompA and CompB in Figure 4.3(d) will be removed
to form a separate stack. In this case, the cut line will be placed at the height
constraint baseline.
In each folding iteration, the algorithm computes the total stack area, which
is the area of the minimum bounding box covering all the units and wires. For
multiple stacks, the total area is computed as the sum of the bounding boxes of the
individual stacks and the routing area between stacks. The algorithm folds the units
one at a time and selects the stack partition with the minimum area. For instance,
Figure 4.3(f) shows an area curve that was generated by executing the folding
process repeatedly. Each data point represents the total area for a particular stack
partition. For instance, the area data-point 1 in Figure 4.3(f) indicates the area
of the bounding box covering the unfolded stack, as shown in Figure 4.3(a). The
partition with the minimal area (e.g., partition #4) was selected as the final stack
partition.
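The selection of the final stack partition from the area curve can be sketched as follows. The candidate tuples, the widths, heights and routing areas are made-up numbers chosen only so that partition 4 comes out smallest, mirroring the example in Figure 4.3(f); they are not measurements from SLAM.

```python
def total_area(stack_boxes, routing_area):
    """Sum of the stack bounding boxes plus the inter-stack routing area.

    stack_boxes  : list of (width, height), one per stack
    routing_area : area reserved for wires between stacks
    """
    return sum(width * height for width, height in stack_boxes) + routing_area


def best_partition(candidates):
    """Return the candidate (label, stack_boxes, routing_area) of minimum area."""
    return min(candidates, key=lambda c: total_area(c[1], c[2]))


# One candidate per folding iteration (hypothetical values).
candidates = [
    (1, [(10, 40)], 0),   # unfolded stack
    (2, [(12, 30)], 10),
    (3, [(14, 24)], 20),
    (4, [(16, 20)], 30),
    (5, [(20, 18)], 40),
]
label, boxes, routing = best_partition(candidates)
print(label, total_area(boxes, routing))  # 4 350
```

Because every folding iteration is evaluated, the choice is a simple minimum over the curve rather than a greedy stop at the first increase.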
In each folding pass, the components that were deleted and the leftover
components are combined to produce the next stack using the same algorithm. Moreover,
the stand-alone or leftover small bit-sliced units that do not fit in any stack will
be moved to the glue-logic module. For example, assume that the example in
Figure 4.3(d) has the minimal area when CompC is moved to form a new stack. Since
CompC is a stand-alone unit, it will be relabeled as Glue_Logic.
After selecting the stack partition, the algorithm performs a width compression
to reduce the empty space between the unfolded and folded units (Figure 4.3(e)).
The empty space in the bounding box can be used for placing the glue-logic
components, as described in the next section.
Algorithm 4.2 describes the stack partitioning. The input to the algorithm
is a set of bit-sliced units. The procedure place_and_route() places units using
the min-cut algorithm [KeLi70] and assigns routing tracks using the left-edge
algorithm [HaSt71]. The function overlap_check() returns "1" when the units in the
unfolded and folded stacks overlap; otherwise, it returns "0". The procedure
unit_shift_adjustment() arranges the units in the folded stack to avoid overlap.
Algorithm 4.2. Stack Partitioning.
Let
    U = {u_i | i = 1..n} be a set of bit-sliced units;
    U_unfold = {u_j | j = 1..m} be a set of unfolded units;
    U_fold = {u_k | k = 1..p} be a set of folded units;
    U_new = {u_l | l = 1..r} be a set of units for the new stack;
    U_unfold_final be the final set of unfolded units;
    U_fold_final be the final set of folded units;
    W_constraint be the width constraint;
    H_constraint be the height constraint;
    W_stack be the stack width;
    H_stack be the stack height;
    A_new be the total area of a new partition;
    A_min be the minimum total area;
Stack_Partitioning(U) {
    place_and_route(U);
    A_min = area_estimation(U);
    if (empty space is too large for glue-logic components OR
        H_stack > H_constraint) then {
        U_unfold = U;
        U_fold = ∅;
        for i = 1 to n do {
            /* folding u_i */
            U_unfold = U_unfold - {u_i};
            U_fold = U_fold ∪ {u_i};
            overlap = overlap_check(U_unfold, U_fold);
            if (overlap = true) then {
                unit_shift_adjustment(U_fold);
                overlap = overlap_check(U_unfold, U_fold);
                if (overlap = true) then
                    U_new = U_new ∪ {u_i ∈ U_fold AND u_i overlaps with u_j ∈ U_unfold};
            }
            if (H_stack > H_constraint) then
                U_new = U_new ∪ {u_i ∈ {U_fold, U_unfold}
                                 and u_i exceeds the height constraint};
            A_new = area_estimation(U_new);
            if (A_min > A_new AND H_constraint ≥ H_stack) then {
                U_fold_final = U_fold;
                U_unfold_final = U_unfold;
                A_min = A_new;
            }
        }
        if (U_new ≠ ∅ AND U_new contains only small units) then
            Glue_Logic = Glue_Logic ∪ U_new;
        else
            Stack_Partitioning(U_new);
    }
}
Complexity analysis: For each stack folding process, the algorithm folds at most n
units. The overlap checking procedure takes O(n) time. Therefore, the algorithm
takes O(n^2) time for each stack folding process. Since at most n stacks can be
generated, the complexity of the stack partitioning algorithm is O(n^3).
4.5 Glue-Logic Partitioning 
After forming the bit-sliced stack, a glue-logic partitioning is performed to
assign the glue-logic components into clusters and to fill the empty area around
the stack. The glue-logic partitioning algorithm consists of two phases: capacity
estimation and iterative partitioning. In the capacity estimation phase, the
algorithm partitions the layout area into area blocks to satisfy the required height,
width and aspect ratio. In addition, the algorithm estimates the transistor capacity
of each area block.
In the iterative phase, the algorithm uses a seed-based multiway partitioning
to assign glue-logic components into the area blocks. Seed-based partitioning
reduces total wire length in two ways: it clusters the components on the critical
path and it moves the components toward their connecting ports. The algorithm
runs iteratively and selects the partition with the minimum total area as the final
floorplan.
4.5.1 Definitions 
Let S(T) denote the size of glue-logic components in number of transistors.
T_critical-path denotes a set of components on the critical path from an input port
to an output port in the design.
Let B = {b_i | i = 1..k} denote a set of orthonormal edges defining a constraint
area (usually used by a datapath stack) not available for the layout of glue-logic.
Similarly, let B_F = {b_TOP, b_BOTTOM, b_LEFT, b_RIGHT} denote the rectangle of the
final module layout. A set of ports on an edge b_i is denoted by P(b_i).
The area A around the constraint area is partitioned into rectangular area
blocks A = {a_i | i = 1..j}. Each a_i is defined by a 4-tuple <b_ir, b_il, b_it, b_ib>, where
b_ir, b_il, b_it and b_ib are the right, left, top and bottom edges of the block. In addition,
c(a_i) denotes the transistor capacity of block a_i.
The topological relations between area blocks, the constraint area and module
boundary are specified by an adjacency graph G(V, E), where V = B ∪ B_F ∪ A
and e_ij ∈ E if and only if two vertices v_i and v_j have a common edge. The term
w(e_ij) denotes the number of wires crossing the edge e_ij.
The floorplan example in Figure 4.4(a) consists of one constraint rectilinear
area B = {b_1, ..., b_8} and one module area B_F = {b_TOP, b_BOTTOM, b_LEFT, b_RIGHT} with
the given width (W_constraint) and height (H_constraint). The area around B and inside
B_F is partitioned into 6 area blocks, that is, A = {a_1, ..., a_6}. This example will be used
throughout this section. Figure 4.4(b) shows the adjacency graph of the floorplan
in Figure 4.4(a). The adjacency graph in Figure 4.4(b) consists of 6 block vertices
(A = {a_1, a_2, ..., a_6}) corresponding to the area blocks in Figure 4.4(a). There are
12 boundary vertices corresponding to the 8 edges of the constraint rectilinear
area (B = {b_1, b_2, ..., b_8}) and the 4 module edges (B_F = {b_TOP, b_BOTTOM, b_LEFT
and b_RIGHT}). In addition, P(b_3) = {p_1}, P(b_4) = {p_2} and P(b_RIGHT) = {p_3, p_4}.
The adjacency edges connect vertices with common boundaries. For example in
Figure 4.4(b), there is an edge between the boundary vertex b_3 and the block
vertex a_1 because these two vertices have a common boundary in the floorplan.
Since there is a port p_1 on the boundary vertex b_3, w(b_3, a_1) = 1.
[Figure 4.4: The adjacency graph formation: (a) a floorplan example, (b) its
corresponding adjacency graph. The legend marks ports, boundary vertices and
block vertices.]
4.5.2 Capacity Estimation 
In the capacity estimation phase, the algorithm first dissects the empty area
into area blocks satisfying the height (H_constraint ≥ H_module), width (W_constraint ≥
W_module) and aspect ratio (Aspect_Ratio_constraint = W_module / H_module) constraints,
where H_module and W_module are the actual height and width of the layout module.
Then the algorithm determines the transistor capacity of each area block. There are
five possible area blocks (Figure 4.5): In_block, Left_block, Right_block, Top_block
and Bottom_block.
In our implementation, we use a strip-layout style for glue-logic layout
generation. In the strip-layout architecture, P and N transistors are placed in separate
rows where a pair of P and N transistor rows is called a strip. For an area block,
transistors can be placed into rows with vertical or horizontal orientation. The
area estimation is provided by an estimator embedded in the database (ICDB)
[ChGa90]. Our area models formulate area estimation as a function of transistor
and wire density. Given a netlist and an area block, the estimator performs
min-cut partitioning to estimate wiring density between transistor rows and provides
the transistor capacity of the area block.
The algorithm first dissects the empty area in the bounding box In_block into
rectangles. Since the number of ports on the edges of the constraint area is given,
the global routing area in the bounding box can be estimated. For example in
Figure 4.5, the empty area is dissected into two rectangles a_1 and a_2 with heights
and widths (h_1, w_1) and (h_2, w_2), respectively. Both horizontal and vertical layout
orientations are tried on each block. The algorithm selects the one with the highest
area utilization.
After estimating the transistor capacity of the empty area in the bounding
box, the algorithm then estimates the transistor capacity of the outer blocks. For
simplicity, we assume the initial capacities of Left_block and Top_block are zero. As
a result, the algorithm only needs to estimate the transistor capacity of Right_block
and Bottom_block. The algorithm places transistor rows into Right_block and
Bottom_block one strip at a time according to the given aspect ratio. Finally, the
algorithm estimates the transistor capacity of Right_block and Bottom_block.
[Figure 4.5: Area dissection and capacity estimation.]
Algorithm 4.3 describes the capacity estimation. The input to the algorithm
is a set of area blocks. The procedure transistor_capacity_estimation(a_i) returns
the maximum number of transistors that can be placed in the given area block
a_i. The procedure area_estimation() returns the overall module height and width
including the bit-sliced stack and glue-logic blocks.
Algorithm 4.3. Capacity Estimation.
Let
    A = {a_i | i = 1..n} be a set of area blocks;
    c(a_i) be the transistor capacity of area block a_i;
    m be the number of rectangles in the stack bounding box;
    t be the number of transistors not being placed yet;
    S_T be the total number of transistors in the glue-logic;
    t_strip be the number of transistors in one strip;
    H_module be the module height;
    W_module be the module width;
    H_bounding-box be the height of stack bounding box;
    W_bounding-box be the width of stack bounding box;
    Aspect_Ratio_constraint be the aspect ratio constraint;
capacity_estimation(A) {
    t = S_T;
    for i = 1 to m do {
        c(a_i) = transistor_capacity_estimation(a_i);
        t = t - c(a_i);
    }
    c(Left_block) = 0;
    c(Top_block) = 0;
    c(Right_block) = 0;
    c(Bottom_block) = 0;
    W_module = W_bounding-box;
    H_module = H_bounding-box;
    while (t > 0) do {
        if (Aspect_Ratio_constraint > W_module / H_module) then {
            /* place transistor rows into right_block */
            t_strip = transistor_capacity_estimation(H_module);
            c(Right_block) = c(Right_block) + t_strip;
        }
        else {
            /* place transistor rows into bottom_block */
            t_strip = transistor_capacity_estimation(W_module);
            c(Bottom_block) = c(Bottom_block) + t_strip;
        }
        t = t - t_strip;
        /* estimate the module height and width for the new partition */
        W_module = area_estimation(W_bounding-box, c(Right_block));
        H_module = area_estimation(H_bounding-box, c(Bottom_block));
    }
}
Complexity analysis: For m rectangles in the bounding box, it takes O(m) time
to estimate the transistor capacity of the m rectangles. For the transistor capacity
estimations of Right_block and Bottom_block, it takes approximately O(n) time,
where n = t/t_strip. Thus, the complexity of the capacity estimation algorithm is
O(m + n).
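The strip-by-strip growth of Right_block and Bottom_block in Algorithm 4.3 can be sketched in Python under a deliberately simple area model. The model is an assumption made for illustration only: every strip has a fixed thickness (strip_pitch) and holds a number of transistors proportional to its length (transistors_per_unit_length). The ICDB estimator used by SLAM is replaced by these two hypothetical parameters.

```python
def capacity_estimation(total_transistors, inner_capacities, w_box, h_box,
                        aspect_ratio, strip_pitch, transistors_per_unit_length):
    """Return (c_right, c_bottom, w_module, h_module).

    inner_capacities : capacities of the rectangles inside the bounding box
    aspect_ratio     : required W_module / H_module
    """
    t = total_transistors - sum(inner_capacities)  # transistors still unplaced
    c_right = c_bottom = 0
    w_module, h_module = w_box, h_box
    while t > 0:
        if aspect_ratio > w_module / h_module:
            # Module is too narrow: add a vertical strip to Right_block.
            t_strip = int(h_module * transistors_per_unit_length)
            c_right += t_strip
            grow_width = True
        else:
            # Otherwise add a horizontal strip to Bottom_block.
            t_strip = int(w_module * transistors_per_unit_length)
            c_bottom += t_strip
            grow_width = False
        if t_strip <= 0:
            raise ValueError("a strip must hold at least one transistor")
        t -= t_strip
        # Update the module size for the new partition.
        if grow_width:
            w_module += strip_pitch
        else:
            h_module += strip_pitch
    return c_right, c_bottom, w_module, h_module


result = capacity_estimation(total_transistors=1000, inner_capacities=[200, 100],
                             w_box=100, h_box=100, aspect_ratio=1.0,
                             strip_pitch=10, transistors_per_unit_length=2.0)
print(result)  # (460, 420, 120, 120)
```

With a square bounding box and a target aspect ratio of 1.0 the loop alternates between the two blocks, so the module stays square while the remaining 700 transistors are absorbed in four strips.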
4.5.3 Iterative Partitioning 
Seed-Based Multiway Partitioning 
The seed-based multiway partitioning algorithm is an extension of the KLFM
min-cut partitioning algorithm. Let A = {a_i | i = 1..n} be the set of pre-defined
area blocks and c(a_i) be the transistor capacity of area block a_i. The total number
of transistors of the glue-logic components is

\[ S(T) \le \sum_{i=1}^{n} c(a_i). \]

The algorithm performs min-cut partitioning repeatedly based on the cut-set size
of the partition (C_a, C_b), where

\[ C_a = c(a_i), \qquad C_b = \sum_{j=i+1}^{n} c(a_j), \qquad i = 1..n-1. \]
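The sequence of cut-set sizes used by the repeated bipartitioning can be computed as in the short Python sketch below. The capacity values are hypothetical; only the definition of (C_a, C_b) comes from the text.

```python
def cut_set_capacities(capacities):
    """Return [(C_a, C_b)] for i = 1..n-1, where C_a = c(a_i) and
    C_b is the summed capacity of the blocks that follow a_i."""
    remaining = sum(capacities)
    pairs = []
    for c in capacities[:-1]:
        remaining -= c
        pairs.append((c, remaining))
    return pairs


# Four area blocks with made-up transistor capacities.
print(cut_set_capacities([300, 200, 400, 100]))
# [(300, 700), (200, 500), (400, 100)]
```

Each pair describes one min-cut pass: block a_i is filled up to C_a, and everything else is deferred to the blocks that have not been filled yet.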
In order to minimize the wire length on the critical paths, all the
components on the critical paths are clustered together. These components are called
seed components. Moreover, the components connected to ports on the constraint
area and the module boundaries are placed close to those ports. The nets
connected to the ports on the constraint-area edges and the module edges are called
seed nets. The seed-based approach is an extension of the terminal propagation
strategy [DuKe85] that takes into account the external connections between blocks.
The algorithm performs seed-net clustering for both cut-sets before the
partitioning process takes place. Using the hierarchical-clustering technique [John67], the
algorithm is capable of successively fusing seed nets into clusters for each cut-set
at different stages of the partitioning process. The seed clustering is divided into
two parts: seed-component clustering and seed-net clustering.
In the component clustering, the algorithm first groups the components on
the critical path (T_critical-path). Then, the algorithm evaluates the connectivities
between the nets in T_critical-path and their connecting ports. The components in
T_critical-path will be placed into the area block with the maximum "closeness" cost.
To take external connections into account, the net clustering determines the
seed nets for both cut-sets. During the partitioning process, the seed nets will
be assigned a higher weight to enhance the connectivity. This approach pulls
the components connected to the seed nets toward their connecting ports. For
example in Figure 4.6, the area block a_4 in Cut-Set A is the current area block to
be partitioned. The module right edge, b_RIGHT, is solely adjacent to the right edge
of a_4. The components connected to the ports on b_RIGHT are preferably
assigned to the area block a_4. Otherwise, if these components are assigned to the
other area blocks, the routing length will increase. Therefore, the nets connected to
the ports on b_RIGHT will be assigned to Cut-Set A as seed nets. The net clustering
consists of two steps: cut-set adjacency graph formation and net clustering.
[Figure 4.6: Cut-set adjacency graph.]
In the first step, the algorithm transforms the adjacency graph into cut-set
groups. There are three cut-set groups: (1) Cut-Set A is the current area block
to be filled (in Figure 4.6, Cut-Set A = {a_4}), (2) Cut-Set B consists of a set of
currently empty area blocks (Cut-Set B = {a_2, a_3, a_5, a_6}) and (3) a filled-block set,
B_filled = {b_filled(i) | i = 1..n}, which contains all of the filled blocks with their
adjacency boundaries. For example in Figure 4.6, b_filled(1) = {a_1, b_3, b_4} contains a
filled block {a_1} and its adjacency boundaries {b_3, b_4}.
In the second step, the algorithm determines seed nets for both cut-sets. The
nets and ports that are solely adjacent to a certain cut-set are called the seed nets
of that cut-set. For example in Figure 4.6, b_2 and b_RIGHT are solely adjacent to
the block vertex a_4 in Cut-Set A. Thus, the nets and ports of b_2 and b_RIGHT will be
assigned as the seed nets of Cut-Set A. On the other hand, {b_1, b_5, b_6, b_7, b_8, b_LEFT}
are solely adjacent to the block vertices in Cut-Set B. Thus, the nets and ports of
{b_1, b_5, b_6, b_7, b_8, b_LEFT} are the seed nets of Cut-Set B.
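The "solely adjacent" rule of the second step can be expressed as a subset test on the adjacency graph. In the Python sketch below the adjacency table is hypothetical: the actual edges of Figure 4.6 are not recoverable from the text, so the table was invented so that the outcome agrees with the example above. The function name and data layout are likewise illustrative.

```python
def seed_boundaries(adjacency, cut_set_a, cut_set_b):
    """Split boundary vertices into seeds of Cut-Set A and seeds of Cut-Set B.

    adjacency : dict boundary vertex -> set of adjacent block vertices
    A boundary is a seed of a cut-set only if all of its adjacent blocks
    belong to that cut-set; boundaries touching filled blocks or both
    cut-sets are seeds of neither.
    """
    seeds_a, seeds_b = set(), set()
    for boundary, blocks in adjacency.items():
        if blocks and blocks <= cut_set_a:
            seeds_a.add(boundary)
        elif blocks and blocks <= cut_set_b:
            seeds_b.add(boundary)
    return seeds_a, seeds_b


# Hypothetical adjacency, consistent with the example in the text.
adjacency = {
    "b1": {"a2"}, "b2": {"a4"}, "b3": {"a1"}, "b4": {"a1"},
    "b5": {"a2"}, "b6": {"a3"}, "b7": {"a3"}, "b8": {"a5"},
    "bLEFT": {"a5", "a6"}, "bRIGHT": {"a4"},
    "bTOP": {"a4", "a6"}, "bBOTTOM": {"a3", "a4"},
}
seeds_a, seeds_b = seed_boundaries(adjacency, {"a4"}, {"a2", "a3", "a5", "a6"})
```

Here b3 and b4 are skipped because they border the filled block a1, and bTOP and bBOTTOM are skipped because they border blocks in both cut-sets.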
The Algorithm 
In the capacity estimation phase, the algorithm determines the transistor
capacity of area blocks. Initially, the algorithm assumes c(Top_block) and c(Left_block)
are zero. However, the transistor rows in Bottom_block can also be placed in
Top_block. Further, the transistor rows in Right_block can be placed in Left_block.
In this phase, the algorithm uses an iterative partitioning method to find the
minimum area partition by rearranging the capacity of area blocks.
Based on the example in Figure 4.4, the algorithm initially assigns c(a_6) = 0,
c(a_4) = c(Right_block), c(a_3) = c(Bottom_block) and c(a_5) = 0. The algorithm then
rearranges the capacity of area blocks during partitioning by assigning strips from
a_4 to a_6 and from a_3 to a_5. Finally, in order to minimize routing, the algorithm
performs seed-based multiway partitioning to assign components into area blocks
according to the assigned transistor capacity of area blocks. After each partitioning
iteration, the total layout area is calculated. The total layout area consists of three
parts:
1. Constraint area, Area_constraint-area, which is the area of bit-sliced stacks or
macrocells.
2. Glue-logic area. After partitioning, the glue-logic components are placed
into a set of area blocks A = {a_i | i = 1..n}. The area for each block a_i,
Area_glue-logic-block(i), is estimated.
3. Routing area. After partitioning, the glue-logic components are placed into
a set of area blocks. The cutlines crossing the boundary are estimated. The
routing area, Area_global-routing, between two area blocks is calculated in terms
of the number of cutlines crossing these two blocks. The total layout area is

\[ \mathit{Total\_Area} = \mathit{Area}_{constraint\text{-}area} + \sum_{i=1}^{n} \mathit{Area}_{glue\text{-}logic\text{-}block(i)} + \mathit{Area}_{global\text{-}routing} \tag{4.3} \]
Algorithm 4.4 describes the glue-logic partitioning. The input to the
algorithm is a set of area blocks and a glue-logic netlist. The procedure
capacity_estimation() is described in Algorithm 4.3. The procedure
Seed_based_multiway_partition() assigns glue-logic components into area blocks.
The function total_area_calculation() returns the total area cost using Equation 4.3.
The set M keeps track of the set of area blocks with the minimum total area. The
algorithm runs iteratively and selects the partition with the minimum total area as
the final floorplan.
Algorithm 4.4. Glue-Logic Partitioning.
Let
    GL be a glue-logic netlist;
    A = {a_i | i = 1..n} be a set of area blocks;
    M = {a_i | i = 1..n} be a set of area blocks for the final floorplan;
    c(bottom_tr_row) denote the transistor capacity of one strip in Bottom_block;
    c(right_tr_row) denote the transistor capacity of one strip in Right_block;
Glue_Logic_Partitioning(A, GL) {
    /* initial partitioning */
    /* Determine c(In_block), c(Right_block) and c(Bottom_block) */
    capacity_estimation(A);
    Build adjacency graph;
    c(a_4) = c(Right_block);
    c(a_3) = c(Bottom_block);
    Seed_based_multiway_partition(A, GL);
    Total_Area = total_area_calculation();
    /* iterative partitioning */
    while (c(a_4) > 0) {
        c(a_6) = c(a_6) + c(right_tr_row);
        c(a_4) = c(a_4) - c(right_tr_row);
        Seed_based_multiway_partition(A, GL);
        Total_Area_new-partition = total_area_calculation();
        if (Total_Area > Total_Area_new-partition) {
            M = {a_i | i = 1..n};
            Total_Area = Total_Area_new-partition;
        }
    }
    while (c(a_3) > 0) {
        c(a_5) = c(a_5) + c(bottom_tr_row);
        c(a_3) = c(a_3) - c(bottom_tr_row);
        Seed_based_multiway_partition(A, GL);
        Total_Area_new-partition = total_area_calculation();
        if (Total_Area > Total_Area_new-partition) {
            M = {a_i | i = 1..n};
            Total_Area = Total_Area_new-partition;
        }
    }
}
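The iterative phase of Algorithm 4.4 can be sketched in Python with the partitioner and the area calculation abstracted into a single callback. Everything besides the loop structure is an assumption: partition_and_measure stands in for Seed_based_multiway_partition() followed by total_area_calculation(), the toy cost function is invented purely to make the example runnable, and the last strip is clamped with min() so that capacities never go negative (the pseudocode subtracts a full strip).

```python
def glue_logic_partitioning(capacity, strip_right, strip_bottom, partition_and_measure):
    """Move strips from a4 to a6, then from a3 to a5, and keep the best split.

    capacity              : dict with keys "a3", "a4", "a5", "a6"
    strip_right/bottom    : transistor capacity of one strip (must be > 0)
    partition_and_measure : callable (capacity dict) -> total layout area
    """
    capacity = dict(capacity)
    best, best_area = dict(capacity), partition_and_measure(capacity)
    moves = (("a4", "a6", strip_right), ("a3", "a5", strip_bottom))
    for source, target, strip in moves:
        while capacity[source] > 0:
            moved = min(strip, capacity[source])
            capacity[target] += moved
            capacity[source] -= moved
            area = partition_and_measure(capacity)
            if area < best_area:
                best, best_area = dict(capacity), area
    return best, best_area


def toy_area(cap):
    # Invented cost: a fixed area plus a penalty for unbalanced opposite blocks.
    return 1000 + abs(cap["a4"] - cap["a6"]) + abs(cap["a3"] - cap["a5"])


start = {"a3": 300, "a4": 400, "a5": 0, "a6": 0}
best, best_area = glue_logic_partitioning(start, 100, 100, toy_area)
print(best, best_area)
# {'a3': 300, 'a4': 200, 'a5': 0, 'a6': 200} 1300
```

As in the pseudocode, the second loop starts from the state left by the first loop (a4 emptied), not from the best split found so far, so the two directions are explored one after the other rather than jointly.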
Complexity analysis: The seed-based multiway partitioning uses the bucket-list
data structure [FiMa82], which has a complexity of O(p) for each iteration, where
p is the number of pins. There are two iterative processes taking O(m_1) and O(m_2)
time, where m_1 is the number of strips in Right_block and m_2 is the number of
strips in Bottom_block. The complexity of glue-logic partitioning algorithm is 
4.6 Results 
SLAM currently runs on SUN3/SUN4 workstations under the UNIX operat-
ing system. Several examples have been tested. The layouts were generated using 
a 3µm CMOS technology. 
The first example is a controlled counter [Arms89] that consists of
approximately 50% sliceable components and 50% non-sliceable components. The final
floorplan and layout are shown in Figure 4.7 and Figure 4.8. It consists of an
unfolded stack and a folded stack with a glue-logic block. The second example
is the MARK1 simple computer [SiBN82], which includes 32, 16, 13, and 3 bit
register-transfer components and simple gates (Figure 4.9 and Figure 4.10). The
register-transfer schematics of both examples were generated by VSS [LiGa88].
The third example is the digital section of a DSP chip consisting of an ALU,
registers, flip-flops, a shifter, counters, latches and simple gates. The final layout is
shown in Figure 4.11.
Using the same layout generators, we compared the layouts generated using
our partitioning algorithms with the sliced-layout architecture to those generated
using manual floorplanning with traditional layout architectures. In the second case,
the register-transfer schematics were first partitioned into modules consisting of
bit-sliced and glue-logic components. For example, the controlled-counter example
was partitioned into a set of bit-sliced units (such as an up/down counter, registers
and drivers) and a glue-logic unit. Since the bit-sliced units of this example have
varying bit widths, we used the layout architecture of bit slices with a routing
channel. Each bit-sliced unit was laid out individually. Then, we used an interactive
floorplanner to find the minimum area floorplan by placing units using an
exhaustive search. To estimate the wire length on the critical path, we first identified
the components on the critical path from the final layout and then measured the
wire length connecting all those components. The results in Figure 4.12 show that
the layouts were 10% smaller and the wire lengths on critical paths were 20%-25%
shorter when our partitioning algorithms and layout architecture were used.
We have also tested the glue-logic partitioning capability on an example of
515 gates (approximately 3000 transistors), 254 I/O ports, and 900 nets. The
algorithm partitioned the design into five blocks. The final layout is shown in
Figure 4.13. We have tested the partitions with and without seed clustering. The
experimental results show that, using seed clustering, the total wire length on the
critical path is 15% shorter.
[Figure 4.7: The floorplan of the controlled counter example.]
[Figure 4.8: The layout of the controlled counter example.]
[Figure 4.9: The floorplan of the MARK1 simple computer.]
[Figure 4.10: The layout of the MARK1 simple computer.]
[Figure 4.11: The layout of the DSP example.]

(a) Total area
Example              Our (A) µm^2    Manual (B) µm^2    A/B
Controlled counter   450,328         498,883            .902
DSP                  7,056,000       7,896,042          .893
MARK1                11,220,000      12,701,040         .883

(b) Critical path wire length
Example              Our (A) µm      Manual (B) µm      A/B
Controlled counter   594             765                .776
DSP                  2,665           3,650              .730
MARK1                3,885           4,950              .784

Figure 4.12: The comparisons of our partitioning and floorplanning with a manual
partitioning and floorplanning: (a) total area, (b) the critical path wire length.
[Figure 4.13: The layout of a glue-logic partitioning example.]
4.7 Conclusions
In this chapter, we described a new partitioning methodology for layout
generation from register-transfer netlists based on the sliced-layout architecture. We
described a new algorithm for netlist partitioning and one for partitioning of
bit-sliced components. We also described a seed-based multiway partitioning
algorithm based on area capacity. Our partitioning algorithms are carried out in a
top-down manner that generates the layout by considering the component layout
style, floorplan and critical paths simultaneously. The preliminary results show
that this approach improves the overall area utilization and minimizes the critical
wire lengths, which in turn yields better performance.
Currently, SLAM uses a simple linear-folding method for stack partitioning.
More sophisticated folding techniques [LaGW91], such as interleave folding, should
be studied further. Furthermore, in order to generate a complete chip, SLAM
needs to incorporate a general floorplanner and an I/O-pad placement and routing
algorithm.
Chapter 5 
Quality Measures 
Traditionally, the number and size of functional units, storage units and
connections, or the number of AND, OR and NOT operators in the Boolean
expression of the unit, are used as the area quality measure in behavioral
synthesis. However, these area measures assume that layout area is directly
proportional to the number and size of RT components and do not take into
account layout technology factors such as layout styles, component libraries,
and the impact of floorplanning, placement and routing. These factors often
greatly affect the final layout of the design. Similarly, the number of control
steps is usually used as the performance quality measure in behavioral
synthesis. This performance measure is valid only if the clock cycle is fixed.
Otherwise, the number of control steps does not reflect the total execution
time, which is equal to the product of the number of control steps and the
clock cycle.
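The dependence on the clock cycle can be illustrated with a small sketch. The
two schedules and their numbers below are hypothetical, chosen only to show
that a design with fewer control steps is not necessarily faster.

```python
def execution_time(control_steps: int, clock_cycle_ns: float) -> float:
    """Total execution time = number of control steps x clock cycle."""
    return control_steps * clock_cycle_ns


# Hypothetical schedules: B needs fewer control steps but a longer clock cycle.
time_a = execution_time(control_steps=10, clock_cycle_ns=50.0)
time_b = execution_time(control_steps=8, clock_cycle_ns=70.0)

# Comparing step counts alone would prefer B; comparing times prefers A.
faster = "A" if time_a < time_b else "B"
print(time_a, time_b, faster)
```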
This chapter presents two quality measures, area and performance, used to
support design decisions and to determine the quality of the final synthesized
design in behavioral synthesis. The proposed layout model takes into account
most technology factors, such as layout architectures and technology mapping,
and thus provides more accurate estimates than previously proposed models. Two
main factors of quality measures, accuracy and fidelity, are also addressed in
this chapter.
The remainder of this chapter is organized in the following manner. Section
5.1 describes the relationship between structural and physical designs. Section
5.2 describes the area measures for the datapath and the control unit. Section
5.3 describes the performance measures, including datapath delay, control delay
and clock estimation. Section 5.4 presents the experimental results. Finally,
Section 5.5 concludes our approach.
5.1 The Relationship between Structural and 
Physical Designs 
Behavioral synthesis transforms an input behavioral description into a
structural design composed of a datapath netlist and a control unit. Typically,
the datapath netlist consists of a set of generic register-transfer (RT)
components (e.g., functional, interconnect and storage units). The control unit
specifies the control signals for executing register transfers in each state,
as well as the sequencing for the design's next state (Figure 5.1).
Figure 5.1: The relationship between structural and physical designs.
The process of generating fabrication data for custom or semicustom
technologies from a structural design consists of several steps (Figure 5.1),
including technology mapping, module generation, floorplanning, placement and
routing. Technology mapping assigns real components from a physical library to
the generic components in the structural design. It is usually followed by some
optimization procedures to reduce the total area and delay of the design. We
also assume that the technology mapper selects different layout styles for
different parts of the design. For example, regularly structured units, such as
adders, subtracters, ALUs, multiplexers and registers, can be mapped into
bit-sliced stacks, general cells or standard cells; the control-state table can
be mapped into a PLA or standard cells; and storage units, such as RAMs, ROMs
and register files, as well as functional units, such as multipliers, are
mapped into macro cells. The design may contain a mixture of arbitrarily sized
bit-sliced stacks, general cells, macros and standard-cell blocks, which we
simply refer to as "modules". Module generators perform floorplanning and
routing of each module independently. The chip floorplanner determines the
positions and interconnections of modules and generates the chip layout.
The total area of a design is the sum of the area of its modules, I/O pads and
pad drivers, the chip routing areas and the remaining wasted areas, as
illustrated in Figure 5.1. Since macros are predesigned, their areas and shapes
can be obtained directly from component libraries. However, the areas and
shapes of datapath and control modules mapped into bit-sliced stacks, standard
cell blocks or PLA macrocells vary greatly depending on their layout styles and
architectures. In the next two sections, I will discuss area and performance
measures for datapaths and control units using some sample layout styles.
5.2 Area Measures 
This section describes area measures for the datapath and control unit of the 
FSMD model using two different layout styles for CMOS technology. 
5.2.1 Datapath 
A datapath consists of a set of regularly structured RT components, such as
ALUs, multiplexers, latches, drivers and shifters. Datapath layout is
accomplished with a stack of functional and storage units that are placed one
above the other. Each unit consists of bit slices and all units have bit slices
of the same width. However, bit slices in different units may be of different
heights. All bit slices are aligned starting with the least-significant bit
(LSB) and distinct units are stacked on top of one another. Thus, the stack
grows horizontally when the bit width increases, and it grows vertically when
the number of units increases (Figure 5.3(a)). Each bit slice of a unit may be
a handcrafted custom cell or may be implemented with one row of connected
standard cells, as shown in Figure 5.2(a) and (b), respectively. The difference
between custom and standard-cell styles is in the layers used for routing of
control and data wires, the use of custom or standard cells, and the routing of
data lines over the cells or in a separate channel.
Figure 5.2: Two datapath layout architectures using: (a) custom cells, (b)
standard cells.
In the first layout architecture (Figure 5.3(b)), diffusion strips for P and
N transistors are placed horizontally. Power and ground wires run horizontally
in the first metal layer. The control lines common to different bit slices in
each unit also run horizontally in the first metal layer. Data lines connecting
distinct units in each bit slice run vertically in the second metal layer. In
the second layout architecture (Figure 5.3(c)), a bit slice of each unit
consists of one or more standard cells. P and N diffusion strips are placed
vertically. Power and ground wires run vertically in the first metal layer.
Control lines run over the standard cells in the second metal layer. Data lines
are placed in the routing channel and run vertically in the first metal or the
polysilicon layer. The connections between standard cells inside each bit slice
are also placed in the routing channel.
Figure 5.3: The layout models: (a) datapath stack, (b) custom cell architecture,
(c) standard cell architecture.
To compute the height (Hdp) of a bit slice (Figure 5.3(a)), we observe that
Hdp is proportional to the number of transistors in the bit slice. Each bit
slice of a unit in Figure 5.3(c) consists of several diffusion strips separated
by gaps. The transistors on each diffusion strip are separated by
metal-diffusion contacts or by the minimum poly-to-poly spacing. Thus, the
width of a unit (Wunit) in Figure 5.3(c) can be computed as the product of the
number of transistors (tr(unit)) and the transistor-pitch coefficient (α) in
µm/transistor. α is obtained by averaging the ratio of cell width to the number
of transistors per cell over all units in the library. Thus,
$$W_{unit} = \alpha \times tr(unit). \tag{5.1}$$
Consequently, the height of the bit-sliced stack of n units is
$$H_{dp} = \sum_{i=1}^{n} W_{unit_i} = \alpha \times \Big(\sum_{i=1}^{n} tr(unit_i)\Big). \tag{5.2}$$
Equation 5.2 holds for the standard-cell architecture even if each bit slice
is implemented with two or more rows of standard cells, although a different
coefficient α' must be used in that case. Thus, the height of the bit-sliced
stack of n units with an m-row implementation is
$$H_{dp} = \alpha' \times \Big(\sum_{i=1}^{n} tr(unit_i)\Big) / m. \tag{5.3}$$
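Equations 5.1 to 5.3 can be sketched as follows. The function names, the
transistor counts and the coefficient values are illustrative assumptions; in
practice the coefficients would be averaged from the component library as
described above.

```python
def unit_width(transistors: int, alpha: float) -> float:
    """Equation 5.1: W_unit = alpha * tr(unit)."""
    return alpha * transistors


def stack_height(transistors_per_unit, alpha: float, rows: int = 1) -> float:
    """Equations 5.2 and 5.3: H_dp = alpha * sum(tr(unit_i)) / m.

    For rows > 1 the caller is expected to pass the multi-row coefficient
    (alpha' in the text) as `alpha`.
    """
    if rows < 1:
        raise ValueError("rows must be at least 1")
    return alpha * sum(transistors_per_unit) / rows


# Hypothetical bit slice: three units with 24, 16 and 40 transistors each.
units = [24, 16, 40]
h_single = stack_height(units, alpha=8.0)          # one row per bit slice
h_double = stack_height(units, alpha=9.0, rows=2)  # two rows, assumed alpha'
print(h_single, h_double)
```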
Similar assumptions can be made for the custom-cell architecture shown in
Figure 5.3(b). Although the P and N strips are placed horizontally in several
rows, Wunit can be computed by Equation 5.2 using a different transistor-pitch
coefficient α''. This assumption holds because the height of the unit slice
(Hcell) is a constant and the unit width (Wunit) thus must reflect the size of
the cell in number of transistors.
The width Wbit of a bit slice is equal to the sum of the height of the unit
slice (Hcell) and the height of the routing channel (Hch). For both layout
architectures Hcell is a constant, since all unit slices are predesigned to be
of the same height. Hch is calculated as the product of the wire pitch (β) and
the difference between the number of estimated routing tracks (Trkest) required
to completely connect all nets in one bit slice and the number of available
over-the-cell routing tracks (Trktop). Thus,
$$H_{ch} = \begin{cases} \beta \times (Trk_{est} - Trk_{top}), & \text{if } Trk_{top} < Trk_{est} \\ 0, & \text{if } Trk_{top} \geq Trk_{est} \end{cases} \tag{5.4}$$
where Trktop in the standard-cell architecture is equal to zero, and the
coefficient β is equal to the sum of the minimal wire width and the minimal
spacing between two metal wires. An estimate for the required number of tracks
in each bit slice can be obtained only after the position of each unit in the
bit slice is determined. A fast algorithm with pseudo-linear time complexity,
such as the min-cut algorithm [FiMa82], can be used for this purpose. The
required number of tracks can be estimated by the maximum density, which is
defined as the maximum number of connections across any cut perpendicular to
the channel. A better estimate can be obtained by using some simple routing
algorithm, such as the left-edge algorithm [HaSt71], which has O(n log n)
complexity, where n is the number of nets. Thus, the datapath area (Adp) can be
calculated as the product of the number of bits (bw) and the area of one bit
slice, i.e.,
$$A_{dp} = bw \times H_{dp} \times W_{bit} = bw \times H_{dp} \times (H_{cell} + H_{ch}). \tag{5.5}$$
Equation 5.5 gives an upper bound on the datapath area. The bound
is proportional to the product of the number of transistors and the number of 
routing tracks. The number of transistors can be approximated from the Boolean 
expressions describing each unit slice or counted from its schematic. The number 
of tracks can be approximated by the track density after a linear placement. Better 
estimates can be achieved with algorithms of higher complexity. Since the number 
of components in the datapath is small, those more accurate estimates are not 
necessarily computationally intensive. 
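The datapath area estimate can be sketched as follows, assuming that a linear
placement of the units is already known and that each net is described by the
lowest and highest unit positions it connects. The function names and all
numeric values are hypothetical; the track estimate uses the maximum cut
density rather than a full left-edge routing.

```python
def max_density(nets):
    """Maximum number of nets crossing any cut between adjacent unit positions.

    nets: iterable of (lo, hi) unit positions spanned by each net.
    """
    events = []
    for lo, hi in nets:
        lo, hi = min(lo, hi), max(lo, hi)
        if lo == hi:
            continue  # a net confined to one position crosses no cut
        events.append((lo, 1))
        events.append((hi, -1))
    # At equal positions (pos, -1) sorts before (pos, +1), so a net ending at
    # a position does not overlap a net starting at the same position.
    events.sort()
    density = best = 0
    for _, delta in events:
        density += delta
        best = max(best, density)
    return best


def channel_height(trk_est: int, trk_top: int, beta: float) -> float:
    """Channel height: beta * (Trk_est - Trk_top) if tracks are missing, else 0."""
    return beta * (trk_est - trk_top) if trk_top < trk_est else 0.0


def datapath_area(bit_width: int, h_dp: float, h_cell: float, h_ch: float) -> float:
    """Datapath area: bit width times the area of one bit slice."""
    return bit_width * h_dp * (h_cell + h_ch)


# Hypothetical standard-cell datapath: no over-the-cell tracks (Trk_top = 0).
nets = [(0, 2), (1, 3), (2, 4)]
tracks = max_density(nets)
h_ch = channel_height(tracks, trk_top=0, beta=3.0)
area = datapath_area(bit_width=8, h_dp=640.0, h_cell=40.0, h_ch=h_ch)
print(tracks, h_ch, area)
```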
5.2.2 Control Unit 
The control unit in an FSMD model can be described by the control state table,
which specifies the next-state and control signals as a function of present
states and conditional or status signals in tabular form. Figure 5.4(a) shows
a sample control state table with the input present-state and
conditional/status signals and the output next-state and control signals. The
present states are encoded as binary values p_k ... p_1 p_0, where
k ≥ ⌈log2 m⌉ - 1 and m is the number of states. Similarly, next states are
encoded as binary values r_k ... r_1 r_0. Each output c_i controls a
functional, storage or interconnection component in the datapath, whereas the
outputs r_i specify the next state.
The control unit consists of a state register and control logic. There are 
two commonly used techniques for implementation of control units: standard cells 
and programmable logic arrays (PLA). This section describes area estimates for 
standard-cell and PLA implementations based on sum-of-product expressions of 
next-state and control signals (Figure 5.4(b) ). In reality, a number of optimization 
procedures, such as logic minimization or PLA folding, are applied in order to 
reduce the size of the control logic. We ignore the impact of optimization and give 
an upper bound on the control-unit area. 
Standard-Cell Implementation
To simplify area estimates for the standard-cell implementation, we make a
number of assumptions. We assume that a product term in the sum-of-product
expression for each output signal includes each present-state signal and (in
the worst case) each conditional/status signal; these inputs to the product
term may be complemented. We also assume that each product term is implemented
by an AND gate and the sum-of-products by an OR gate, as illustrated in Figure
5.4(c) for output O1. Obviously, we can replace the AND-OR implementation with
an equivalent NAND-NAND implementation, as shown in Figure 5.4(d).
Figure 5.4: Control unit description: (a) state table, (b) Boolean equations for
output signals, (c) two-level AND-OR implementation, (d) two-level NAND-NAND
implementation, (e) standard cell layout style.
We assume that all the gates for implementation of the control unit are placed
in a single row of cells, the inputs appear at the top, and the outputs appear
at the bottom, as shown in Figure 5.4(e). We also assume that all the gates
needed for the implementation of an output signal are clustered together, as
shown for signal O1 in Figure 5.4(e). This requirement for strong clustering
prevents sharing of an AND gate between two expressions with the same product
term. The layout area of the control unit using the standard-cell
implementation (Asc) is then equal to the product of width (Wsc) and height
(Hsc), that is,
$$A_{sc} = W_{sc} \times H_{sc}, \tag{5.6}$$
where Wsc is proportional to the number of transistors and Hsc is proportional
to the number of routing tracks. The number of transistors can be computed from
the sum-of-product expression for each output signal.
In CMOS technology, each n-input AND or OR gate has 2n + 2 transistors.
Note that n-input NAND and NOR gates need only 2n transistors. Since each
product term in a sum-of-product expression is implemented with one AND gate
and the sum with one OR gate (e.g., Figure 5.4(c)), we can compute the required
number of transistors. Each literal in a product term contributes 2
transistors, each product term contributes 2 transistors in the AND gate and 2
transistors in the OR gate, and the OR gate contributes an additional 2
transistors.
Let occur(O_i) and term(O_i) be the number of occurrences of literals and the
number of terms in the sum-of-product expression of the signal O_i,
respectively. Let tr(Reg) be the number of transistors in a one-bit state
register. Thus, for an AND-OR implementation of the control unit, the width of
the control unit is computed as
$$W_{sc} = \alpha \times \Big(\Big(\sum_{i=1}^{\lceil \log_2 m \rceil + n} \big(2\,term(O_i)\,occur(O_i) + 4\,term(O_i) + 2\big)\Big) + \big(\lceil \log_2 m \rceil \times tr(Reg)\big)\Big), \tag{5.7}$$
where m is the number of states, n is the number of control signals and α is
the transistor-pitch coefficient as defined in Section 5.2.1.
The height Hsc is computed as the sum of the standard-cell height (Hcell)
and the channel height (Hch). Hch is proportional to the number of tracks used
by input signals and internal nets connecting AND and OR gates. We assume
that each input signal requires two tracks for the true and complemented
values. We assume that there are ⌈log2 m⌉ state signals and C
conditional/status signals. The maximum number of tracks required for routing
internal nets in each cluster is equal to the number of terms in the
sum-of-product expression for a particular output signal. Since clusters do not
overlap, the maximum number of tracks required to route all internal nets is
equal to the largest number of terms used in the sum-of-product expression of
any output signal. Thus, the height of the control unit Hsc is
$$H_{sc} = H_{cell} + \beta \times \Big(2\,(\lceil \log_2 m \rceil + C) + \max_i term(O_i)\Big), \tag{5.8}$$
where β is the wire pitch in the channel.
Figure 5.5: Different aspect ratios of the control logic: (a) one-row
implementation, (b) three-row implementation.
Although we assumed a single-row layout, control logic modules with different
aspect ratios are needed in reality. Different aspect ratios can be obtained by
laying out the control logic module in several rows. We approximate this
folding process by assuming that we can evenly partition the single-row layout
of Figure 5.5(a) into three rows of equal width, as shown in Figure 5.5(b). We
assume that input and output pins are positioned at the top and bottom of the
control unit. Further, we assume that inputs reach all rows using existing
polysilicon lines in each row as feedthroughs, and output nets are routed
vertically (over the cells) in metal2. We also assume that output clusters are
not split across folded rows. Then, the width of such a multiple-row module is
Wsc/R, where R is the number of rows. The height of the module is the sum of
all row heights. The height of each row can be computed by Equation 5.8, in
which only the internal nets of clusters in that particular row contribute to
the channel density of that row. Since Hch × R ≥ Hch1 + Hch2 + ... + HchR, we
can use Hch as an upper bound on the channel height for each row. With these
assumptions, the total height of the folded implementation (Hblock) is Hsc × R,
while its width is Wsc/R. Hence, we can still use Equation 5.6 as an estimate
for control-logic layouts of different aspect ratios.
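The height estimate and the folding approximation can be sketched as follows.
The track count follows the assumptions stated above (two tracks per input
signal plus the largest number of terms of any output); the function names and
the numeric values are hypothetical.

```python
def control_height(m_states, n_cond, terms_per_output, h_cell, beta):
    """Single-row control-unit height: cell height plus channel height."""
    state_bits = (m_states - 1).bit_length()  # equals ceil(log2(m_states))
    input_tracks = 2 * (state_bits + n_cond)  # true and complemented values
    internal_tracks = max(terms_per_output)   # clusters do not overlap
    return h_cell + beta * (input_tracks + internal_tracks)


def fold(w_sc, h_sc, rows):
    """Fold a single-row layout into `rows` rows of equal width."""
    return w_sc / rows, h_sc * rows


# Hypothetical control unit: 3 states, 3 status signals, 4 outputs.
h_sc = control_height(m_states=3, n_cond=3, terms_per_output=[3, 1, 2, 2],
                      h_cell=40.0, beta=3.0)
w_sc = 1200.0
w_block, h_block = fold(w_sc, h_sc, rows=3)

# Folding changes the aspect ratio but not the estimated area.
print(h_sc, w_block, h_block, w_sc * h_sc == w_block * h_block)
```

Because the per-row channel height is bounded by the single-row value, the
folded estimate stays an upper bound rather than an exact area.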
Programmable Logic Array 
A programmable logic array (PLA) is frequently used to implement combinatorial
and sequential logic, and in particular, the control units of FSMDs. A PLA
consists of AND and OR arrays supported by input and output buffers, input and
output latches and product-term buffers, as shown in Figure 5.6. Input buffers
are needed to drive the AND array, product-term buffers to drive the OR array
and output buffers to drive the external logic. The input and output latches
are used for the sequential logic.
The width of the PLA module (WPLA) is the sum of the width of the input
AND array (Win), the width of the product-term buffers (Wp) and the width of
the OR array (Wout) (Figure 5.6(b)). Win equals the number of inputs (n)
multiplied by the maximum of the latch width (lw) and the buffer width (bw).
Similarly, Wout equals the product of MAX(lw, bw) and the number of outputs
(m).
Figure 5.6: PLA layout model: (a) logic mapping, (b) layout model.
The height of the PLA (HPLA) is computed as the sum of the latch height
(lh), the buffer height (bh) and the height of the AND-OR plane. The height of
the AND-OR plane is determined by the product of the number of distinct product
terms (p) and the transistor-row pitch (r). Thus, the area of a PLA is
$$A_{PLA} = W_{PLA} \times H_{PLA} = (W_{in} + W_p + W_{out}) \times (l_h + b_h + p \times r). \tag{5.9}$$
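The PLA estimate described above can be sketched as follows. The function
name, the argument names and the dimensions in the example are assumptions made
for illustration; real values would come from the PLA generator.

```python
def pla_dimensions(n_inputs, n_outputs, n_terms,
                   latch_w, buffer_w, term_buffer_w,
                   latch_h, buffer_h, row_pitch):
    """Return (width, height, area) of the PLA layout model."""
    column_w = max(latch_w, buffer_w)          # MAX(l_w, b_w)
    w_in = n_inputs * column_w                 # input AND array
    w_out = n_outputs * column_w               # OR array
    width = w_in + term_buffer_w + w_out
    height = latch_h + buffer_h + n_terms * row_pitch
    return width, height, width * height


# Hypothetical PLA: 5 inputs, 4 outputs, 6 distinct product terms.
width, height, area = pla_dimensions(
    n_inputs=5, n_outputs=4, n_terms=6,
    latch_w=10.0, buffer_w=12.0, term_buffer_w=20.0,
    latch_h=15.0, buffer_h=10.0, row_pitch=8.0)
print(width, height, area)
```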
5.3 Performance Measures 
We model the chip layout as a set of connected blocks that include control
units, datapaths, macrocells and memories, as shown in Figure 5.7. Typically,
a datapath consisting of a set of regularly structured components is
implemented using a bit-sliced stack, standard cells or macrocells. A control
unit is implemented using a PLA or standard cells. Macrocells include some
predefined components such as multipliers and barrel shifters. Memories include
register files, RAMs and ROMs.
Figure 5.7: Constituents of a chip.
The remainder of this section is organized in the following manner. Section
5.3.1 describes the electrical models, including wiring and component delays.
The following sections describe the delay models for the blocks shown in
Figure 5.7. Since macrocells and memories are usually available in a library as
predesigned blocks, we assume that timing information for macrocells and
memories is provided by the library. Sections 5.3.2 and 5.3.3 describe the
delay models for the datapath and the control unit. Section 5.3.4 describes the
inter-block wiring delay model. Finally, Section 5.3.5 describes the
clock-period model.
Figure 5.8: Wire: (a) RT model, (b) equivalent RC delay model.
5.3.1 Electrical Models 
The lumped RC model, also called the Elmore delay model [PeRu81], is
widely used for delay calculation. In this model, the propagation delay along a
path from the start point to the end point (tp(start, end)) is computed by
lumping all of the resistances R_j and capacitances C_k along the path and
taking the product of the two sums, that is,
$$t_p(start, end) = \sum_j R_j \times \sum_k C_k. \tag{5.10}$$
We can use Equation 5.10 to obtain the delay of a connecting wire between
two components, as shown in Figure 5.8(a), or between two blocks, as shown in
Figure 5.7. In CMOS technology we model a component as having an input
capacitance (Cin) and an output resistance (Rout), as shown in Figure 5.8(b).
For the connecting wire, we use the well-known π-model, which models a wire as
an input capacitance (Cw/2), a wire resistance (Rw) and an output capacitance
(Cw/2).
Since a wire is a thin sheet of metal of a fixed thickness, defined by the
fabrication process, and is laid out as rectangular segments, the wire
resistance is equal to the product of the sheet resistance (Rs) in Ohm/square
and the ratio of the wire length (Lw) to the wire width (Ww), that is,
Rw = Rs(Lw/Ww). The wire capacitance (Cw) is equal to the product of the wire
area and the ratio of the dielectric constant (ε) to the wire thickness (t),
that is, Cw = (LwWw)(ε/t).
We can compute the propagation delay of a wire net netk (tp(netk)) used by
a component (compi) to drive load components (compj, 1 ≤ j ≤ n) as
$$t_p(net_k) = \big(R_{out}(comp_i) + R_w\big)\Big(C_w + \sum_{j=1}^{n} C_{in}(comp_j)\Big). \tag{5.11}$$
Thus, the delay for signals to propagate from the input of compi, through netk,
to one of compi's driven components compj, is
$$t_p(comp_i, comp_j) = t_p(comp_i) + t_p(net_k), \tag{5.12}$$
where tp(compi) is the internal delay of component compi.
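The wire and net delay model can be sketched as follows. The function names
are illustrative, the capacitance per unit area stands in for the ratio ε/t,
and the numeric values (in Ohm, fF and µm) are hypothetical rather than taken
from any process.

```python
def wire_resistance(r_sheet, length, width):
    """R_w = R_s * (L_w / W_w), with R_s in Ohm per square."""
    return r_sheet * length / width


def wire_capacitance(c_area, length, width):
    """C_w = (L_w * W_w) * c_area, where c_area plays the role of eps / t."""
    return length * width * c_area


def net_delay(r_out, r_w, c_w, load_caps):
    """Equation 5.11: (R_out + R_w) * (C_w + sum of load input capacitances)."""
    return (r_out + r_w) * (c_w + sum(load_caps))


def stage_delay(t_component, t_net):
    """Internal delay of the driving component plus the delay of its net."""
    return t_component + t_net


# Hypothetical net: 1000 um long, 2 um wide, driving two 20 fF inputs.
r_w = wire_resistance(r_sheet=0.05, length=1000.0, width=2.0)   # Ohm
c_w = wire_capacitance(c_area=0.03, length=1000.0, width=2.0)   # fF
t_net_ps = net_delay(r_out=1000.0, r_w=r_w, c_w=c_w,
                     load_caps=[20.0, 20.0]) * 1e-3             # Ohm*fF -> ps
t_total_ps = stage_delay(t_component=400.0, t_net=t_net_ps)
print(r_w, c_w, t_net_ps, t_total_ps)
```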
5.3.2 Datapath Delay Model 
Computing the propagation delay from one datapath component to another
in the same datapath block requires two elements: the internal delay of the
component and the wiring delay. Typically, the internal delays of components
are provided by the targeted component library.
The actual wire length can be determined only after the completion of
computationally expensive datapath placement and routing procedures. For
simplicity, we assume that the average wire length of a net connecting any two
units in the same datapath is equal to half of the datapath height (Hdp)
(Figure 5.2). In the first layout architecture, Hdp is equal to the sum of the
heights of all datapath units, whereas in the second layout architecture, Hdp
is proportional to the number of transistors in the bit slice and the
transistor pitch, as described in Section 5.2.1. Thus, the average wiring
resistance and capacitance are calculated as Rw(DP) = Rs((Hdp/2)/Ww) and
Cw(DP) = Cs(Hdp/2)(Ww), where Cs is the wire capacitance per unit area and Ww
is the width of the metal1 wire for the first layout architecture and the width
of the metal2 wire for the second layout architecture. Thus, the propagation
delay between two datapath components (compi and compj) via a net (netk) is
computed using Equations 5.11 and 5.12.
5.3.3 Control Delay Model 
There are two commonly used layout architectures for a control unit: random 
logic and PLA. Since the timing information for a PLA is usually provided by its 
generator, in this section I describe the random-logic timing-model for a control 
unit. 
In the control-unit model described in Section 5.2.2, each next-state and
control signal is represented as a sum of products of the present-state and
conditional/status signals, as shown in Figure 5.4(b). The product term is implemented
with AND gates and the sum with OR gates. However, the target component 
library will usually provide AND and OR gates with a limited number of inputs. 
Thus, to realize the impact of the technology mapping, the sum and product terms 
need to be decomposed into a multi-level implementation when the large AND or 
OR gates are not available in the target library. 
The multi-level decomposition aims to produce an implementation with the
minimum number of levels. This is guided by the fact that a multi-level
implementation of a product term with l literals using AND gates with a
maximum of n inputs is in the form of an n-ary tree [ChWG91]. Similarly, the
same decomposition scheme can be used to obtain a multi-level OR implementation
of the sum term.
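The depth of such a decomposition can be sketched as follows. This is only an
illustration of the n-ary tree property, not the decomposition procedure of
[ChWG91]; the function name is an assumption.

```python
def tree_levels(leaves: int, fan_in: int) -> int:
    """Minimum number of gate levels needed to combine `leaves` signals
    using gates with at most `fan_in` inputs (an n-ary tree)."""
    if leaves < 1 or fan_in < 2:
        raise ValueError("need at least one signal and a fan-in of two or more")
    levels = 0
    signals = leaves
    while signals > 1:
        signals = -(-signals // fan_in)  # ceiling division: gates at this level
        levels += 1
    return levels


# A five-literal product term built from AND gates with at most three inputs
# needs two levels: 5 signals -> 2 gates -> 1 gate.
print(tree_levels(5, 3))
```

The loop computes the smallest depth d with fan_in ** d >= leaves without
relying on floating-point logarithms.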
The capacitive load of each control signal, Ccuload, that drives the datapath
units is proportional to the size (bit width) of the datapath. If Ccuload is
high, buffers are usually inserted to reduce the delay caused by the heavy
load. Let us examine the loading effect in our model. If the buffer is not
inserted, the last OR gate (i.e., the gate that is represented by the root of
the OR tree) has to drive Ccuload. Thus, the delay caused by this load equals
Rout(OR(m)) × Ccuload, where m is the maximum number of inputs of an OR gate
available in the library. However, if a buffer, BUF, is inserted, the delay
caused by the load and the additional buffer equals
((Rout(BUF) × Ccuload) + tp(BUF)). Therefore, to account for the influence of
buffer insertion, we assume that each output of the control logic is driven by
a buffer, BUF, if
(Rout(BUF) × Ccuload) + tp(BUF) - (Rout(OR(m)) × Ccuload) < 0.
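The buffer-insertion condition can be sketched as a simple comparison. The
function name and the numeric values (resistances in kOhm, capacitances in pF,
delays in ns) are hypothetical.

```python
def use_buffer(r_out_or, r_out_buf, t_buf, c_load):
    """Return True if inserting a buffer reduces the delay of driving c_load."""
    unbuffered = r_out_or * c_load          # last OR gate drives the load
    buffered = r_out_buf * c_load + t_buf   # buffer drives the load
    return buffered - unbuffered < 0


# Hypothetical library values: the buffer has a lower output resistance
# but adds 0.3 ns of its own delay.
light_load = use_buffer(r_out_or=2.0, r_out_buf=0.5, t_buf=0.3, c_load=0.1)
heavy_load = use_buffer(r_out_or=2.0, r_out_buf=0.5, t_buf=0.3, c_load=0.4)
print(light_load, heavy_load)
```

With these values the buffer pays off only once the load exceeds
t_buf / (r_out_or - r_out_buf), which is 0.2 pF here.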
Figure 5.9 shows an example of a multi-level implementation of a
sum-of-products expression. Each product term in the sum-of-products expression
of Figure 5.4(b) requires a 5-input AND gate. If the targeted library provides
only AND gates with a maximum of three inputs, all product terms are decomposed
into a multi-level implementation, which is represented by the ternary tree
shown in Figure 5.9(a). The equivalent gate implementation of this ternary tree
is shown in Figure 5.9(b).
In our model, we assume that the random logic is laid out as strips of
standard or custom cells with input ports entering at the top and output ports
exiting through the bottom. The number of layout strips is predetermined by the
floorplanner in such a way that the total chip area is minimized. In addition,
we assume that all gates that implement an output signal are placed closely in
a cluster, as shown in Figure 5.9(c). The propagation delay from any input port
of the control logic to an output port Oi consists of two elements: the gate
delay (tp(gate(Oi))) and the wire delay (tp(net(Oi))).
Figure 5.9: Random-logic model: (a) decomposition of a product term, (b) a
multi-level implementation, (c) the layout model.
Gate delay is defined as the sum of delays of gates along the critical path.
Using the decomposition scheme described earlier, the gate delay (tp(gate(Oi)))
can be formulated as the sum of the delays of gates in the AND tree, the OR
tree and the output buffer, that is,
$$t_p(gate(O_i)) = \big(AND_{l\text{-}node} \times t_p(AND(n))\big) + \big(OR_{l\text{-}node} \times t_p(OR(m))\big) + t_p(BUF), \tag{5.13}$$
where
tp(AND(n)) is the propagation delay of an n-input AND gate, and
tp(OR(m)) is the propagation delay of an m-input OR gate.
Wire delay, tp(net(Oi)), is defined as the sum of the delays of wires on the
critical path. Using our layout model, wires that connect gates in the same
cluster are relatively short. Thus, the wiring resistance and capacitance of
these nets are negligible. Hence, the wiring delay tp(net(X, Y)) of a net that
connects the output of a gate of type X to an input of a gate of type Y in the
random logic can be derived from Equation 5.11 with Rw and Cw equal to 0, that
is, tp(net(X, Y)) = Rout(X) × Cin(Y).
Using properties of the decomposition tree, tp(net(Oi)) can be formulated
as the sum of the wiring delays of nets in the AND tree, nets in the OR tree,
the net that connects the AND and OR trees, and the net that connects the last
OR gate (i.e., the OR gate that is represented by the root of the OR tree) to
the output buffer, that is,
$$t_p(net(O_i)) = \big(AND_{l\text{-}net} \times t_p(net(AND(n), AND(n)))\big) + \big(OR_{l\text{-}net} \times t_p(net(OR(m), OR(m)))\big) + t_p(net(AND(n), OR(m))) + t_p(net(OR(m), BUF)). \tag{5.14}$$
Thus, the propagation delay from any input port to an output port Oi is
$$t_p(CU(O_i)) = t_p(gate(O_i)) + t_p(net(O_i)). \tag{5.15}$$
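The control-unit delay can be sketched as follows. The sketch assumes that the
node multipliers equal the number of gate levels in the AND and OR trees and
that each tree contributes one internal net per level beyond the first; these
interpretations, the function names and the library values (kOhm, pF, ns) are
assumptions made for illustration.

```python
def short_net_delay(r_out_driver, c_in_load):
    """Delay of a short intra-cluster net: R_out(X) * C_in(Y), with R_w = C_w = 0."""
    return r_out_driver * c_in_load


def control_output_delay(and_levels, or_levels, t_gate, r_out, c_in):
    """Gate delay plus wire delay along the critical path of one output.

    t_gate, r_out and c_in are dicts keyed by "AND", "OR" and "BUF".
    """
    gate = (and_levels * t_gate["AND"]
            + or_levels * t_gate["OR"]
            + t_gate["BUF"])
    and_nets = max(and_levels - 1, 0)  # assumed: nets between AND levels
    or_nets = max(or_levels - 1, 0)    # assumed: nets between OR levels
    wire = (and_nets * short_net_delay(r_out["AND"], c_in["AND"])
            + or_nets * short_net_delay(r_out["OR"], c_in["OR"])
            + short_net_delay(r_out["AND"], c_in["OR"])
            + short_net_delay(r_out["OR"], c_in["BUF"]))
    return gate + wire


# Hypothetical library: two AND levels (5 literals, 3-input gates), one OR level.
t_gate = {"AND": 0.4, "OR": 0.5, "BUF": 0.3}
r_out = {"AND": 2.0, "OR": 2.0, "BUF": 0.5}
c_in = {"AND": 0.02, "OR": 0.02, "BUF": 0.05}
delay_ns = control_output_delay(2, 1, t_gate, r_out, c_in)
print(delay_ns)
```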
5.3.4 Inter-Block Wiring Delay Model 
Similarly, we can use Equation 5.11 to compute the propagation delay of a
net $net_k$ that connects two blocks A and B on a chip. To obtain the wire length
of the net, we use a simple cluster-growth algorithm to form the chip floorplan.
In summary, the cluster is grown from the lower left corner and the algorithm
iteratively adds blocks on the top and right. The main reason for using this simplified
method is its low computational effort. The algorithm determines the
order in which blocks are placed according to the cost of the resultant area and the
connectivity of each block with those already placed. In other words, the objective
function selects a block $B_i \in B_{unplace}$ with $\max(Area(B_i)) \times \sum w(B_i, B_j)$, where
$B_j \in B_{place}$ and w is the number of wires. Subsequently, a placement position for
the selected block is determined with respect to the overall connectivity and the
aspect-ratio constraints.
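The selection step of such a cluster-growth pass might be sketched in Python as follows. This covers only the ordering of blocks, not the choice of a position on the top or right of the cluster. The function name, the dictionary layout, the reading of the objective as "area times the number of wires to already-placed blocks", and the rule that the largest block seeds the cluster are all assumptions made for illustration; the text does not state how the first block is chosen.

```python
def placement_order(areas, wires):
    """Return block names in the order a cluster-growth pass would pick them.

    areas: dict mapping block name -> area
    wires: dict mapping frozenset({a, b}) -> number of wires between a and b
    """
    unplaced = set(areas)
    # Assumption: the largest block seeds the cluster at the lower left corner.
    seed = max(sorted(unplaced), key=lambda b: areas[b])
    order = [seed]
    unplaced.remove(seed)

    while unplaced:
        def score(block):
            # Connectivity of the candidate with every block already placed.
            connectivity = sum(wires.get(frozenset((block, p)), 0) for p in order)
            return areas[block] * connectivity

        # sorted() makes tie-breaking deterministic (alphabetical order).
        chosen = max(sorted(unplaced), key=score)
        order.append(chosen)
        unplaced.remove(chosen)
    return order
```

With a datapath, a multiplier and a control unit as the three blocks, the pass would place the large, heavily connected blocks first and the small control unit last.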
As a result of the floorplan, each block in the chip is centered at a coordinate
(x, y). The length of a net connecting any two blocks A and B is estimated as the
Manhattan distance between the centers of the two blocks. Thus, the inter-block
wiring resistance and capacitance are
$$R_w(net_k) = \left(\frac{|A_x - B_x|}{W_w} \times R_s\right) + \left(\frac{|A_y - B_y|}{W_w} \times R_s\right) \tag{5.16}$$

$$C_w(net_k) = (|A_x - B_x| \times W_w \times C_s) + (|A_y - B_y| \times W_w \times C_s) \tag{5.17}$$

where
$W_w$ is the width of the routing wire,
$A_x$ and $A_y$ are the x and y coordinates of block A, and
$B_x$ and $B_y$ are the x and y coordinates of block B.
The propagation delay of $net_k$ is then computed using Equation 5.11.
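The inter-block estimate can be sketched in a few lines of Python. The function name and the parameter names `r_sheet` (resistance per square) and `c_area` (capacitance per unit area) are assumptions introduced here for the two technology constants; the sketch only reproduces the Manhattan-distance length estimate and the length/width and length × width scaling described above.

```python
def interblock_rc(center_a, center_b, wire_width, r_sheet, c_area):
    """Estimate (Rw, Cw) of a net between two block centers given as (x, y) pairs."""
    # Manhattan distance between the block centers approximates the wire length.
    length = abs(center_a[0] - center_b[0]) + abs(center_a[1] - center_b[1])
    r_w = length / wire_width * r_sheet   # number of squares times sheet resistance
    c_w = length * wire_width * c_area    # wire area times capacitance per unit area
    return r_w, c_w
```

The returned pair would then be substituted for Rw and Cw in the delay expression of Equation 5.11.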
5.3.5 Clock Cycle Model
The clock cycle is determined by the worst register-to-register delay, which
includes the propagation delays in the control unit, in the datapath unit and
between blocks. In our implementation, the clock computation is based on the
FSMD, as shown in Figure 5.10 and described below.

Figure 5.10: FSMD clocking model (diagram labels: Control unit, Next-state logic, Path1).

In Figure 5.10, the critical path (Path1) runs from the State register, through
the Control logic, Datapath, and Next-state logic, and back to the State register.
Thus, the clock period is the sum of the propagation delays of the State register
($t_p(\text{State register})$), the Control logic ($t_p(\text{Control logic})$), the Datapath ($t_p(DP)$),
the Next-state logic ($t_p(\text{Next-state logic})$), and the setup time of the State register
($t_{setup}(\text{State register})$), that is,

$$t_{clock} = t_p(\text{State register}) + t_p(\text{Control logic}) + t_p(DP) + t_p(\text{Next-state logic}) + t_{setup}(\text{State register}) \tag{5.18}$$

$t_p(\text{Control logic})$ and $t_p(\text{Next-state logic})$ are computed using Equation 5.15.
$t_p(DP)$ is determined by the worst register-to-register delay in the datapath. For
example, a typical register-to-register delay in a datapath is shown in Figure 5.11.

Figure 5.11: The register-transfer path (diagram labels: Clock, R1, R2, R3, MAX(tp(R1), tp(R2)), MAX(tp(n1), tp(n2))).

It includes the delay through the source storage units ($MAX(t_p(R1), t_p(R2))$), the
functional unit ($t_p(FU)$), the connections ($MAX(t_p(n_1), t_p(n_2))$ and $t_p(n_3)$) and the setup
time of the destination storage unit ($t_{setup}(R3)$). Since the access time of a RAM
is slow and often takes several clock cycles, we consider the storage units R1, R2
and R3 to be registers or a register file. Connections $n_1$, $n_2$ and $n_3$ are implemented
as wires or interconnect units such as muxes or buses. Thus, for a single-cycle
operation $op_i$ or a single-cycle chaining operation, the register-to-register delay of
operation $op_i$ is computed as

$$MAX(t_p(R1), t_p(R2)) + MAX(t_p(n_1), t_p(n_2)) + t_p(FU) + t_p(n_3) + t_{setup}(R3) \tag{5.19}$$

and for an n-cycle operation $op_i$, the register-to-register delay per clock cycle of
operation $op_i$ is

$$(MAX(t_p(R1), t_p(R2)) + MAX(t_p(n_1), t_p(n_2)) + t_p(FU) + t_p(n_3) + t_{setup}(R3)) / n \tag{5.20}$$

Thus, the worst register-to-register delay per clock cycle among all operations is
the maximum, over all operations, of the delays given by Equations 5.19 and 5.20
(Equation 5.21).
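The clock-cycle model can be exercised with a short Python sketch. The function and parameter names are hypothetical, and the delay values a caller supplies are placeholders rather than library data; the arithmetic follows Equations 5.18 to 5.20, with the datapath delay taken as the worst per-cycle register-to-register delay over all operations.

```python
def reg_to_reg_delay(tp_r1, tp_r2, tp_n1, tp_n2, tp_fu, tp_n3, tsetup_r3, cycles=1):
    """Per-cycle register-to-register delay of one operation.

    cycles == 1 corresponds to Equation 5.19, cycles == n to Equation 5.20.
    """
    total = max(tp_r1, tp_r2) + max(tp_n1, tp_n2) + tp_fu + tp_n3 + tsetup_r3
    return total / cycles


def datapath_delay(per_operation_delays):
    """Worst per-cycle register-to-register delay among all operations."""
    return max(per_operation_delays)


def clock_period(tp_state_reg, tp_control, tp_dp, tp_next_state, tsetup_state_reg):
    """Equation 5.18: delay around the path through control logic and datapath."""
    return tp_state_reg + tp_control + tp_dp + tp_next_state + tsetup_state_reg
```

For instance, a single-cycle addition and a two-cycle multiplication would each be passed through `reg_to_reg_delay`, and the larger result would be used as the datapath term of `clock_period`.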
5.4 Results 
The experiments consist of two parts: area estimation and timing estimation. 
Section 5.4.1 describes the experiments on the area model. Section 5.4.2 describes 
the experiments on the timing model. 
5.4.1 Area Measure 
We have tested our layout models on 4 designs with 16 different implementations
of the elliptic filter benchmark (Figure 5.12). Each implementation uses
a different number of registers and muxes. The a coefficient was calculated based on
the VTI 1.5-µm datapath library [VTI88]. The final layouts were generated using
Mentor Graphics GDT tools. The layout architectures used in Figure 5.12 and
Figure 5.13 correspond to those described in Section 5.2.1, in which Layout
Architecture I uses 13 over-the-cell routing tracks for each bit slice. Since the
multiplier is treated as a macrocell, its area remains constant throughout all the
examples and is not included in the results. Figure 5.14 shows 64 area measures
using different combinations of layout architectures and muxes or buses (one
tri-state buffer for each mux input). The results show that 90% of the estimates
are within 90% accuracy.
** Area not including multiplier; areas in µm² per bit

                                                             Layout Architecture I      Layout Architecture II
Design  # of   # of Mux /    # trs.  # nets  # trks.         Est.     Actual   Est./    Est.     Actual   Est./
        Reg.   # Mux inputs                  actual (est.)   area     area     Actual   area     area     Actual
A       10     11 / 34       552     27      11 (15)         129,680  136,720  0.95     213,625  193,117  1.11
A       11     10 / 33       564     27      11 (12)         124,080  138,080  0.90     200,216  195,038  1.03
A       12     8 / 31        572     27      11 (12)         125,840  138,480  0.91     200,796  195,490  1.03
A       13     9 / 33        604     28      10 (15)         143,653  145,040  0.98     226,625  199,430  1.14
B       10     8 / 30        472     23      10 (11)         103,840  113,440  0.92     166,234  156,420  1.03
B       11     6 / 28        480     22      9 (10)          105,600  113,760  0.93     156,420  151,726  1.03
B       12     6 / 28        500     23      9 (11)          110,000  117,280  0.94     165,658  156,862  1.06
B       13     6 / 29        524     24      9 (10)          115,280  122,080  0.94     167,860  163,282  1.03
C       10     7 / 30        480     20      8 (11)          105,600  113,536  0.93     160,369  146,464  1.09
C       11     5 / 27        480     19      10 (12)         105,600  111,456  0.95     161,611  151,836  1.06
C       12     5 / 28        504     19      9 (11)          110,880  115,296  0.96     162,855  152,934  1.06
C       13     6 / 31        540     23      9 (11)          118,800  126,736  0.94     179,014  168,235  1.06
D       10     10 / 36       508     23      10 (11)         111,760  125,376  0.89     177,093  170,976  1.04
D       11     6 / 28        482     20      9 (11)          106,040  113,136  0.94     159,804  150,045  1.07
D       12     6 / 26        492     21      8 (10)          108,240  115,696  0.94     159,082  149,272  1.07
D       13     5 / 23        490     21      8 (11)          107,800  113,696  0.95     158,595  146,672  1.09

A: 17-step, 3-adder, 2-piped multipliers.
B: 19-step, 2-adder, 2-multiplier.
C: 21-step, 2-adder, 1-multiplier.
D: 19-step, 2-adder, 1-piped multiplier.

Figure 5.12: The datapath area estimates of the elliptic filter example with mux
implementation.
** Area not including multiplier; areas in µm² per bit

                                                                  Layout Architecture I      Layout Architecture II
Design  # of   # of Mux /    # trs.       # nets  # trks.         Est.     Actual   Est./    Est.     Actual   Est./
        Reg.   # Mux inputs                       actual (est.)   area     area     Actual   area     area     Actual
A       10     11 / 34       [illegible]  26      11 (15)         182,888  162,240  1.13     259,600  229,164  1.13
A       11     10 / 33       784          28      10 (14)         174,526  163,680  1.07     255,750  225,060  1.14
A       12     8 / 31        780          28      9 (13)          171,600  162,720  1.05     242,063  217,638  1.11
A       13     9 / 33        824          29      8 (12)          181,280  168,960  1.07     244,976  219,648  1.10
B       10     8 / 30        672          21      8 (10)          147,840  136,960  1.08     199,176  172,912  1.15
B       11     6 / 28        688          22      7 (9)           151,360  136,000  1.11     197,260  171,700  1.15
B       12     6 / 28        688          23      9 (10)          151,360  139,840  1.08     203,800  187,036  1.09
B       13     6 / 29        720          24      8 (10)          158,400  146,080  1.08     200,860  189,904  1.06
C       10     7 / 30        672          20      10 (10)         147,840  136,896  1.08     199,176  186,816  1.07
C       11     5 / 27        656          19      8 (8)           144,320  133,536  1.08     184,380  172,464  1.07
C       12     5 / 28        688          20      7 (8)           151,360  137,376  1.10     192,572  172,446  1.10
C       13     6 / 31        744          23      9 (8)           163,680  153,216  1.07     209,644  203,652  1.03
D       10     10 / 36       744          21      10 (13)         163,680  148,896  1.10     225,099  203,316  1.10
D       11     6 / 28        668          19      7 (9)           146,960  133,536  1.10     191,706  167,598  1.14
D       12     6 / 26        664          20      8 (9)           146,080  132,576  1.10     196,824  171,216  1.11
D       13     5 / 23        648          20      8 (8)           142,560  129,216  1.10     181,324  166,848  1.09

A: 17-step, 3-adder, 2-piped multipliers.
B: 19-step, 2-adder, 2-multiplier.
C: 21-step, 2-adder, 1-multiplier.
D: 19-step, 2-adder, 1-piped multiplier.

Figure 5.13: The datapath area estimates of the elliptic filter example with bus
implementation.
[Scatter plot: ratio of estimated area to actual area (vertical axis, about 0.90 to 1.10) versus area per bit (horizontal axis, 110 to 230, × 1000 µm²).]
Figure 5.14: The accuracy analysis of the datapath area estimates.
We have investigated the "fidelity" of our area estimates. Fidelity is another
crucial factor in a quality measure: it indicates the degree to which the estimated
results correspond to the actual results. In other words, fidelity is the deviation
from the average error over all design points. If the error over all design points
is always of the same magnitude, then fidelity is high. For instance, Figure 5.15
shows two examples, in which the solid line represents the actual results and the dashed
line represents the estimated results. Figure 5.15(a) shows estimates that
predict the actual results well; that is, if we have to select the minimum-cost design,
then design C will be selected since the estimate C' predicts the minimum cost.
Thus, the estimates in Figure 5.15(a) show high fidelity. On the other hand, the
estimates in Figure 5.15(b) show poor fidelity since design B will be selected as
the minimum-cost design according to the estimate B'. However, design B has the
highest cost.

[Plots of cost against design. (a) Good fidelity: actual costs B > A > D > C, estimated costs B' > A' > D' > C'. (b) Poor fidelity: actual costs B > A > D > C, estimated costs C' > A' > D' > B'.]
Figure 5.15: The fidelity analysis: (a) good fidelity, (b) poor fidelity.
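The selection argument behind fidelity can be made concrete with a small Python sketch. The function names are hypothetical, and the check shown here (whether the estimates pick the same minimum-cost design as the actual costs) is one simple reading of fidelity for design selection rather than the full definition.

```python
def selected_design(costs):
    """Name of the minimum-cost design in a dict mapping design -> cost."""
    return min(costs, key=costs.get)


def selects_true_minimum(actual, estimated):
    """True when the estimates lead to the same choice as the actual costs."""
    return selected_design(actual) == selected_design(estimated)
```

With actual costs ordered B > A > D > C, estimates ordered B' > A' > D' > C' select design C and pass the check, whereas estimates ordered C' > A' > D' > B' select design B and fail it, regardless of how close the individual numbers are.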
We compare the "fidelity" of seven different metrics, numbered 1 through 7
(Figure 5.16). The numbers in Figure 5.16 represent the percent
difference between the area of the predicted minimal-area implementation and the actual
minimal area of the design.

For each design, we first choose the minimum-cost implementation according
to each metric. For example, based on metric #5, we choose the implementation
with the minimum number of transistors as the best implementation
(Figure 5.12). For design D, the implementation with 11 registers, 6 muxes and 28
mux inputs (482 transistors) is chosen as the best implementation, for which the
areas are 113,136 µm² and 150,045 µm² using layout architectures I and II, respectively.
On the other hand, the actual minimal areas for design D are 113,136 µm²
and 146,672 µm². For design D with layout architecture I, the transistor metric
(metric #5) accurately predicts the minimum area. Since the percent difference
between the area of the predicted best implementation and the actual minimum area is
0, the entry for metric #5 and design D with layout architecture I in Figure 5.16(a)
is 0. On the other hand, for design D with layout architecture II, the area of the
predicted best implementation is 2.3% (150,045 µm² vs. 146,672 µm²) larger than the
actual minimum area of the design. Hence, the number in Figure 5.16(a) is 2.3.

Percent difference between the predicted best implementation and the actual minimum area of the design

                                                  Layout architecture I       Layout architecture II
Metric  Quality measure                           A      B      C      D      A      B      C      D
1       # Register                                0      0      1.87   10.82  0      3.09   0      16.60
2       # Mux input                               1.27   0.28   0      0.50   1.22   0      3.67   0
3       # Equivalent 2:1 mux                      0      0      0      0.50   0      3.09   3.67   0
4       # Register + # Equivalent 2:1 mux         0      0      0      0.50   0      3.09   3.67   0
5       # transistor                              0      0      0      0      0      3.09   3.67   2.30
6       # Register + # Mux input + # Unique net   1.29   0.28   0      0.50   1.23   0      3.67   0
7       Our layout model                          1.0    0      0      0      1.0    0      0      0
(a)

Percent difference between the predicted best implementation and the actual minimum area of the design

                                                  Layout architecture I       Layout architecture II
Metric  Quality measure                           A      B      C      D      A      B      C      D
1       # Register                                0      0.7    2.5    15.2   5.3    0.7    8.3    21.8
2       # Mux input                               0.3    0      0      0      0      0      0.01   0
3       # Equivalent 2:1 mux                      0.8    0      0      0      5.3    8.9    0.01   0
4       # Register + # Equivalent 2:1 mux         0      2.5    2.5    0      5.3    0.7    0.01   0
5       # transistor                              0      0      0      0      5.3    0      0.01   0
6       # Register + # Mux input + # Unique net   0      0      0      0      5.3    0.7    0.01   0
7       Our layout model                          0.3    0      0      0      0      0      0.01   0
(b)

Figure 5.16: Comparative study of the elliptic filter example with different design
quality measures: (a) mux implementation, (b) bus implementation.
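The way one entry of Figure 5.16 is obtained can be reproduced with a short Python sketch. The data are the four design-D rows of Figure 5.12 (register count, transistor count, and actual area per bit for layout architectures I and II); the function names and the dictionary keys are hypothetical.

```python
# Design D implementations from Figure 5.12 (mux implementation), actual areas.
DESIGN_D = [
    {"regs": 10, "trs": 508, "area_I": 125376, "area_II": 170976},
    {"regs": 11, "trs": 482, "area_I": 113136, "area_II": 150045},
    {"regs": 12, "trs": 492, "area_I": 115696, "area_II": 149272},
    {"regs": 13, "trs": 490, "area_I": 113696, "area_II": 146672},
]


def percent_difference(predicted_area, minimum_area):
    """How much larger the predicted-best area is than the true minimum, in percent."""
    return 100.0 * (predicted_area - minimum_area) / minimum_area


def metric_entry(implementations, metric, area_key):
    """Pick the implementation minimizing `metric`, then compare its actual area
    with the smallest actual area among all implementations."""
    chosen = min(implementations, key=lambda impl: impl[metric])
    minimum = min(impl[area_key] for impl in implementations)
    return percent_difference(chosen[area_key], minimum)
```

Calling `metric_entry(DESIGN_D, "trs", "area_II")` gives roughly 2.3, the entry discussed above for metric #5, and `metric_entry(DESIGN_D, "regs", "area_I")` gives roughly 10.82, the metric #1 entry for layout architecture I.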
Since all implementations using layout architecture I use fewer than 13 actual
routing tracks, they do not require any extra routing tracks. Hence, the area of
the datapath depends solely on the number of transistors. This is the reason
why metric #5 can predict the minimum-area implementations for all designs using
layout architecture I. Metrics #1, 2, 3 and 4 give poor predictions because register
and mux counts alone do not accurately predict the total number of transistors in
the datapath. Metric #6 also gives poor predictions because this metric considers
wiring area in terms of the number of unique nets, which is absent in this case. Our
layout model (metric #7) shows accurate predictions except for design A, due
to over-estimation of the number of routing tracks caused by our simple linear
placement method.

Using layout architecture II, both transistors and routing tracks make equal
contributions to the total area. Hence, design quality measures that do not
consider routing tracks, for example metrics #1, 2, 3, 4 and 5, do not predict layout
area well. Metric #6 does not do well because the number of unique nets does
not directly indicate the number of routing tracks. As for our layout model, the
results show consistent fidelity.

In addition, we have estimated the total area (including datapath, control,
and multiplier) of the elliptic filter benchmark for a 16-bit, 19-step, 2-adder,
1-piped-multiplier design. We have implemented two control-logic models, PLA and
random logic. The results show that our layout models can predict the datapath
area with 10% error, the PLA area with 18% error and the random logic area with
16% error.
5.4.2 Timing Measure 
We have tested our timing models on the elliptical filter benchmark. The
experiment is divided into three parts. First, we compare our timing models
for clock-period estimation against traditional performance measures by comparing
estimates with the actual timing. The main distinction between different
performance-estimation schemes is the granularity of the underlying model. A
realistic timing model should consider all delay constituents of a chip. In Section
5.3, we have provided timing models for each of these constituents. Second, we
determine the percentage contributed by these constituents to show that each
constituent does in fact contribute delay to the clock period. The amount of
delay contributed by each constituent of a chip varies across designs. Third, we
show that estimates from timing models can be used to guide behavioral synthesis
tools in the selection of design styles.

In the experiments, the clock period is computed using Equation 5.18. For
simplicity, we divide the clock period into three parts: datapath
delay, control unit delay and wire/load delay. Datapath delay includes the delays
of wiring, functional, interconnect and storage units as described in Equation 5.19.
Control unit delay includes the delays of the control logic, next-state logic and state
register as described in Equation 5.15. Wire/load delay takes into account the
global wiring delay and the overall driven load. The first and second experiments
are carried out in a 3µm CMOS technology [GDT89], while the third experiment
uses a 1.5µm CMOS technology [VTI88].

In the first and second experiments, we have tested our control timing models
on four synthesized designs of the elliptical filter benchmark with 2 adders and a
2-stage pipelined multiplier. The delay calculation is based on a 16-bit datapath
and a 3µm CMOS technology. All four designs are scheduled in 19 control steps
but with different utilization of registers and muxes (Figure 5.17): (1) design A
contains 10 registers and 36 mux-inputs, (2) design B contains 11 registers and 28
mux-inputs, (3) design C contains 12 registers and 26 mux-inputs, and (4) design
D contains 13 registers and 23 mux-inputs.

Using the layout area model described in Section 5.2, the elliptic-filter benchmark
is laid out in three blocks: a control unit, a datapath and a 2-stage pipelined
multiplier (macrocell). The delay of the control unit is obtained by running the
GDT simulator. The delays of the datapath and the multiplier are obtained directly
from the library. All of the delay calculations take into account wiring delay.
The actual (performance-optimized designs) and estimated clock periods are shown
in Figure 5.17(a).

Figure 5.17(b) compares traditional timing-estimation schemes,
our timing models and the actual clock period. From the results in Figure 5.17(c)
we can draw the following observations. Estimators that use only the delay of
functional units provide estimates with an average error of 31.9%.
Estimators that use only unit delays in the datapath (i.e., registers, functional
units, muxes, etc.) in the clock period estimation provide estimates with an average
error of 18.2%. Estimators that obtain the clock period estimate by considering
datapath and wiring delays give results with an average error of 7.5%. However,
our timing models, which consider all constituents of a chip and technology
factors, give results with an average error of 2.7%.
[Figure 5.17(a), table of data: for each of the four elliptical filter designs with 19 control steps, 2 adders and a 2-stage pipelined multiplier (A: 10 reg, 36 mux-i/p; B: 11 reg, 28 mux-i/p; C: 12 reg, 26 mux-i/p; D: 13 reg, 23 mux-i/p), the datapath delay (ns) and the estimated and actual control delay, wire/load delay and clock period (ns).]
(a)

[Figure 5.17(b), plot "Comparison of traditional performance measures and the proposed timing models": clock period (ns, 120 to 200) for designs A to D, showing the actual timing, estimates using our timing models, estimates using datapath and wiring delays only, estimates using datapath delay only, and estimates using functional-unit delay only.]
(b)

Percent error of clock-period estimates using various performance measures

Design                   Using only       Using only       Using datapath   Using our
                         function-unit    datapath         and wiring       timing
                         delay            delay            delays           models
A (10 reg, 36 mux-i/p)   32.6%            18.7%            7.9%             2.5%
B (11 reg, 28 mux-i/p)   31.3%            17.4%            6.9%             3.4%
C (12 reg, 26 mux-i/p)   31.2%            19.1%            8.3%             1.7%
D (13 reg, 23 mux-i/p)   31.9%            17.6%            6.9%             3.1%
(c)

Figure 5.17: The clock period for four designs of the elliptical filter benchmark: (a)
table of data, (b) comparison of different timing estimation schemes, (c) percentage
error of each estimation scheme.

[Bar chart "A study of delays contributed by constituents of a chip to the total clock period": percentage of the clock period contributed by each constituent, estimated and actual, for designs A to D.]
Figure 5.18: Delay distribution of constituents of a chip.
Using data obtained in the first experiment, we derive the distribution bar
chart shown in Figure 5.18. The chart shows that the clock period comprises
delay contributed by each constituent of the chip, as follows:
• an average of 80% of the clock period is contributed by the delay in the
datapath units,
• an average of 10% of the clock period is contributed by the wiring and its
driving load, and
• an average of 10% of the clock period is contributed by the control-unit delay.
Because the elliptic-filter benchmark is a datapath-dominated design, the main
contributor to the clock period is the datapath delay. However, the
contribution of each constituent to the clock period may vary from design to design.
All values in ns

Design       Datapath   Estimated       Wire/load   Estimated      Total execution time
             delay      control delay   delay       clock period   for one iteration
A (17 c.s.)  41.0       16.8            7.0         64.8           1101.6
B (19 c.s.)  37.5       15.6            7.0         60.1           1141.9
C (21 c.s.)  37.5       18.5            6.0         62.0           1302.0
D (19 c.s.)  41.0       19.5            6.0         66.5           1263.0

c.s.: control steps

Figure 5.19: The estimated clock period and total execution time of four different
designs of the elliptical filter benchmark.
In the third experiment, we have tested our timing models on four synthesized
designs of a 16-bit elliptical filter benchmark with four different schedules
and design styles (Figure 5.19): (1) design A with 17 control steps, 3 adders, 2
multipliers, 10 registers and 34 mux-inputs, (2) design B with 19 control steps, 2
adders, 2 multipliers, 11 registers and 28 mux-inputs, (3) design C with 21 control
steps, 2 adders, 1 multiplier, 10 registers and 25 mux-inputs, and (4) design D
with 19 control steps, 2 adders, one 2-stage pipelined multiplier, 10 registers and
28 mux-inputs. The delay computation in this experiment is based on a 1.5µm
technology [VTI88].

The results in Figure 5.19 show the total execution time (for one iteration) of
the four designs. We can use these estimates to guide the selection of designs that will
satisfy a given performance constraint. For instance, if the performance constraint
is 1000ns, there are two designs, design A and design B, that can satisfy the
constraint. On the other hand, if the performance constraint is 1600ns, all four
designs satisfy the constraint.
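Constraint-driven selection of this kind can be sketched in Python using the delay components listed in Figure 5.19. The names `DESIGNS`, `execution_time` and `feasible` are hypothetical, and the sketch assumes that the execution time of one iteration is the clock period multiplied by the number of control steps, which is consistent with the tabulated values for designs A, B and C.

```python
# name: (control steps, datapath delay, control delay, wire/load delay), delays in ns
DESIGNS = {
    "A": (17, 41.0, 16.8, 7.0),
    "B": (19, 37.5, 15.6, 7.0),
    "C": (21, 37.5, 18.5, 6.0),
    "D": (19, 41.0, 19.5, 6.0),
}


def execution_time(steps, datapath, control, wire_load):
    """Clock period (sum of the three delay parts) times the number of control steps."""
    clock_period = datapath + control + wire_load
    return steps * clock_period


def feasible(designs, limit_ns):
    """Names of the designs whose one-iteration execution time meets the limit."""
    return sorted(name for name, parts in designs.items()
                  if execution_time(*parts) <= limit_ns)
```

With an illustrative limit of 1200 ns the sketch returns designs A and B, and with 1600 ns it returns all four. For design D the product comes to 1263.5 ns, which differs slightly from the tabulated 1263.0 ns.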
5.5 Conclusions 
In this chapter, we have presented layout models for area and performance
measures for behavioral synthesis. Since the datapath area model takes into
account most technology factors, including layout architecture, component library,
placement and routing, our preliminary results show high fidelity and accuracy
for datapath area estimates. Similarly, the datapath timing model
predicts the datapath delay well.
In the control-unit area and timing models, since we ignored the impacts of 
logic optimization, the simple-minded bounds may not be adequate when more 
accurate estimates are needed. Better estimates can be obtained by generating 
more accurate netlists for the control logic and by modeling placement and routing 
algorithms more accurately. 
One approach is to generate a large netlist for the control-state table and
perform a limited technology mapping by decomposing each AND or OR gate
into a series of gates from the given library. Placement can be modeled by
probabilistic distributions of pin positions and wire lengths. For example, Kurdahi
and Parker [KuPa89] assume a uniform distribution of pins across R rows of cells
and a geometric distribution of the wire length, with the average wire length obtained
experimentally by running many examples. Routing algorithms are approximated
by computing routing density across each channel.
To obtain even better estimates, more accurate models of the placement and
routing algorithms must be used. Pedram and Preas [PePr89] model a placement
algorithm that minimizes the sum over all nets of the half-perimeter length of the
rectangle enclosing the pins of each net. They also model global routing by approximating
a minimal rectilinear Steiner tree for connecting the pins on each net, and channel
routing by approximating the left-edge algorithm. Instead of modeling placement
and routing algorithms, some linear algorithms can be used to obtain even more
accurate estimates. Zimmermann [Zimm88] uses the well-known min-cut algorithm
[FiMa82] to quickly generate an acceptable floorplan from a netlist of components
with known area aspect ratios. In addition, Kurdahi and Ramachandran [KuRa91]
combine the analytical and constructive methods to provide fast and accurate area
estimates for standard cell layouts.
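The half-perimeter measure mentioned above is easy to state in code. The following Python sketch computes it for a set of nets whose pin coordinates are already known; the function names and the data layout are hypothetical, and the sketch shows only the cost being minimized, not any particular placement algorithm.

```python
def half_perimeter(pins):
    """Half-perimeter of the bounding rectangle of a list of (x, y) pin positions."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wire_length(nets):
    """Sum of half-perimeter lengths over all nets (dict: net name -> pin list)."""
    return sum(half_perimeter(pins) for pins in nets.values())
```

For a two-pin net the half-perimeter equals the Manhattan distance between the pins, so this measure generalizes the inter-block length estimate of Section 5.3.4 to multi-pin nets.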
The main drawback of obtaining high-accuracy estimates from RT schematics
is that it greatly increases the computational complexity. However, long
estimation times are not desirable in behavioral synthesis. If the estimate
turnaround time is the main concern, then a simple and fast area measure should be
used, and the fidelity of the estimates is more important than their accuracy.

In essence, "accuracy" and "fidelity" are the two main factors in quality measures.
In order to improve accuracy, the impacts of control-logic optimization and
floorplanning need to be studied further. Furthermore, high-fidelity and fast quality
measures are needed to support behavioral synthesis, and extensive empirical
study is necessary before any model can be objectively established.

Chapter 6 
A Unified Model for Behavioral
Synthesis
In order to incorporate layout information into behavioral synthesis, a unified
model is needed that reflects both the behavior and the structure of the design.
Using such a model, synthesis tools can retrieve layout information at any design
stage and use this information to guide the design process.

This chapter presents a unified model, a structural representation formed by
grouping nodes and edges of a control/data flow graph (CDFG). This model
encapsulates both the behavior (CDFG) and the structure of the design (Figure 6.1);
that is, the alternatives of the design are represented using different graph
configurations (e.g., different structural designs) that still encapsulate the same behavior.
Two graph formations for two different target architectures, a point-to-point
(random topology) datapath with a one-phase clock, and a multi-bus (linear topology)
datapath with a two-phase clock, are presented.
[Diagram labels: Behavior description, Structure, Datapath, Control section, Layout, Path2.]
Figure 6.1: A unified model for behavioral synthesis.
Using the proposed unified model and the layout model discussed in Chapter
5, the synthesis tool can evaluate physical information during the behavioral synthesis
process (Path1). A layout-driven unit-binding approach using the proposed
unified model and the layout model is discussed in Chapter 7. Furthermore, using
this model, synthesis tools can feed back the physical information to guide the
design process (Path2). A feedback-driven approach for clock-cycle estimation is
also discussed in Chapter 7.

The remainder of this chapter is organized in the following manner. Section
6.1 describes the control/data flow graph (CDFG) representation. Section 6.2
presents the structural graph model. Sections 6.3 and 6.4 describe the relationship
between the graph and the structure, including datapath formation, control-unit
formation and chip formation for two different datapath architectures (random
and linear topologies) and clocking schemes (one-phase and two-phase). Section
6.5 presents a unified view from system to module using the proposed unified
model. Finally, Section 6.6 summarizes the proposed unified model.
6.1 CDFG: Control/Data Flow Graph 
A behavioral description is usually converted into a hierarchical CDFG
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$. $\mathcal{V} = \{V_i \mid i = 1..n\}$ denotes a set of behavioral supernodes.
$type(V_i) \in \{op, if, case, for, while, join\}$ denotes the type of the supernode, an
operational or a control supernode. Let $G = (V, E)$ denote a data-flow graph (DFG),
where $V = \{v_j \mid j = 1..m\}$ represents a set of operational nodes and
$E = \{e_{jm} \mid \{v_j, v_m\} \in V\}$ represents a set of data-dependency edges.
$type(v_j) \in \{mult, add, sub, shift, comp, ..\}$ denotes the type of the operational nodes.

Each control supernode, such as if, case, for, while or join, consists of a DFG
that determines the conditional branch decisions, while each operational supernode
consists of a DFG performing data computations. Figure 6.2(b) shows a CDFG
that corresponds to the VHDL program in Figure 6.2(a). This CDFG consists of 10
supernodes in which $type(V_1) = \{for\}$, $type(V_3) = \{if\}$, $type(V_4) = \{case\}$,
$type(V_5) = \{while\}$, $type(V_{10}) = \{join\}$ and $\{type(V_i) \mid i = 2, 6, 7, 8, 9\} = \{op\}$.

$\mathcal{E} = \{e_{ik} \mid \{V_i, V_k\} \in \mathcal{V}\} \cup \{e_{jm} \mid \{v_j, v_m\} \in V\}$ denotes a set of edges
of three types: control edges, data edges and timing edges; that
is, $type(e_{ik}) = \{control, data, timing\}$. A control edge indicates the mutually
exclusive branch conditions of control supernodes. A data edge denotes the
data dependency between nodes. Each edge $e_{jm}$ corresponds to a variable $var(e_{jm})$.
Finally, a timing edge indicates the timing constraint between two supernodes
or two operational nodes. For example, $e_1$ in Figure 6.2 represents the timing
constraint between supernodes $V_4$ and $V_{10}$. Similarly, $e_2$ represents the timing
constraint between two operational nodes in $V_i$.
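One possible in-memory form of the hierarchical CDFG is sketched below in Python. The class names, field names and the use of dictionaries and tuples are assumptions made for illustration; the sketch only captures the two-level structure (supernodes that each hold a DFG) and the three edge types named in the text.

```python
from dataclasses import dataclass, field

CONTROL_TYPES = {"if", "case", "for", "while", "join"}
EDGE_TYPES = {"control", "data", "timing"}


@dataclass
class DFG:
    """Data-flow graph held inside one supernode."""
    nodes: dict = field(default_factory=dict)  # op-node name -> type, e.g. "add"
    edges: dict = field(default_factory=dict)  # (src, dst) -> variable, i.e. var(e_jm)


@dataclass
class Supernode:
    kind: str                                  # "op" or one of CONTROL_TYPES
    dfg: DFG = field(default_factory=DFG)

    @property
    def is_control(self):
        return self.kind in CONTROL_TYPES


@dataclass
class CDFG:
    supernodes: dict = field(default_factory=dict)  # index -> Supernode
    edges: list = field(default_factory=list)       # (src, dst, edge type)

    def add_edge(self, src, dst, kind):
        """Record an edge between two supernodes after checking its type."""
        if kind not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {kind}")
        self.edges.append((src, dst, kind))
```

For the ten-supernode example, supernodes 1, 3, 4, 5 and 10 would be created with kinds "for", "if", "case", "while" and "join", the remaining five with kind "op", and the timing constraint between supernodes 4 and 10 would be recorded with `add_edge(4, 10, "timing")`.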
6.2 The Supergraph Model 
Let $H = (\mathcal{V}, \mathcal{E})$ denote a supergraph. $\mathcal{V} = \{V_i \mid i = 1..n\}$ denotes a set of
structural supernodes. Each structural supernode represents a particular structural
component such as a functional unit, a storage unit, an I/O port or a control
unit. There are four types of structural supernodes, I/O port, storage, functional
unit and control, that are specified by $type(V_i) = \{I/O, MEM, FU, CNT\}$. Let $f(V_i)$
denote the function of the supernode $V_i$. For instance, if $V_i$ is a storage supernode
then $V_i$ can be a latch, a flip-flop, a register, a register file, a RAM or a
ROM, that is, $f(V_i) = \{latch, ff, reg, reg\_file, RAM, ROM\}$. Similarly, if $V_i$ is an I/O
supernode then $f(V_i) = \{in, out, in\_out\}$, if $V_i$ is a functional-unit supernode then
$f(V_i) = \{adder/subtracter, ALU, counter, shifter, ..\}$, and if $V_i$ is a control supernode
then $f(V_i) = \{op, if, case, for, while, join\}$.

entity example is
  port( ... )
end example

architecture body of example is
  signal I: BIT(0 to 3);
begin
  process begin
    for I in 0 to 5 loop
      {OP1};
    end loop;
    if (X < 10)
      case Y:
        1: {OP2};
        2: {OP3};
        3: {OP4};
      end case;
    else
      while Y < 10 loop
        {OP5};
      end loop;
    end if;
  end process;
end body;
(a)

[Part (b) shows the corresponding CDFG; its legend distinguishes control edges, data edges and timing edges.]
(b)

Figure 6.2: A hierarchical control/data flow graph representation: (a) a VHDL
program, (b) the corresponding CDFG.
Storage supernodes represent a set of storage components for storing a set
of variables (data-dependency edges). Each storage supernode consists of a set
of ports that depends upon the function of the supernode and its corresponding
structural component. For instance, if a storage supernode $V_i$ denotes a latch with
a load control, then $V_i$ consists of three ports: in, out and load. On the other hand,
if a storage supernode $V_i$ denotes a register file with one read port, one write port
and one address port, then $V_i$ consists of three ports: read, write and address.

Since each latch, flip-flop or register is a single storage unit, if a storage
supernode $V_i$ is a latch, a flip-flop or a register, that is, $f(V_i) = \{latch, ff, reg\}$, then
$V_i$ contains a set of variables, that is, $\{var(e_{jm}) \mid \{v_j, v_m\} \in V\} \in V_i$. On the
other hand, a register file consists of an array of storage units. Let $reg\_cell$ denote
a storage unit in a register file. $V_i = \{reg\_cell_k \mid k = 1..n\}$ denotes a register-file
supernode $V_i$ with n storage units in which each unit contains a set of variables.
Further, a RAM or a ROM consists of a two-dimensional array of storage units.
Let $mem\_cell$ denote a storage unit in a RAM or a ROM.
$V_i = \{mem\_cell_{jk} \mid j = 1..n, k = 1..m\}$ denotes a RAM or a ROM $V_i$ with $n \times m$
storage units in which each unit contains a set of variables.
116 
Each functional-unit supernode represents a particular functional unit
containing a set of operational nodes that can be executed by this functional
unit. Each functional-unit supernode also consists of a set of ports that depends
upon the function of the supernode and its corresponding structural component.
For instance, if a functional-unit supernode $V_i$ denotes an ALU with eight
functions, then $V_i$ consists of four ports: in1, in2, out and select.
As discussed in Section 6.1, each supernode in the CDFG contains a DFG. 
Control supernodes correspond to the behavioral supernodes of the CDFG. Each 
control supernode contains the schedule of the DFG in a behavioral supernode. 
Let $t_j$ denote time step $j$. $V_i = \{t_j \mid j = 1..n\}$ denotes a control
supernode $V_i$ with an $n$-time-step schedule. Each time-step node consists of a
set of operational nodes that are executed in this time step, that is,
$t_j = \{v_i \in V\}$.
$E = \{e_{ij} \mid \{V_i, V_j\} \in V\}$ denotes a set of superedges. There are
two types of superedges, control and data, that are specified by
$e_{ij} = \{cnt, data\}$. In addition, $w(e_{ij})$ represents the weight of
$e_{ij}$. Each data superedge represents the physical connection between two
supernodes, while control superedges represent the control flows and branch
conditions. The weight of a data superedge is the number of variables
communicated between the two supernodes. In addition, the superedge direction
depends on the direction of control/data flow between the supernodes. For
example, a superedge $e_{12}$ connected from $V_1$ to $V_2$ is an outgoing
superedge of $V_1$ and an incoming superedge of $V_2$. Since certain supernode
(functional unit) inputs are non-commutable, each superedge uses a flag to
indicate the input position of the connecting supernode.
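The supernode and superedge definitions above can be summarized as a small data
structure. The following is a minimal Python sketch and not the dissertation's
implementation: the class names, field names, helper functions and the example
register-to-adder connection are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Supernode:
    name: str                      # e.g. "V4"
    kind: str                      # "storage", "io", "functional" or "control"
    function: str                  # one value of f(V_i), e.g. "reg" or "ALU"
    ports: tuple = ()              # port names depend on the component
    members: set = field(default_factory=set)  # folded variables or operations


@dataclass
class Superedge:
    src: str                       # name of the source supernode
    dst: str                       # name of the destination supernode
    etype: str = "data"            # "data" or "cnt" (control)
    dst_input: str = ""            # flag: which input of dst is driven
    variables: set = field(default_factory=set)

    @property
    def weight(self) -> int:
        # w(e_ij): number of variables carried between the two supernodes
        return len(self.variables)


def outgoing(edges, name):
    """Superedges leaving the supernode called `name`."""
    return [e for e in edges if e.src == name]


def incoming(edges, name):
    """Superedges entering the supernode called `name`."""
    return [e for e in edges if e.dst == name]


# Hypothetical example: a register feeding the first input of an adder.
reg = Supernode("V4", "storage", "reg", ("in", "out", "load"), {"a", "d", "h"})
add = Supernode("V9", "functional", "adder", ("in1", "in2", "out"), {"add1"})
e49 = Superedge("V4", "V9", "data", "in1", {"a", "d"})
```

Storing the carried variables on the edge, rather than a bare count, keeps the
weight consistent with the variables folded into the supernodes.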
6.2.1 Supergraph Formation 
We are given a CDFG, a set of functional units and storage units, the schedule 
and the variable/operation assignments (the operation/variable binding algorithm 
is discussed in the next chapter). The supergraph-formation algorithm folds the 
CDFG into a supergraph in two steps: supernode formation and superedge for-
mation. Using the FSMD model described in Chapter 3, we divide the graph into 
two parts, datapath and control unit. 
For the datapath, the supergraph-formation algorithm maps the input/output 
ports, functional and storage units to a set of structural supernodes. Then, the 
algorithm maps variables and operators to their corresponding supernodes. For 
example, Figures 6.3(a) and (b) show a VHDL program and its corresponding 
CDFG. Using an adder, a multiplier and a comparator, a scheduler partitions this
CDFG into 6 time steps. In addition, the allocator assigns three registers for
storing variables: variables {a, d, h}, {b, e, g} and {c, f, k} are stored in
registers R1, R2 and R3, respectively.
In the first step, the supergraph-formation algorithm first maps the
comparator, multiplier and adder to supernodes V7, V8 and V9, registers R1, R2
entity example is
    port (a: in BIT(0 to 3);
          b: in BIT(0 to 3);
          c: in BIT(0 to 3);
          g: out BIT(0 to 3);
          h: out BIT(0 to 3);
          k: out BIT(0 to 3))
end example
architecture body of example is
begin
    process begin
        if (a > b)
            h = (a+b)x((a+b)xc);
            g = (b+c)+((a+b)xc);
        else
            k = a+b;
        end if;
    end process;
end body;
(a)
[Panels (b) and (c) not reproduced. Panel (c) is divided into a Datapath section
and a Control section; its legend distinguishes external signals from internal
signals.]
Figure 6.3: The graph formation: (a) a VHDL program, (b) the CDFG, (c) the
supernode formation.
[Panels (a) to (d) not reproduced. Legend: dashed lines denote dependency edges
and solid lines denote hyperedges. Panel (c) is labeled "Commute and merge".]
Figure 6.4: Superedge merging
and R3 to supernodes V4, V5 and V6, and input/output ports a, b, c, h, g, k to
supernodes V1, V2, V3, V10, V12 and V13, respectively. In the second step, the
supergraph-formation algorithm folds variables and operations into their
corresponding supernodes. Figure 6.3(c) shows the supergraph generated from the
CDFG shown in Figure 6.3(b). The set of operational nodes {comp} is assigned
to V7, {mult1, mult2} to V8, and {add1, add2, add3, add4} to V9. In addition,
the variables are assigned to supernodes V4, V5 and V6, such that V4={a,d,h},
V5={b,e,g} and V6={c,f,k}.
In the superedge-formation step, the algorithm maps dependency edges to
superedges one at a time. If a dependency edge can share the same path with
an existing superedge, then the edges can be merged into a single superedge;
otherwise a new superedge has to be created. Consider Figure 6.4(a). If edges a
and b are connected from variables z3 and z4 in Reg3 to the right input of op1
and op2, then edge a is mapped to the superedge e1 and edge b is mapped to the
superedge e2. Since e1 and e2 are connected to the right input of FU (i.e.,
sharing the same signal path), they can be merged. Figure 6.4(b) shows that edge
a connects to the right input of op1 and edge b connects to the left input of
op2. Thus, edge a is mapped to the superedge e2 connected to the right input of
the supernode (FU), while edge b is mapped to the superedge e1 connected to the
left input of the supernode (FU). If the inputs of this supernode are not
commutable, then these two superedges cannot be merged. Otherwise, these two
superedges, e1 and e2 (Figure 6.4(c)), can be merged by commuting edges b and c.
Figure 6.4(d) shows that a register supernode has only one input. This means that
incoming superedges from the same functional-unit supernode can be merged.
Applying the superedge formation to the example of Figure 6.3(c) results in the
final supergraph shown in Figure 6.5(a).
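The merging rule can be illustrated with a short greedy sketch in Python. It is
an assumption-laden simplification rather than the procedure used in the
dissertation: each operation bound to one functional unit is described only by
the sources of its left and right operands, and for a commutable unit the
orientation that creates the fewest new superedges is kept. The function name
and the register names are hypothetical.

```python
def merge_superedges(ops, commutable=True):
    """Fold the operand edges of `ops` into superedges of one functional unit.

    ops: list of (left_source, right_source) pairs, one per operation.
    Returns the set of (source, input) superedges entering the unit.
    """
    edges = set()
    for left, right in ops:
        straight = {(left, "left"), (right, "right")}
        options = [straight]
        if commutable:
            # Swapping both operands is legal only for commutable inputs.
            options.append({(left, "right"), (right, "left")})
        # Keep the orientation that adds the fewest new superedges.
        best = min(options, key=lambda option: len(option - edges))
        edges |= best
    return edges


# Hypothetical case: op1 reads (Reg1, Reg3) and op2 reads (Reg3, Reg2).
ops = [("Reg1", "Reg3"), ("Reg3", "Reg2")]
merged = merge_superedges(ops, commutable=True)
unmerged = merge_superedges(ops, commutable=False)
```

With commutable inputs the two edges from Reg3 end up on the same input and
share one superedge, so three superedges remain instead of four.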
For the control unit, the supernodes in the CDFG correspond one-to-one with 
the control supernodes. Each control supernode consists of a set of time steps, and 
each time step consists of a set of operations that can be executed in the same 
time step. For example, the operational node comp in V2 is assigned to the time 
step t2. Thus, the control supernode V15 consists of a time step t2 which contains 
the operational node comp. In addition, the control superedges correspond to the 
control edges in the CDFG. 
Algorithm 6.1 describes the supergraph formation. The input to the algorithm
includes a CDFG, a set of resources, a schedule and the operation/variable
binding result (the binding procedure is discussed in Chapter 7). The procedure
supernode_mapping() maps the given resources to a set of supernodes. The
procedure var_op_mapping() assigns the operations and variables to their
corresponding supernodes. The function superedge_merge_check(e) returns "true"
if the given edge e can be merged with an existing superedge; otherwise it
returns "false". The procedure insert_new_superedge(e) creates a new superedge
for edge e. The procedure control_step_assignment() assigns operations to the
time steps of the
Figure 6.5: Supergraph formation (cont.): (a) the superedge formation, (b) the
structural netlist.
control supernodes according to the given schedule. The algorithm first forms the 
datapath section of the supergraph. Then, the algorithm forms the control section 
of the supergraph. 
Algorithm 6.1. Supergraph Formation.
Let
    N be a set of i/o ports, storage units and functional units;
    S and P be the schedule and operation/variable assignments;
    G = {V, E} be a CDFG;
    H = {V, E} be a supergraph;
Supergraph_Formation(G, N, S, P) {
    /* datapath supergraph formation */
    V = supernode_mapping(N);
    /* variables and operations mapping */
    H = var_op_mapping(G, S, P, V);
    /* superedge mapping */
    for (∀e ∈ E) do
        mergeable = superedge_merge_check(e);
        if (mergeable == FALSE) then
            insert_new_superedge(e);
        endif
    endfor
    /* control-unit supergraph formation */
    for (∀v ∈ V) do
        V' = supernode_mapping(v);
        V = V ∪ V';
    endfor
    /* control-step assignment */
    H = control_step_assignment(G, H, S);
    return(H);
}
Complexity analysis. The complexity analysis of the supergraph formation is
described as follows:
1. Supernode mapping takes O(V) time. 
2. Variable and operation mapping takes O(V + E) time.
3. Superedge mapping takes O( E) time. 
4. Control-step assignment takes O( V) time. 
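The four steps of Algorithm 6.1 can be sketched in runnable form. The Python
below is a simplified illustration under stated assumptions, not the author's
code: resources are plain names, the binding is a dictionary from each variable
or operation to its resource, dependency edges are (producer, consumer, input)
triples, and two edges are merged only when they share source, destination and
input (no commuting). All identifiers and the tiny example are hypothetical.

```python
def supergraph_formation(dep_edges, resources, binding, schedule):
    # Supernode mapping: one supernode per given resource.
    supernodes = {resource: set() for resource in resources}

    # Variable and operation mapping: fold each item into its supernode.
    for item, resource in binding.items():
        supernodes[resource].add(item)

    # Superedge mapping: merge edges with the same source, destination
    # and destination input; otherwise a new superedge is created.
    superedges = {}
    for producer, consumer, dst_input in dep_edges:
        key = (binding[producer], binding[consumer], dst_input)
        superedges.setdefault(key, []).append((producer, consumer))

    # Control-step assignment: group operations by scheduled time step.
    time_steps = {}
    for op, step in schedule.items():
        time_steps.setdefault(step, set()).add(op)

    return supernodes, superedges, time_steps


# Hypothetical fragment: d = a + b, with a and d sharing register R1.
resources = ["R1", "R2", "ADD"]
binding = {"a": "R1", "d": "R1", "b": "R2", "add1": "ADD"}
dep_edges = [("a", "add1", "left"), ("b", "add1", "right"), ("add1", "d", "in")]
schedule = {"add1": 3}
nodes, edges, steps = supergraph_formation(dep_edges, resources, binding, schedule)
```

Each loop visits every resource, bound item, edge or scheduled operation once,
which matches the linear per-step costs listed in the complexity analysis.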
6.3 Supergraph and Structure: I
This section describes the relationship between the supergraph and the
structure of the design based on a point-to-point datapath with a one-phase clock
target architecture. First, Section 6.3.1 describes the target architecture.
Then, Sections 6.3.2, 6.3.3, and 6.3.4 present the datapath formation,
control-unit formation and chip formation from the supergraph.
6.3.1 Point-to-Point Datapath with A One-Phase Clock 
Scheme 
The target architecture is defined as follows: 
1. Datapath. The datapath uses a point-to-point architecture, as shown in 
Figure 6.6(a). Both storage and functional units are connected in a point-
to-point topology based on the directions of data flows. 
2. Control path. The controller consists of three parts: the state register, the 
control logic and the next-state logic, as shown in Figure 6.6( a). The state 
[Figure 6.6 not reproduced. Panel (a) shows the next-state logic, the control
logic and the clock driving registers R1 and R2. Panel (b) shows a master-slave
latch and labels the two clock phases "fetch next control word + store data into
latch" and "control output + execution".]
Figure 6.6: The point-to-point datapath with one-phase clock architecture: (a)
control/data paths, (b) one-phase clocking scheme.
register stores the currently executing control-word. The control logic emits 
the control signals to direct the execution of the datapath. The next-state 
logic determines the next control-word. 
3. Clock scheme. A one-phase clock is used and all storage units are
implemented using level-sensitive master-slave (M/S) latches. When the clock is
"low" the M/S latch loads the input signal into the master latch. When the
clock goes "high" the slave latch fetches the data stored in the master latch.
Using the one-phase clock, the register transfer (RT) operation consists of
two steps: (1) when the clock is "high" the control logic emits the control
signals and the datapath executes the RT operation, and (2) when the clock
is "low" the state register latches the next control word and the resulting
data is stored into the destination latch, as shown in Figure 6.6(b).
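The master-slave behavior in item 3 can be mimicked with a few lines of Python.
This is only a behavioral sketch under simplifying assumptions (ideal levels, no
delays); the class and method names are hypothetical.

```python
class MasterSlaveLatch:
    """Level-sensitive M/S latch: master follows the input while the clock is
    low; slave copies the master while the clock is high."""

    def __init__(self):
        self.master = 0
        self.slave = 0

    def tick(self, clock, data_in):
        if clock == 0:
            self.master = data_in      # clock low: master loads the input
        else:
            self.slave = self.master   # clock high: slave fetches the master
        return self.slave              # the slave drives the latch output


latch = MasterSlaveLatch()
outputs = [
    latch.tick(0, 5),   # master loads 5, output still holds the old value
    latch.tick(1, 7),   # slave takes 5; the new input 7 is ignored
    latch.tick(0, 7),   # master loads 7, output keeps 5
    latch.tick(1, 0),   # slave takes 7
]
```

The output changes only during the high phase, which is why the datapath can
execute while the previous result is still held.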
6.3.2 Datapath Formation 
In the datapath section of the supergraph, each supernode denotes a
functional unit, a storage component or an input/output port. Each superedge
denotes a physical connection between two supernodes. We assume that a
single-level interconnect model is used. If a supernode has more than one
incoming superedge entering one of its inputs, then a selector (e.g., a
multiplexer) is needed to select the data input from different sources. For
example, Figure 6.5(b) shows the structural netlist of the supergraph in
Figure 6.5(a). The register supernodes V4, V5 and V6 are mapped to Reg1, Reg2
and Reg3, respectively. The functional-unit supernodes V7, V8 and V9 are mapped
to a comparator, a multiplier and an adder.
In addition, input supernodes V1, V2 and V3 are mapped to input ports a,
b and c, while output supernodes V10, V11, V12 and V13 are mapped to output
ports h, cond, g and k. Since all the register supernodes have more than one
incoming superedge (3 for V4 and V5, and 2 for V6), each register needs an
interconnect unit to select data inputs from different sources (Interconnect
unit1, unit2 and unit3 are connected to Reg1, Reg2 and Reg3). For the
functional-unit supernode V8, the incoming superedge e48 is shared by all of the
operational nodes (mult1 and mult2) in V8 (i.e., the left input of the
multiplier has only one data input source so that an interconnect unit is not
needed). On the other hand, an interconnect unit (Interconnect unit4) is
required for the right input of the multiplier to select inputs from two
sources. Similarly, the adder needs an interconnect unit (Interconnect unit5)
for its left input. Since the interconnect units are represented implicitly in
the supergraph, a supergraph can be viewed as a structural representation.
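Because interconnect units are implicit, they can be derived by counting the
superedges that enter each input. The Python sketch below shows one way to do
this; the function name and the edge list are assumptions for illustration and
do not reproduce the full netlist of Figure 6.5.

```python
from collections import defaultdict


def interconnect_units(superedges):
    """Return {(supernode, input): sources} for inputs that need a selector.

    superedges: iterable of (source, destination, destination_input).
    An input fed by a single superedge needs no interconnect unit.
    """
    fan_in = defaultdict(list)
    for src, dst, dst_input in superedges:
        fan_in[(dst, dst_input)].append(src)
    return {key: sorted(srcs) for key, srcs in fan_in.items() if len(srcs) > 1}


# Hypothetical fragment: a register with three sources and a multiplier whose
# left input has one source and whose right input has two.
superedges = [
    ("V1", "V4", "in"), ("V9", "V4", "in"), ("V8", "V4", "in"),
    ("V4", "V8", "left"),
    ("V5", "V8", "right"), ("V6", "V8", "right"),
]
muxes = interconnect_units(superedges)
```

The length of each source list gives the number of data inputs the selector
must provide.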
6.3.3 Control-Unit Formation
The control section of the supergraph denotes the control sequence of the
design. If a control supernode has more than one outgoing control superedge, then
this supernode is a conditional branch node. For example, the control supernode
V15 in Figure 6.7(a) is an if supernode which has two outgoing superedges.
The branch decision depends on the condition status: (1) branch to V16 when
cond==true, and (2) branch to V17 when cond==false. Each control supernode
consists of a set of time steps, in which each time step consists of a set of
operational nodes which are executed in this time step. These nodes are linked
to the operational nodes residing in the datapath section of the supergraph, as
shown in Figure 6.7(a).
(a) [Control section of the supergraph; graphic not reproduced.]

Present  Control output                                                 Status  Next
state    Register load  Interconnect unit select inputs                         state
(step)   R1  R2  R3     Unit1      Unit2      Unit3   Unit4   Unit5     cond.
                        s1 s2 s3   s1 s2 s3   s1 s2   s1 s2   s1 s2
t1       1   1   1      0  1  0    0  1  0    1  0    0  0    0  0      0       t2
t2       0   0   0      0  0  0    0  0  0    0  0    0  0    0  0      0/1     t6/t3
t3       1   0   0      1  0  0    0  0  0    0  0    0  0    1  0      0       t4
t4       0   1   1      0  0  0    1  0  0    0  1    0  1    0  1      0       t5
t5       1   1   0      1  0  0    0  0  1    0  0    1  0    0  1      0       stop
t6       0   0   1      0  0  0    0  0  0    0  1    0  0    1  0      0       stop
(b)
Figure 6.7: Control-unit formation: (a) the control section of the supergraph,
(b) the control-state table.
We formulate the control section of the supergraph as a control-state table,
which includes four parts: (1) present state, (2) control output, (3) status
input and (4) next state (Figure 6.7(b)). Using the given component library, we
can determine the control pins of each component. For example, we choose a
3-input multiplexer with 3 select inputs for Interconnect unit1. Therefore,
there are three control inputs, s1, s2 and s3, for Interconnect unit1
(Figure 6.7(b)). In addition, present state, status and next state can be mapped
directly from the supergraph. For example, time step t1 in V14 (Figure 6.7(a))
consists of three read operations (rd1, rd2, rd3) to load input data from ports
a, b and c to registers R1, R2 and R3. Consider rd1, which reads input data from
port a and stores it into the register R1 via Interconnect unit1
(Figure 6.5(b)). Therefore, the control outputs for the load input of R1 and the
select input s1 of Interconnect unit1 are set to one. Similarly, the load inputs
of R2 and R3 and the select inputs s2 and s1 of Interconnect unit2 and
Interconnect unit3 are set to one for the rd2 and rd3 operations on ports b and
c. Since this control supernode (V14) is not a branch node, the next state is
the first time step (t2) of V15 (V14's successor node). On the other hand, for
a conditional branch node V15, the next state depends on the conditional status
(row t2 of Figure 6.7(b)).
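The next-state part of the control-state table behaves like a small finite-state
machine. The Python sketch below encodes only the present-state/next-state
columns as read from Figure 6.7(b), on the assumption that cond = 1 leads to t3
and cond = 0 leads to t6; the dictionary layout and function name are
hypothetical.

```python
# state -> (next state when cond is 0, next state when cond is 1)
NEXT_STATE = {
    "t1": ("t2", "t2"),
    "t2": ("t6", "t3"),      # the only conditional branch (if supernode V15)
    "t3": ("t4", "t4"),
    "t4": ("t5", "t5"),
    "t5": ("stop", "stop"),
    "t6": ("stop", "stop"),
}


def run(cond, start="t1"):
    """Return the sequence of time steps visited for a given condition."""
    trace, state = [], start
    while state != "stop":
        trace.append(state)
        state = NEXT_STATE[state][1 if cond else 0]
    return trace


then_branch = run(cond=True)
else_branch = run(cond=False)
```

Non-branch states carry the same successor in both positions, so the status
input only matters in row t2.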
Figure 6.8: Chip formation: (a) the supergraph, (b) the chip structure.
Figure 6.9: Chip formation with multiple datapaths: (a) the supergraph, (b) the
chip structure.
6.3.4 Chip Formation 
Using the datapath and control-unit formation techniques described in the
previous two sections, we can directly map the supergraph into the FSMD chip
architecture. Using the Figure 6.3 example, the final supergraph in
Figure 6.8(a) is divided into four parts: Datapath section, Control section,
Input ports and Output ports. Figure 6.8(b) shows the chip formation of the
supergraph in Figure 6.8(a). Each section of the supergraph is mapped to a
particular section of the chip. The Datapath section is mapped to a Datapath
that can be implemented using a bit-sliced stack or standard cells, while the
Control section is mapped to a Control unit that can be implemented using a PLA
or standard cells. Each port supernode is mapped to a chip pin that consists of
an I/O pad and a pad driver. In addition, the superedges across the boundaries
of the datapath, the control unit and the input/output ports are mapped to the
routing area of the chip.
Similarly, the chip formation can be applied to the FSMD model with multiple
datapaths. Figure 6.9(a) shows the supergraph in which the datapath section
is divided into two subsections, DP1 and DP2. In the chip formation, each
datapath subsection of the supergraph is mapped to a datapath on the chip, e.g.,
DP1 is mapped to Datapath1 while DP2 is mapped to Datapath2, as shown in
Figure 6.9(b).
6.4 Supergraph and Structure: II
This section describes the relationship between the supergraph and the
structure of the design based on a multi-bus datapath with a two-phase clock
target architecture. First, Section 6.4.1 describes the target architecture.
Then, Section 6.4.2 presents the datapath, control-unit and chip formations from
the supergraph.
6.4.1 Multi-bus Datapath with A Two-Phase Clock Scheme 
The target architecture is defined as follows: 
1. Datapath. The datapath uses a multi-bus architecture, as shown in
Figure 6.10(a). Both storage and functional units are connected to the buses.
Tri-state buffers are inserted between units and buses. Storage units, such as
registers, latches and flip-flops, are grouped into multi-port register files.
2. Control path. The controller consists of three parts: the state register, the
control logic and the next-state logic, as shown in Figure 6.10(a). The state
register stores the currently executing control-word. The control logic emits
the control signals to direct the execution of the datapath. The next-state
logic determines the next control-word.
3. Clock scheme. A two-phase nonoverlapping clock is used and all storage units
are implemented using level-sensitive latches. Typically, the RT operation
[Figure 6.10 not reproduced. The timing diagram in panel (b) labels four events:
1. Load state register. 2. Load control signals into control register. 3. Read
operands from the register file and execute. 4. Execute, write the result back
to the register file and load the status.]
Figure 6.10: The multi-bus datapath with a two-phase clock architecture: (a)
control/data paths with two-pipe stage and three-pipe stage with latch insertion
(dashed boxes), (b) two-phase-clock/two-pipe-stage scheme.
is divided into five steps: (1) load the control word into the state register,
(2) decode the control signals, (3) read data from register files, (4) execute
the RT operation, and (5) store the result into register files. Using such a
two-phase clock scheme, the RT operations can be implemented in two pipe
stages. In the first pipe stage, the state register fetches the control word
at $\phi_1$ and the control register fetches the control signals at $\phi_2$,
as shown in Figure 6.10(b). In the second pipe stage, the functional units read
operands from the register files and execute the operations at $\phi_1$, then
continue the execution and write the result back to the register files and the
status register at $\phi_2$. For simplicity, in this section we use the
two-phase-clock/two-pipe-stage architecture. Further improvement (three-pipe
stage) can be achieved by inserting latches before and after the functional
units (Figure 6.10(a)).
6.4.2 Datapath/Control-Unit/Chip Formation 
In the datapath section of the supergraph, each supernode denotes a
functional unit, a storage component or an input/output port. Each superedge
represents a physical connection between a port of the supernode and a bus. All
the superedges that share a common port can share the same bus. In the remainder
of this section, I use the Figure 6.3 example to describe the
datapath/control-unit/chip formation from the supergraph.
Cycle  Phase  Operation1 | Operation2 | Operation3 (concurrent operations)
1      1      Load state register
       2      Load control signals
2      1      Read external signals a, b and c | Load state register
       2      Store signals a, b and c into RF | Load control signals
3      1      Read variables a and b from RF; Comp(a,b) => "cond"
       2      Store "cond" into status register
4      1      "cond" = true: Load state register | "cond" = false: Load state register
       2      Load control signals | Load control signals
5      1      Read variables a and b from RF; Add(a,b) => d | Read variables a and b from RF; Add(a,b) => k | Load state register
       2      Store d into RF | Store k into RF | Load control signals
6      1      Read variables b, c and d from RF; Mult(c,d) => e, Add(b,c) => f | Load state register
       2      Store e and f into RF | Load control signals
7      1      Read variables d, e and f from RF; Mult(d,e) => h, Add(e,f) => g
       2      Store g and h into RF
Figure 6.11: The schedule of the CDFG example in Figure 6.3(b).
Figure 6.12: Datapath formation: (a) the supergraph, (b) the structural netlist.
Figure 6.13: The final supergraph.
Figure 6.14: Chip formation.
Using the proposed target architecture and clocking scheme, the schedule of
the CDFG example in Figure 6.3(b) is shown in Figure 6.11. The unit/storage
binding result is the same as described in Section 6.3 (Figure 6.5(a)) except
that the registers are grouped into a register file. This register file consists
of four ports (one read-only, one write-only and two read/write ports) and
contains three register cells, as shown in Figure 6.12(a). Figure 6.12(b) shows
the structural netlist of the supergraph in Figure 6.12(a). The supernode V7 is
mapped into a register file that consists of three register cells, Reg_cell1,
Reg_cell2 and Reg_cell3, and four ports, R, R/W1, R/W2 and W. The
functional-unit supernodes V8, V9 and V10 are mapped to a comparator, a
multiplier and an adder. In addition, input supernodes V4, V5 and V6 are mapped
to input ports a, b and c, while output supernodes V1, V2, V3 and V11 are mapped
to output ports h, g, k and cond. Using the multi-bus architecture, all the
superedges that share the same connection are grouped into a bus. For instance,
superedges e1, e2, e3, e4 and e5 are mapped to wires w1, w2, w3, w4 and w5 that
are connected to the bus Bus3 sharing the common source (R port of the register
file) via the wire w s.
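Grouping superedges into buses amounts to collecting the edges that leave the
same source port. The Python sketch below illustrates the idea; the function
name, the port labels and the destinations are assumptions, since the full
netlist of Figure 6.12 is not reproduced here.

```python
from collections import defaultdict


def group_into_buses(superedges):
    """Group superedges by the port they originate from.

    superedges: iterable of (edge_name, source_port, destination).
    Returns {source_port: [edge names sharing one bus]}.
    """
    buses = defaultdict(list)
    for name, source_port, _destination in superedges:
        buses[source_port].append(name)
    return dict(buses)


# Hypothetical edges: e1..e5 all leave the R port of the register file.
superedges = [
    ("e1", "RF.R", "comparator"), ("e2", "RF.R", "multiplier"),
    ("e3", "RF.R", "adder"), ("e4", "RF.R", "port_h"),
    ("e5", "RF.R", "port_g"), ("e6", "RF.R/W1", "adder"),
]
buses = group_into_buses(superedges)
```

Each key then corresponds to one bus, and each listed edge to one wire tapped
off that bus.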
Using the control-unit formation described in Section 6.3.3, we can form the
final supergraph, as shown in Figure 6.13. Similarly, a control-state table can
be derived directly from the supergraph as described in Section 6.3.3.
Using the datapath and control-unit formation methods, we can directly map
the supergraph into the FSMD chip architecture, as described in Section 6.3.4.
For instance, the final supergraph in Figure 6.14(a) is divided into five parts:
Datapath section, Control section, Register file, Input ports and Output ports.
Figure 6.14(b) shows the chip formation of the supergraph in Figure 6.14(a).
6.5 Extension: A Unified View From System 
To Module 
In the previous two sections, we have shown that the supergraph is a
structural representation formed by grouping nodes and edges of a CDFG. In the
supergraph, a functional-unit/storage/port supernode represents a structural
component (e.g., a register, a functional unit or a port) containing a cluster
of operational nodes or dependency edges in a CDFG. A control supernode is a
supernode containing a cluster of operational nodes that can be executed in the
same control step. The alternatives of the design are represented using
different supergraph configurations (e.g., different structural designs) that
still encapsulate the same CDFG (i.e., behavior).
For instance, in the Figure 6.3(c) example, we can group operation
comp with other operations {add1, add2, add3, add4} in supernode V9 if we replace
V9 with an ALU, as shown in Figure 6.15(a). We can also partition variables
e, b and g in supernode V5 into two groups, e and b to V5, and g to V20.
Figure 6.15: Hypergraph modification: (a) the supergraph, (b) the structure.
This regrouping results in an additional register (R4) but the number of
inputs of Interconnect unit2 is reduced from 3 to 2, as shown in Figure 6.15(b).
Furthermore, since the supergraph encapsulates both the behavior and the
structure, we can retrieve layout information from any node and edge in the
CDFG. For example, using the structure of Figure 6.5(b), the delay of the path
from edge a, via add1, to edge d (Figure 6.3(b)) is the sum of the propagation
delays of Reg1, Interconnect unit5, adder "+" and Interconnect unit1, and the
set-up delay of Reg1. On the other hand, the same path delay is equal to the sum
of the propagation delays of Reg1, Interconnect unit5, ALU and Interconnect
unit1, and the set-up delay of Reg1 when the structure in Figure 6.15(b) is
used.
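The path-delay computation is a plain sum over the components on the path plus
the set-up delay of the destination register. The Python sketch below works
through both structures; every delay value is a made-up placeholder in arbitrary
time units, and the function and dictionary names are hypothetical.

```python
def path_delay(path, propagation_delay, setup_delay):
    """Sum of propagation delays along `path` plus the destination set-up."""
    return sum(propagation_delay[component] for component in path) + setup_delay


# Placeholder delays (arbitrary units, not taken from any component library).
PROPAGATION = {
    "Reg1": 2,
    "Interconnect unit5": 1,
    "adder": 8,
    "ALU": 10,
    "Interconnect unit1": 1,
}
REG1_SETUP = 1

adder_path = ["Reg1", "Interconnect unit5", "adder", "Interconnect unit1"]
alu_path = ["Reg1", "Interconnect unit5", "ALU", "Interconnect unit1"]

delay_with_adder = path_delay(adder_path, PROPAGATION, REG1_SETUP)
delay_with_alu = path_delay(alu_path, PROPAGATION, REG1_SETUP)
```

Only the functional-unit term differs between the two structures, so with these
placeholder numbers the difference in path delay equals the difference between
the ALU and adder delays.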
So far, I have shown the module/chip view of the CDFG using the supergraph
model. In the remainder of this section, I present the extension of the
supergraph model to provide a unified view from system to module. Typically, a
system consists of a number of boards (or multi-chip modules (MCMs)) that are
connected with cables. Each board or MCM contains a set of chips connected with
wires. Each chip contains a set of modules including datapaths, control units,
memories and I/O ports. Figure 6.16(a) shows the system-to-module hierarchy.
Figure 6.16(b) shows the supergraph representation.
We can easily extend the supergraph model by adding chip supernodes that
consist of a set of modules (i.e., a set of functional-unit, control, storage
and port supernodes) and board supernodes that consist of a set of chip
supernodes. The
[Figure 6.16 not reproduced. Panel (a) shows the hierarchy System, Board (or
MCM), Chip and Module, where modules include control units, datapaths and
memories.]
Figure 6.16: A unified view: (a) the system hierarchy, (b) a system to module
view.
superedge connecting two supernodes in the same cluster represents the internal
connection in the same unit (e.g., module, chip or board), while the superedge
connecting two supernodes in different clusters represents the external
connection between two units. For instance, the superedge e3 represents the
internal connection in Module2, while the superedges e1 and e2 represent the
external connections between (Chip1, Chip2) and (Chip1, Chip4), respectively.
Since Chip1 and Chip2 are in the same board cluster (Board1), e1 represents the
external connection between Chip1 and Chip2 on the same board. On the other
hand, since Chip1 and Chip4 are in different board clusters (Board1 and Board2,
respectively), e2 represents not only an external connection between Chip1 and
Chip4 but also an external connection between Board1 and Board2.
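The internal/external distinction follows from where the two endpoints sit in
the containment hierarchy. The Python sketch below is one possible way to
compute it, assuming both endpoints lie at the same depth under a common root;
the parent table, the module-level names ALU1 and RegA, and the function names
are hypothetical.

```python
def ancestors(node, parent):
    """Chain from `node` up to the root, e.g. [Chip1, Board1, System]."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def classify(a, b, parent):
    """Return (crossed, enclosing) for a superedge between a and b.

    crossed:   pairs of distinct units the edge connects externally,
               lowest level first.
    enclosing: the smallest unit that contains the whole connection.
    """
    chain_a, chain_b = ancestors(a, parent), ancestors(b, parent)
    levels = list(zip(chain_a, chain_b))
    crossed = [(x, y) for x, y in levels if x != y]
    enclosing = next(x for x, y in levels if x == y)
    return crossed, enclosing


# Hypothetical containment table for a two-board system.
PARENT = {
    "Chip1": "Board1", "Chip2": "Board1", "Chip3": "Board2", "Chip4": "Board2",
    "Board1": "System", "Board2": "System",
    "Module2": "Chip3", "ALU1": "Module2", "RegA": "Module2",
}

e1 = classify("Chip1", "Chip2", PARENT)   # same board
e2 = classify("Chip1", "Chip4", PARENT)   # different boards
e3 = classify("ALU1", "RegA", PARENT)     # inside one module
```

An edge is internal to its enclosing unit and external at every level listed in
the crossed pairs.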
6.6 Summary 
This chapter presented a unified model for bridging the gap between
behavioral and layout synthesis. This model has three main features. The first
is that the supergraph model is equivalent to a structural description of the
design. Using the area and timing models described in Chapter 5, we can estimate
the layout area and delay of the design from the supergraph representation. In
addition, the supergraph model also provides a chip structure hierarchy of the
design. The second feature is that the supergraph encapsulates the behavior
(CDFG) of the design. Using this supergraph model, we can retrieve layout
information during the design process to support design decision making and
design tradeoffs. The last and most important feature is that this model
provides a unified behavior/structure view of the design that is well suited for
interactive synthesis.
Chapter 7 
Binding Using Layout 
Information 
This chapter presents two layout-driven approaches for behavioral synthesis:
using the layout model and using the feedback layout information. To provide a
fast area/timing measure in the datapath design process, this chapter describes
a new approach that combines our proposed area/timing model discussed in
Chapter 5 and the unified model discussed in Chapter 6 for datapath
optimization. We model the datapath as a graph representation that noticeably
reflects the datapath floorplan and we also formulate datapath binding as a
graph partitioning problem. Contrary to the other datapath optimization
algorithms that minimize the number and size of registers and muxes, our
algorithm evaluates layout-area quality during datapath optimization. Our
approach provides faster and more accurate area quality measures for datapath
optimization than previous approaches. In addition, this chapter also presents
an approach that uses back-annotation of layout information to estimate the
clock cycle of the design.
The remainder of this chapter is organized in the following manner. Section 
7.1 presents our unit-binding approach for datapath optimization. First, Section 
7.1.1 defines the unit-binding problem. Then, Section 7.1.2 describes the area-cost 
function. Finally, Section 7.1.3 presents the unit-binding algorithm. Section 7.2 
describes the back-annotation approach for clock estimation. Further, Section 7.3
presents the experimental results. Finally, Section 7.4 concludes our approach. 
7.1 Unit Binding 
Datapath synthesis consists of three interdependent binding tasks: functional-
unit binding, storage-unit binding and interconnect-unit binding. Functional-unit 
binding determines the exact mapping of the operations into the functional units. 
Storage-unit binding maps data carriers, such as variables and constants, in the 
behavioral description to storage components. Interconnect-unit binding assigns 
interconnect units and wires between functional/storage units to connect data-
transfer paths. 
In the past, the number and size of functional units, registers, muxes (or
equivalent 2-to-1 muxes), mux inputs and connections (or wires) were the
commonly used area measures in datapath synthesis. Consequently, a great deal of
effort has been devoted to minimizing the number and size of registers and muxes
in datapath synthesis [C1Th90, DeNe89, DiTh89, HCLH90, LyEG90, Pang88, PaGa87,
PaPM86, PaKG86, TsSi86]. However, these area measures assume that the layout
area is directly proportional to the number and size of RT components and
do not take into account layout technology factors, such as layout architectures
or styles, component libraries, and the impact of floorplanning, placement and
routing. These factors often greatly affect the final layout of the design.
In this section, we describe a unit binding algorithm that combines the lay-
out model described in Chapter 5 and a graph representation for datapath opti-
mization. Contrary to the other datapath binding algorithms which minimize the 
number and size of registers and muxes, our algorithm uses the combination of 
the layout model and unified representation to evaluate area quality during the 
datapath optimization. 
7.1.1 Problem Definition 
The objective of unit binding is to assign operations to functional units and 
to assign variables to storage units so that the total layout area is minimized. 
We can formulate the unit-binding problem as a graph-partitioning problem, as
follows: given a data-flow graph, its corresponding schedule, and a set of
functional units, partition operations and variables into a set of supernodes,
such that:
1. no two operations in the same control step can be assigned to the same
functional-unit supernode, 
2. no variables with overlapping lifetime can be assigned to the same storage-
unit supernode, and 
3. the total area of the supergraph is minimized. 
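The first two conditions are feasibility constraints and can be checked independently of the area objective. The following is a minimal sketch of such a check, not the implementation used in this work. The function name, the dictionary-based encoding and the half-open lifetime convention (a variable occupies its register from its birth step up to, but not including, its death step) are assumptions made for illustration.

```python
from collections import defaultdict

def is_feasible_binding(op_step, op_unit, var_life, var_reg):
    """Check constraints 1 and 2 of the unit-binding formulation.

    op_step:  operation -> control step            (hypothetical encoding)
    op_unit:  operation -> functional-unit supernode
    var_life: variable  -> (birth, death), assumed half-open [birth, death)
    var_reg:  variable  -> storage-unit supernode
    """
    # Constraint 1: at most one operation per (functional unit, control step).
    used_slots = set()
    for op, unit in op_unit.items():
        slot = (unit, op_step[op])
        if slot in used_slots:
            return False
        used_slots.add(slot)

    # Constraint 2: lifetimes that share a register must not overlap.
    by_reg = defaultdict(list)
    for var, reg in var_reg.items():
        by_reg[reg].append(var_life[var])
    for lifetimes in by_reg.values():
        lifetimes.sort()
        for (_, death), (birth, _) in zip(lifetimes, lifetimes[1:]):
            if birth < death:
                return False
    return True
```

Under this convention, two variables with lifetimes (0, 1) and (1, 3) may share a register, whereas lifetimes (0, 2) and (1, 3) may not.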
7.1.2 Area Cost Function 
Using this graph model, we can calculate the datapath area and control-unit
area with the layout model described in Chapter 5. The layout model needs
two elements to compute the area: the number of transistors and the number of
routing tracks.
The number of transistors of each RT component can be obtained directly
from the target component library. To obtain the number of routing tracks required
to completely connect all nets in one bit slice, we first perform stack
placement using the KLFM algorithm [FiMa82, KeLi70]. Then we perform
routing-track assignment using the left-edge algorithm. Since the stack placement
takes pseudo-linear time and the routing-track estimation takes O(n log n)
time where n is the number of nets in the RT-netlist, the complexity of the area 
calculation is O( n log n). 
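The routing-track estimate can be illustrated with a small sketch of left-edge track assignment. This is an illustrative version only: it assumes each net has already been reduced to a horizontal span (left, right) by the stack placement, that two nets sharing a column cannot share a track, and that there are no vertical constraints. The function name and data layout are hypothetical. Sorting dominates the running time, which is consistent with the O(n log n) bound stated above.

```python
import heapq

def left_edge_tracks(nets):
    """Assign net spans to routing tracks, left edge first.

    nets: dict mapping a net name to its inclusive (left, right) span.
    Returns a list of tracks, each a list of net names in left-to-right order.
    """
    ordered = sorted(nets.items(), key=lambda item: item[1][0])
    tracks = []
    heap = []  # entries are (rightmost column used, track index)
    for name, (left, right) in ordered:
        if heap and heap[0][0] < left:
            # The track that frees up earliest can take this net.
            _, index = heapq.heappop(heap)
            tracks[index].append(name)
        else:
            # Every open track is still occupied at this column.
            index = len(tracks)
            tracks.append([name])
        heapq.heappush(heap, (right, index))
    return tracks

example = {"n1": (0, 3), "n2": (1, 5), "n3": (4, 7), "n4": (6, 9)}
print(len(left_edge_tracks(example)))
```

The number of tracks returned is the quantity that enters the area calculation; the assignment itself is only needed when the layout is generated.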
7.1.3 The Algorithm 
The algorithm consists of two phases: initial assignment and interchange 
optimization. Initial assignment consists of two steps: supernode formation and 
superedge formation. In the first step, the algorithm determines the minimum 
number of registers required to store all variables using the left-edge algorithm 
[KuPa87] and assigns variables to corresponding registers. The algorithm then 
assigns operations to the given functional units arbitrarily, such that no more than 
one operation in the same control step will be assigned to the same functional unit. 
Finally, the algorithm forms the graph as described in Algorithm 6.1. 
In the interchange optimization phase, the algorithm performs superedge
merging by interchanging operations and variables that reside in the supernodes.
Superedge merging is important because it contributes to interconnect sharing. In
the following, we first describe two possible ways to merge superedges by
interchanging variables or operations: node relocation and node swapping. Then,
we describe an interchange technique that takes into account the interdependent
relationship between operation and variable assignments.
Figure 7.1: Superedge merging by node relocation: (a) before, (b) after. 
The first possible way to merge superedges is to relocate variables between
register supernodes or to relocate operations between operation supernodes. A
variable can be relocated from a source register supernode to a destination register
supernode if and only if the destination supernode is free during the lifetime
of that variable. An operation can be relocated from a source supernode to a
destination supernode if and only if: (1) the destination supernode can perform
the function of that operation, and (2) no operation already in the destination
supernode is assigned to the same control step as the operation being relocated.
We term the above conditions the "relocation preconditions". A node relocation
that satisfies the relocation preconditions is termed a "feasible relocation", and
only feasible relocations are performed. Node relocation can move a single node
or a group of nodes at one time.
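The relocation preconditions translate directly into two predicates. The sketch below is an assumption-laden illustration: the function names are hypothetical, lifetimes are taken to be half-open (birth, death) intervals, and a destination supernode is described only by the lifetimes, control steps and functions it already holds.

```python
def can_relocate_variable(lifetime, dest_lifetimes):
    """True if the destination register is free during the variable's lifetime."""
    birth, death = lifetime
    return all(death <= b or d <= birth for (b, d) in dest_lifetimes)

def can_relocate_operation(function, step, dest_functions, dest_steps):
    """True if the destination unit performs the function and is idle in that step."""
    return function in dest_functions and step not in dest_steps
```

For a group relocation, the same predicates would be applied to every node in the group, with each accepted node added to the destination before the next one is tested.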
Figure 7.2: Node swapping.
As an example, consider the register supernode V3 in Figure 7.1(a). V3 is
connected to V1 and V2 by the superedges e1 and e2, respectively. Since V3
has to select one input from two sources, V1 and V2, a 2-input mux is required.
If node b in V2 can be moved to V1, then e1 and e2 can be merged into e12, as
shown in Figure 7.1(b). As a result, V3 does not need a mux for its input.
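The effect of merging on mux size can be sketched by counting, for each destination supernode, the number of distinct source supernodes that drive it. This is a simplified illustration that assumes a single input port per supernode; apart from node b and the supernodes V1, V2 and V3, the node names and the function name are hypothetical.

```python
def mux_inputs(transfers, binding):
    """Count distinct source supernodes feeding each destination supernode.

    transfers: iterable of (source node, destination node) data transfers.
    binding:   node -> supernode it is currently assigned to.
    A count of 1 means the input needs no mux.
    """
    sources = {}
    for src, dst in transfers:
        sources.setdefault(binding[dst], set()).add(binding[src])
    return {supernode: len(srcs) for supernode, srcs in sources.items()}

transfers = [("a", "c"), ("b", "d")]
before = {"a": "V1", "b": "V2", "c": "V3", "d": "V3"}
after = dict(before, b="V1")          # relocate node b from V2 to V1

print(mux_inputs(transfers, before))  # V3 is driven by two supernodes
print(mux_inputs(transfers, after))   # V3 is driven by one supernode
```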
The second possible way to merge superedges is to swap variables between
register supernodes or to swap operations between operation supernodes.
Node swapping can be viewed as a two-way node relocation problem. Node relocation
moves nodes from one source supernode to one or more destination supernodes
when a feasible relocation exists. Node swapping, on the
other hand, is performed when a one-way feasible relocation from
a source supernode to a destination supernode cannot be found, but a feasible
relocation can be created by rearranging the nodes in the destination supernode.
Figure 7.3: Interchange by considering interdependent relationship between operation
and variable assignments.
In Figure 7.2(a), assume V2 and V3 are two operation supernodes; e1 and e2 can
be merged by relocating node c from V3 to V2. If node a in V2 is assigned to the
same control step as node c, then node c cannot be relocated from V3 to V2.
However, e1 and e2 can be merged by swapping node a and node c, as shown in
Figure 7.2(b).
In order to take into account the interdependent relationship between operation
and variable assignments, the algorithm determines a group of "feasible
relocation" nodes by rearranging variables in the registers and operations in the
functional units simultaneously. Consider Figure 7.3(a), where e4 and e5 can be
merged by relocating variable e from V3 to V4, so that the mux in front of V6
can be eliminated. However, after relocating variable e, e1 has to be split into
two superedges, e1-1 and e1-2, as shown in Figure 7.3(b), so that an additional mux
is needed in front of V4. If there is also a feasible relocation of operation b
from V1 to V2, the two relocations together achieve an overall interconnect
reduction. As a result, the algorithm relocates variable e and operation b to V4
and V2, respectively, as shown in Figure 7.3(c).
Algorithm 7.1 describes the unit binding for datapath synthesis. The input
to the algorithm includes a given CDFG, the schedule and a set of given functional
units. The algorithm first uses the left-edge algorithm to assign variables
to a set of registers (Procedure left_edge_alg()). Then, the algorithm assigns
operations and variables to functional units and registers arbitrarily (Procedure
init_op_var_assignment()). The procedure Supergraph_Formation() forms the
supergraph, while the procedure layout_estimation() returns the area cost of the given
supergraph. Next, the algorithm locates the supernodes whose inputs
connect to more than one supernode (Procedure locate_feasible_merging_superedge()).
The superedges connected to the inputs of these supernodes are called "feasibly
mergeable superedges". For each feasibly mergeable superedge, the algorithm locates
a set of variables or operations associated with this superedge, rearranges the nodes
(Procedure relocate_node()), and calculates the layout area as described in the
previous section. If a smaller layout area is obtained, then a merging solution has
been found. If there is only one superedge connected to an input pin of a supernode,
then a mux is not needed for this input pin. Thus, this superedge achieves
the maximum "sharing". In this case, the algorithm will "lock" this superedge
so that it is no longer considered a feasibly mergeable
superedge. The algorithm begins with the minimum number of registers and performs
the allocation iteratively by incrementing the number of registers to explore
the design space. For each iteration, the interchange optimization runs repeatedly
until no more improvement can be found.
Algorithm 7.1. Unit Binding.
Let
  G = {V, E} be a CDFG;
  H = {V, E} be a supergraph;
  "count" be a given arbitrary number;
  M be a set of feasible merging superedges;
  F be a set of given functional units;
  T be a schedule;
  P be the operation/variable assignments;
  R and Rx be sets of registers;
Unit_Binding(G, F, T, count) {
  Rx = φ;
  while (count > 0) do
    R = left_edge_alg(G);
    P = init_op_var_assignment(G, F, R, T);
    /* See Algorithm 6.1 */
    H = Supergraph_Formation(G, F ∪ R ∪ Rx, T, P);
    old_area = layout_estimation(H);
    /* Interchange optimization */
    no_more_improve = FALSE;
    while (no_more_improve = FALSE) do
      M = locate_feasible_merging_superedge(H);
      a_gain_merging = FALSE;
      for (∀ feasible_merging_superedge e ∈ M) do
        /* relocate nodes associated with superedge e */
        H' = relocate_node(e);
        new_area = layout_estimation(H');
        if (new_area < old_area) then
          H = H';
          old_area = new_area;
          a_gain_merging = TRUE;
        endif
      endfor
      if (a_gain_merging = FALSE) then
        no_more_improve = TRUE;
      endif
    endwhile
    /* add one more register for the next allocation iteration */
    count = count - 1;
    if (count > 0) then
      Rx = Rx ∪ register;
    endif
  endwhile
}
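The inner interchange loop of Algorithm 7.1 can be sketched in Python as a greedy improvement loop. This is only an outline under stated assumptions: the supergraph is treated as an opaque value, and the three callables (locate, relocate, area) are hypothetical stand-ins for locate_feasible_merging_superedge(), relocate_node() and layout_estimation(). Here relocate is assumed to return a new candidate supergraph, or None when no feasible relocation exists for the superedge.

```python
def interchange_optimization(supergraph, locate, relocate, area):
    """Greedy interchange loop: accept a relocation only if it reduces area.

    locate(h)      -> iterable of feasibly mergeable superedges of h
    relocate(h, e) -> candidate supergraph after relocating nodes of e, or None
    area(h)        -> estimated layout area of h
    """
    old_area = area(supergraph)
    improved = True
    while improved:                      # stops when a full pass gains nothing
        improved = False
        for edge in locate(supergraph):
            candidate = relocate(supergraph, edge)
            if candidate is None:        # relocation preconditions not met
                continue
            new_area = area(candidate)
            if new_area < old_area:
                supergraph, old_area = candidate, new_area
                improved = True
    return supergraph, old_area

# Toy stand-in: the "supergraph" is a tuple of counters and the area is their sum.
def toy_locate(h):
    return range(len(h))

def toy_relocate(h, i):
    return h[:i] + (h[i] - 1,) + h[i + 1:] if h[i] > 0 else None

print(interchange_optimization((2, 1, 0), toy_locate, toy_relocate, sum))
```

The outer loop of the algorithm would call this routine once per register count, after rebuilding the initial assignment and the supergraph.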
Complexity analysis. Since the algorithm performs register and selector tradeoffs
in several runs (outer while loop), we consider only one allocation run, which
consists of four parts:
1. Using the left-edge algorithm, it takes O(m log m) time to determine the
minimum number of registers and the initial variable assignment, where m is the
number of variables in the CDFG.
2. The complexity of the Supergraph_Formation procedure is described with Algorithm
6.1.
3. The complexity of the layout_estimation procedure is O(n log n), where n is the
number of nets in the netlist.
Figure 7.4: The register-to-register delay path: (a) var node insertion, (b) the
structure.
4. In the interchange optimization procedure, it takes O(pq) time to locate
feasible merging superedges, where p is the number of superedges and q is the
number of supernodes. For each feasible merging superedge, it takes O(r +
n log n) time to relocate nodes and estimate area, where r is the average number of
variables or operations associated with the feasible merging superedge. Thus,
each interchange optimization loop takes O(pq + s(r + n log n)) time, where
s is the average number of feasible merging superedges. In our experience,
the locally optimal (no_more_improve) state is reached in fewer than 20
iterations of the interchange optimization while loop.
7.2 Back Annotation for Clock Estimation
Typically, the clock period is determined by the most critical register-to-register
delay over all register-to-register paths. Figures 7.4(a) and (b) show a
register-to-register path of a CDFG example and its corresponding structure. To
account for the delay of each segment of a register-to-register path in a CDFG, we
insert an internal node var on each edge of the CDFG, as shown in Figure 7.4(a).
Hence, we can use Equation 5.19 (the register-to-register delay of an operation) to
compute the datapath delay.
In the rest of this section, we walk through an example (Figure 7.5).
Figure 7.5(a) shows a data-flow graph example that is scheduled into five steps.
Given an adder and a subtracter, Figure 7.5(b) shows the operation and variable
assignments. Operations add1, add2 and add3 are assigned to the adder (Adder),
and operations sub1 and sub2 are assigned to the subtracter (Sub). For the eight
variables a, b, c, d, e, f, g and h, we need three registers to store them, as
shown in Figure 7.5(b). Figure 7.5(c) shows the DFG after inserting var nodes, and the
resulting structural graph and structure are shown in Figures 7.5(d) and (e).
Figure 7.5: The back-annotation example: (a) DFG and schedule, (b) operation/variable
assignments, (c) var node insertion, (d) the graph, (e) the structure.
Figure 7.6: The layout of the back-annotation example: (a) routing track assignments,
(b) the final layout.
To obtain layout information, we can use the area and timing models described
in Chapter 5 or retrieve the layout information from the real layout. In
this section, we describe how to back-annotate the propagation delay of the datapath
from the real layout. We assume the bit-width of the datapath is eight with
the bit-sliced stack layout architecture, as described in Chapter 3. The layout of
the Figure 7.5 example is shown in Figure 7.6. Figure 7.7(a) shows the actual
wire length of each connection and the delay of each component retrieved from the
layout. Each edge delay is then computed. For example, the delay of edge e1 in
Figure 7.7(b) is 5 ns, which is equal to the delay of Mux1. Based on the delay
information shown in Figures 7.7(a) and (b), we can compute the register-to-register
propagation delay of each operation using Equation 5.19. For instance, the delay of operation
add1 is the sum of the maximum propagation delay of var-node 1 and var-node 2
(4.8 ns), the maximum delay of e2 and e10 (4.8 ns), the delay of the adder (22 ns),
the maximum delay of e3 and e4 (5.3 ns) and the set-up time of var-node 6 and
var-node 7 (1 ns); that is, the total datapath propagation delay of operation add1
is 37.9 ns. Using the same procedure, we can compute the delay for each operation
and determine the worst register-to-register delay using Equation 5.21. Similarly,
we can compute the control-unit delay, and thus compute the clock period using
Equation 5.18.
(a)
connection   wire length (um)
c1             130
c2           1,040
c3             260
c4           2,040
c5              10
c6             660
c7             280
c8             230
c9             230
c10            410
c11          1,160
c12            500
c13          1,760
component    delay (ns)
Mux1         5.0
Mux2         5.3
Mux3         5.2
Mux4         5.0
Mux5         4.8
Reg1         1/4.8 *
Reg2         1/4.2 *
Reg3         1/4.7 *
-            21.0 **
+            22.0 **
*: (set-up/propagation) delay.
**: Delay with 8-bit component.
(b)
connection   delay (ns)
e1           5.0
e2           4.8
e3           5.3
e4           5.3
e5           0.0
e6           0.0
e7           5.0
e8           0.0
e9           5.3
e10          0.0
e11          5.3
e12          5.3
e13          5.2
e14          5.2
e15          5.0
e16          4.8
e17          5.3
e18          5.0
e19          5.3
e20          0.0
e21          5.2
e22          4.8
e23          5.3
e24          0.0
var-node     delay (ns) (set-up/propagation)
1            1/4.8
2            1/4.2
3            1/4.2
4            1/4.7
5            1/4.7
6            1/4.8
7            1/4.8
8            1/4.2
9            1/4.2
10           1/4.7
11           1/4.8
12           1/4.2
Figure 7.7: (a) Back-annotation of wire lengths and component delays, (b) back-annotation
of delay information to the DFG.
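The add1 computation can be reproduced with a few lines of arithmetic. The sketch follows the verbal description in this section (sum of the worst source propagation delay, worst input-edge delay, functional-unit delay, worst output-edge delay and destination set-up time); Equation 5.19 itself is defined in Chapter 5 and is not restated here, so the function below is an assumed paraphrase of it with hypothetical parameter names. The numeric values are the ones listed in Figure 7.7.

```python
def reg_to_reg_delay(src_prop, in_edges, fu_delay, out_edges, dst_setup):
    """Register-to-register delay of one operation, in ns."""
    return (max(src_prop) + max(in_edges) + fu_delay
            + max(out_edges) + max(dst_setup))

add1_delay = reg_to_reg_delay(
    src_prop=[4.8, 4.2],    # propagation delays of var-nodes 1 and 2
    in_edges=[4.8, 0.0],    # delays of e2 and e10
    fu_delay=22.0,          # 8-bit adder
    out_edges=[5.3, 5.3],   # delays of e3 and e4
    dst_setup=[1.0, 1.0],   # set-up times of var-nodes 6 and 7
)
print(round(add1_delay, 1))  # 37.9
```

Repeating the call for every operation and taking the maximum gives the worst register-to-register delay of the datapath.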
One interesting observation from this example is that using this supergraph
model we can back-annotate delay information to each node and edge in the CDFG.
This is very useful in the interactive synthesis process because this detailed delay
information can pinpoint the critical delay point (e.g., a particular wire or component)
or delay path in the CDFG as well as in the structure.
7.3 Experiments
We have implemented the algorithm described above in the C programming
language on SUN4 workstations under UNIX. We have tested the binding
algorithm on the elliptic filter benchmark with different schedules, including
17-step with 3-adder and 2-piped multiplier, 19-step with 2-adder and 2-multiplier,
21-step with 2-adder and 1-multiplier, and 19-step with 2-adder and 1-piped
multiplier. Figure 7.8 shows the schedule of the 19-step with 2-adder and 1-piped
multiplier example. Figures 7.9, 7.10, 7.11 and 7.12 show four different
implementations of the design shown in Figure 7.8. The other examples can
be found in [WuGa91].
We use the single-level multiplexer model. The transistor-pitch and wire-pitch
coefficients (α and β) were calculated from the VTI 1.5-µm datapath library
[VTI88]. The final layouts were generated using Mentor Graphics GDT tools.
Figure 7.13 shows the datapath of a 16-bit elliptic filter using layout architectures
I and II described in Chapter 5, in which architecture I uses 13 over-the-cell routing
tracks for each bit slice. Figures 7.14 and 7.15 show the results of four different
designs. Since the areas of the multipliers are the same for each design, the areas
shown in Figures 7.14 and 7.15 do not include the multiplier. Furthermore, the
results only show the area of a 1-bit datapath.
The results show that all implementations using architecture I require fewer
than 13 actual routing tracks, so no extra routing area is needed. Hence,
the datapath area using architecture I depends solely on the number of transistors
in the datapath. On the other hand, using layout architecture II, both
transistors and routing tracks contribute equally to the total area. The results
also show that neither the design with the minimum number of registers nor the
design with the minimum number of muxes can guarantee the minimum area. For
instance,
(1) The designs with the minimum number of registers do not always produce the
minimum area, such as:
(i) 21-step and architecture I (Figure 7.15(e)).
(ii) 19-step and 19-step with 2-adder and 1-piped multiplier,
and architecture II (Figure 7.15(d)(h)).
(2) The designs with the minimum number of mux inputs do not always produce
the minimum area, such as:
(i) 19-step with 2-adder and 1-piped multiplier, and architecture I (Figure 7.15(g)).
(ii) 21-step and architecture II (Figure 7.15(f)).
(iii) 17-step, architecture I and II (Figure 7.15(a)(b)).
(3) The design that produces the minimum area using layout architecture I is not
guaranteed to produce the minimum area using layout architecture II. For example, the 21-step
design (Figure 7.15(e)) with 11 registers and 27 mux inputs produces the minimum
area using layout architecture I but not layout architecture II (Figure 7.15(f)).
To calculate the total area of the design including datapath, control unit
and multiplier, we have experimented with a 16-bit, 19-step, 2-adder, and 1-piped
multiplier elliptic filter example. Using datapath architecture I, we implemented
two control-logic models, PLA and random logic, along with mux interconnect
models. Figures 7.16 and 7.17 show a 16-bit elliptic filter example with PLA
and random-logic implementations, respectively. Since the multiplier is treated as
a macrocell, the area of the multiplier is obtained directly from the component
library. Figure 7.18 shows that using the random-logic implementation the design
with 13 registers produces the minimum total area, while using the PLA implementation
the design with 11 registers produces the minimum total area. We have also used
the described area and delay models to explore the design space of the elliptic filter
benchmark. The results in Figure 7.19 show that the 17-step design has the fastest
speed and the largest area, while the 21-step design has the slowest speed and the
smallest area.
Figure 7.8: The schedule of the 19-step Elliptic Filter benchmark. 
Figure 7.9: The structural netlist with 10 registers implementation. 
Figure 7.10: The structural netlist with 11 registers implementation. 
Figure 7.11: The structural netlist with 12 registers implementation. 
Figure 7.12: The structural netlist with 13 registers implementation. 
Figure 7.13: The data path layouts of a 16-bit elliptic filter example: (a) architec-
ture I, (b) architecture II. 
** Area not including multiplier
?: digit not legible in the source
(a)
Control  # of  # of      # of   # of Sel./   # Trs.  # nets  # Trks.  Layout Arch. I         Layout Arch. II
Steps    +     *         Reg.   Sel. Inputs                           Actual Area (um2/bit)  Actual Area (um2/bit)
17       3     2-piped   10     11/34        552     27      11       136,720                193,117
17       3     2-piped   11     10/33        564     27      11       138,080                195,038
17       3     2-piped   12     8/31         572     27      11       138,480                195,480
17       3     2-piped   13     9/33         604     28      10       145,040                198,430
(b)
Control  # of  # of      # of   # of Sel./   # Trs.  # nets  # Trks.  Layout Arch. I         Layout Arch. II
Steps    +     *         Reg.   Sel. Inputs                           Actual Area (um2/bit)  Actual Area (um2/bit)
19       2     2         10     8/30         472     23      10       113,4??                156,420
19       2     2         11     6/28         480     22      9        113,760                151,726
19       2     2         12     6/28         500     23      9        117,280                156,862
19       2     2         13     6/29         524     24      9        122,0??                163,212
(c)
Control  # of  # of      # of   # of Sel./   # Trs.  # nets  # Trks.  Layout Arch. I         Layout Arch. II
Steps    +     *         Reg.   Sel. Inputs                           Actual Area (um2/bit)  Actual Area (um2/bit)
21       2     1         10     7/30         480     20      ?        113,53?                146,464
21       2     1         11     5/27         480     19      10       111,45?                151,836
21       2     1         12     5/28         504     19      9        115,296                152,934
21       2     1         13     6/31         540     23      9        126,???                168,235
(d)
Control  # of  # of      # of   # of Sel./   # Trs.  # nets  # Trks.  Layout Arch. I         Layout Arch. II
Steps    +     *         Reg.   Sel. Inputs                           Actual Area (um2/bit)  Actual Area (um2/bit)
19       2     1-piped   10     10/36        ???     23      10       125,376                170,976
19       2     1-piped   11     6/28         4?2     20      9        113,136                150,045
19       2     1-piped   12     6/26         492     21      8        115,696                148,272
19       2     1-piped   13     5/23         480     21      8        113,696                146,672
Figure 7.14: The results of the Elliptic Filter example: part 1.
[Plots of actual area versus (# of Reg., # of Sel. Inputs), one panel per design:
(a) 17-step, mux, and architecture I; (b) 17-step, mux, and architecture II;
(c) 19-step, mux, and architecture I; (d) 19-step, mux, and architecture II;
(e) 21-step, mux, and architecture I; (f) 21-step, mux, and architecture II;
(g) 19-step (2-adder, 1-piped), mux, and architecture I;
(h) 19-step (2-adder, 1-piped), mux, and architecture II.]
Figure 7.15: The results of the Elliptic Filter example: part 2.
Figure 7.16: The final layout of a 16-bit elliptic filter example with PLA imple-
mentation. 
Figure 7.17: The final layout of a 16-bit elliptic filter example with random-logic 
implementation. 
                                    Architecture I   Control logic                Total Area
# of   # of Sel./    Multiplier     Datapath         PLA          Random Logic    PLA          Random Logic
Reg.   Sel. Inputs   Area (um2)     Area (um2)       Area (um2)   Area (um2)      Area (um2)   Area (um2)
10     10/36         2,330,880      2,006,016        312,256      255,352         4,649,152    4,592,248
11     6/28          2,330,880      1,810,176        267,540      230,082         4,408,596    4,371,138
12     6/26          2,330,880      1,851,136        266,228      196,616         4,448,244    4,378,632
13     5/23          2,330,880      1,819,136        259,116      200,889         4,409,132    4,350,905
Figure 7.18: The overall area estimation of the elliptic filter example. 
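The totals in Figure 7.18 are the sum of the multiplier, datapath and control-logic areas, which can be checked with a short script. The figures below are copied from the table; the variable names are illustrative only.

```python
MULTIPLIER_AREA = 2_330_880  # um^2, macrocell area from the component library

# registers -> (datapath area, PLA area, random-logic area), all in um^2
designs = {
    10: (2_006_016, 312_256, 255_352),
    11: (1_810_176, 267_540, 230_082),
    12: (1_851_136, 266_228, 196_616),
    13: (1_819_136, 259_116, 200_889),
}

# registers -> (total area with PLA control, total area with random-logic control)
totals = {
    regs: (MULTIPLIER_AREA + datapath + pla, MULTIPLIER_AREA + datapath + random_logic)
    for regs, (datapath, pla, random_logic) in designs.items()
}

best_pla = min(totals, key=lambda regs: totals[regs][0])
best_random = min(totals, key=lambda regs: totals[regs][1])
print(best_pla, best_random)  # 11 13
```

The margin in the PLA case is small: the 11-register and 13-register designs differ by only 536 um2 in total area.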
(a)
# of    # of + /      # of regs /         Clock   Total execution   Total area
steps   # of *        sel / sel inputs    (ns)    time (ns)         (um2)
17      3 / 2-piped   12 / 8 / 31         64.8    1,102             7,074,632
19      2 / 2         11 / 6 / 28         60.1    1,142             6,682,152
21      2 / 1         11 / 5 / 27         62.0    1,302             4,329,526
19      2 / 1-piped   13 / 5 / 23         66.5    1,263             4,350,905
(b)
[AT-curve plot: total area (10^6 um2) versus total delay (ns), with one point per
design labeled (Control-step / # of + / # of *): (17 / 3 / 2-piped), (19 / 2 / 2),
(21 / 2 / 1) and (19 / 2 / 1-piped).]
Figure 7.19: The area-time curve of four different designs of the elliptic filter
benchmark: (a) table, (b) AT-curve.
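The entries of Figure 7.19(a) can be cross-checked with a short script. Two assumptions are made here that the text does not state explicitly: that the total execution time is the number of control steps multiplied by the clock period (the listed values agree with this product to within 1 ns), and that a design is worth keeping on the AT-curve when no other design is at least as good in both time and area. The names below are illustrative.

```python
designs = [
    # (label, control steps, clock in ns, listed execution time in ns, total area in um^2)
    ("17 / 3 / 2-piped", 17, 64.8, 1102, 7_074_632),
    ("19 / 2 / 2",       19, 60.1, 1142, 6_682_152),
    ("21 / 2 / 1",       21, 62.0, 1302, 4_329_526),
    ("19 / 2 / 1-piped", 19, 66.5, 1263, 4_350_905),
]

def execution_time(steps, clock_ns):
    """Assumed relation: execution time = control steps x clock period."""
    return steps * clock_ns

def pareto_front(points):
    """Return labels of designs not dominated in (time, area); smaller is better."""
    front = []
    for label, time, area in points:
        dominated = any(
            t2 <= time and a2 <= area and (t2 < time or a2 < area)
            for _, t2, a2 in points
        )
        if not dominated:
            front.append(label)
    return front

points = [(label, listed, area) for label, _, _, listed, area in designs]
front = pareto_front(points)
print(front)
```

With the listed values, none of the four designs dominates another, so all four lie on the area-time tradeoff curve.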
7.4 Conclusions 
This chapter presented a new unit-binding approach that uses the supergraph
and layout-area model for design-quality evaluation during datapath optimization.
We have shown that datapath optimization by minimizing the number and size of
registers or muxes does not always guarantee the minimum area. Since our proposed
approach is technology independent, our algorithm can evaluate design quality
and select the minimum-area design using different component libraries and layout
architectures.
We have also shown that using the supergraph model we can back-annotate
delay information (layout) to each node and edge in the CDFG (behavior). This
approach provides a unified view of the behavior and the layout. Because of this
unique feature, we can pinpoint the critical delay wire or component in the CDFG
and the structure simultaneously, which provides very useful information for the
interactive synthesis process. Furthermore, by combining the unified graph model,
layout model and unit-binding algorithm, we can explore the design space.
Chapter 8 
Conclusions 
8.1 Summary of Contributions 
This dissertation presented an approach to chip synthesis. The essential 
issues involved in chip synthesis, including target architecture, layout-synthesis 
method, design model and integration techniques between behavioral and layout 
synthesis, were presented. 
First, a sliced-layout architecture was presented for generalized register-transfer
(RT) netlists. Using the sliced-layout architecture, a layout-synthesis system was
developed for layout generation from RT netlists. The system introduced a new
partitioning approach that considers the component layout-style, floorplan and
critical paths simultaneously to improve the overall area utilization and to minimize
the critical wire length.
Second, to obtain more realistic quality measures for behavioral synthesis, 
a layout model, including area and timing models, was presented. This layout 
model considered most technology factors and thus produced more accurate esti-
mates than previously proposed models. This research also presented two different 
views of quality measures, "accuracy" and "fidelity". When the accuracy of es-
timates is the main concern, the layout model should be formulated using the 
actually implemented algorithms or should be evaluated by running the actually 
implemented algorithms. On the other hand, when the estimation turn-around 
time is the main concern, a simple and fast estimate with high fidelity is needed. 
In chip synthesis, we can mix these two types of quality measures for design
evaluation: we use the fast, high-fidelity estimates for design tradeoffs
and the slow but accurate estimates to evaluate the design quality.
Third, a unified model was developed to bridge the gap between behavioral
and structural descriptions. This model encapsulates both the behavior and the structure
of the design and thus provides a unified behavior/structure view of the design. Hence,
using this unified model, synthesis tools can retrieve layout information at any
design level to support design decision making and design tradeoffs.
Finally, two methods, layout-model driven and feedback driven, were presented
to incorporate layout information into behavioral synthesis. A unit-binding
approach combining the supergraph model and layout model was presented for datapath
optimization. The experiments showed that the previous algorithms, which
minimize the number and size of registers and muxes, do not always guarantee the
minimum area. In contrast, our approach can evaluate the area quality of the
design and select the minimum-area design. In addition, a back-annotation method
was presented that
can pinpoint the critical delay wire or component in the behavioral description.
This feature is very useful for interactive synthesis.
8.2 Future Work 
While the essential issues of integration of behavioral and layout synthesis for 
chip design are addressed in this research, a number of issues and improvements 
need to be studied further. First, at the layout-synthesis level, more sophisticated 
bit-sliced stack partitioning techniques, such as interleaved folding, are needed to 
improve stack area utilization. An I/O-pad placement and routing algorithm is
needed in order to generate a complete chip.
Second, there are many open quality-measure problems that need to be explored.
The area measures for specific layout architectures, such as gate array, sea
of gates and FPGA, should be studied further. In order to improve the accuracy
of area/performance measures, the impact of control-logic optimization needs to
be taken into account. Furthermore, a more extensive empirical study is needed to
objectively validate our proposed layout model.
Third, using the unified representation and layout model, different behavioral-synthesis
tasks, including module selection, scheduling and allocation, need to be
developed. In addition, a system-level partitioning scheme using the proposed unified
representation and layout model should be studied further.
Bibliography
[Arms89] J. R. Armstrong, Chip Level Modeling with VHDL, Prentice-Hall, 1989.
[BrGa90] F. Brewer and D. D. Gajski, "Chippe: A System for Constraint Driven
Behavioral Synthesis," IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 9, no. 7, pp. 681-695, July
1990.
[ChGa90] G. D. Chen and D. D. Gajski, "An Intelligent Component Database
System for Behavioral Synthesis," Proceedings of the 27th Design
Automation Conference, pp. 150-155, 1990.
[ChWG91] V. Chaiyakul, A. C-H Wu and D. D. Gajski, "Timing Models for
High-Level Synthesis," Info. & Computer Science Dept., UCI, Tech.
Rep. 91-70, 1991.
[C1Th90] R. J. Cloutier and D. E. Thomas, "The Combination of Scheduling,
Allocation, and Mapping in a Single Algorithm," Proceedings of the
27th Design Automation Conference, pp. 71-76, 1990.
[CNSD90] H. Cai, S. Note, P. Six and H. De Man, "A Data Path Layout
Assembler for High Performance DSP Circuits," Proceedings of the
27th Design Automation Conference, pp. 306-311, 1990.
[DeNe89] S. Devadas and A. R. Newton, "Algorithms for Hardware Allocation in
Data Path Synthesis," IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. CAD-8, no. 7, pp. 768-781,
1989.
[DiTh89] E. Dirkes Lagnese and D. E. Thomas, "Architectural Partitioning for
System Level Design," Proceedings of the 26th Design Automation
Conference, pp. 62-67, 1989.
[DuKe85] A. E. Dunlop and B. W. Kernighan, "A Procedure for Placement
of Standard-Cell VLSI Circuits," IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. CAD-4, no. 1,
pp. 92-98, 1985.
[DRSC86] H. De Man, J. Rabaey, P. Six and L. Claesen, "CATHEDRAL-II:
A Silicon Compiler for Digital Signal Processing," IEEE Design and
Test, 1986.
[FiMa82] C.M. Fiduccia and R.M. Mattheyses, "A Linear-Time Heuristic 
for Improving Network Partitions," Proceedings of the 19th Design 
Automation Conference, pp. 175-181, 1982. 
[GDT89] "GDT Database and Language Tools," Silicon Compiler System, Sec. 
7, v. 4.0, 1989. 
[GDWL92] D. D. Gajski, N. Dutt, A C-H Wu and Y-L Lin, High-Level 
Synthesis: Introduction to Chip and System Design, Kluwer Academic 
Publishers, 1992. 
[HaSt71] A. Hashimoto and J. Stevens, "Wire Routing by Optimizing Channel 
Assignment within Large Apertures," The 8th Design A utomation 
Conference Workshop, pp. 155-169, 1971. 
[HCLH90] C. Y. Huang, Y. S. Chen, Y. L. Lin and Y. C. Hsu, "Data Path 
Allocation Based on Bipartite Weighted Matching," Proceedings of 
the 27th Design Automation Conference, pp. 499-504, 1990. 
[Hilf85] P. N. Hilfinger, "A High-Level Language and Silicon Compiler for 
Digital Signal Processing," Proceedings of Custom Integrated Circuit 
Conference, 1985. 
[JaJe85] R. Jamier and A. Jerraya, "APOLLON: A Datapath Compiler," 
Proceedings of the International Conference on Computer Design, 
1985. 
[Joha79] D. L. Johannsen, "Bristle Blocks: A Silicon Compiler," Proceedings 
of the 16th Design Automation Conference, pp. 310-313, 1979. 
[John67] S.C. Johnson, "Hierarchical Clustering Schemes," Psychometrika, pp. 
241-254, September, 1967. 
[KeLi70] B.W. Kernighan and S. Lin, "An Efficient Heuristic Procedure for 
Partitioning Graphs," Bell System Technical Journal, vol. 49, no. 2, 
pp. 291-307, February, 1970. 
[Knap89] D. W. Knapp, "Feedback-Driven Datapath Optimization in Fasolt," 
Proceedings of the International Conference on Computer-Aided 
Design, pp. 300-303, 1989. 
[KuPa87] F.J. Kurdahi and A.C. Parker, "REAL: A Program for Register 
Allocation," Proceedings of the 24th Design Automation Conference, 
pp. 210-215, 1987. 
[KuPa89] F.J. Kurdahi and A.C. Parker, "Techniques for Area Estimation of 
VLSI Layouts," IEEE Transactions on Computer-Aided Design of 
Integrated Circuits and Systems, vol. 8, no. 1, pp. 81-92, January 1989. 
[KuRa91] F.J. Kurdahi and C. Ramachandran, "LAST: A LAyout Area and 
Shape function esTimator for High Level Applications," Proceedings 
of the European Conference on Design Automation, pp. 351-355, 
1991. 
[LaGW91] Lawrence L. Larmore, D. D. Gajski and Allen C-H Wu, "Layout 
Placement for Sliced Architecture," IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, vol. 11, no. 1, pp. 
102-114, 1992. 
[LiGa87] Y. L. Lin and D. D. Gajski, "LES: A Layout Expert System," 
Proceedings of the 24th Design Automation Conference, pp. 672-678, 
1987. 
[LiGa88] J. S. Lis and D. D. Gajski, "Synthesis from VHDL," Proceedings of 
the International Conference on Computer Design, pp. 378-381, 1988. 
[LuDe89] W. K. Luk and A. A. Dean, "Multi-Stack Optimization for Data-
Path Chip (Microprocessor) Layout," Proceedings of the 26th Design 
Automation Conference, pp. 110-115, 1989. 
[LyEG90] T. A. Ly, W. L. Elwood, and E. F. Girczyc, "A Generalized 
Interconnect Model for Data Path Synthesis," Proceedings of the 27th 
Design Automation Conference, pp.168-173, 1990. 
[McFa86] M.C. McFarland, "Using Bottom-Up Design Techniques in 
the Synthesis of Digital Hardware from Abstract Behavioral 
Descriptions," Proceedings of the 23rd Design Automation Conference, 
pp. 474-480, 1986. 
[McKo90] M.C. McFarland and T.J. Kowalski, "Incorporating Bottom-Up Design 
into Hardware Synthesis," IEEE Transactions on Computer-Aided 
Design of Integrated Circuits and Systems, vol. 9, no. 9, pp. 938-950, 
1990. 
[NGCD91] S. Note, W. Geurts, F. Catthoor and H. De Man, "Cathedral-
III: Architecture-Driven High-Level Synthesis for High Throughput 
DSP Applications," Proceedings of the 28th Design Automation 
Conference, pp. 597-602, 1991. 
[Pang88] B. M. Pangrle, "Splicer: A Heuristic Approach to Connectivity 
Binding," Proceedings of the 25th Design Automation Conference, pp. 
536-541, 1988. 
[PaGa87] B. M. Pangrle and D. D. Gajski, "Design Tools for Intelligent 
Silicon Compilation," IEEE Transactions on Computer-Aided Design 
of Integrated Circuits and Systems, vol. CAD-6, no. 6, pp. 1098-1112, 
1987. 
[PaKG86] P. G. Paulin, J. P. Knight and E. F. Girczyc, "HAL: A Multi-
Paradigm Approach to Automatic Data Path Synthesis," Proceedings 
of the 23rd Design Automation Conference, pp. 263-270, 1986. 
[PaPM86] A. C. Parker, J. Pizarro and M. Mlinar, "MAHA: A Program for 
Datapath Synthesis," Proceedings of the 23rd Design Automation 
Conference, pp. 461-466, 1986. 
[PePr89] M. Pedram and B. Preas, "Interconnection Length Estimation for 
Optimized Standard Cell Layouts," Proceedings of the International 
Conference on Computer-Aided Design, pp. 390-393, 1989. 
[PeRu81] P. Penfield Jr. and J. Rubinstein, "Signal Delay in RC Tree 
Networks," Proceedings of the 18th Design Automation Conference, 
pp. 613-617, 1981. 
[PWSE86] B. R. Petersen, B. A. White, D. J. Salomon and M. I. Elmasry, "SPIL: 
A Silicon Compiler with Performance Evaluation," Proceedings of the 
International Conference on Computer-Aided Design, pp. 500-503, 
1986. 
[RaPB85] J. Rabaey, S. Pope and R. Brodersen, "An Integrated Automatic 
Layout Generation System," IEEE Transactions on Computer-Aided 
Design of Integrated Circuits and Systems, vol. CAD-4, pp. 285-296, 
1985. 
[RiHi88] K. Rimey and P. N. Hilfinger, "A Compiler for Application-Specific 
Signal Processors," VLSI Signal Processing, pp. 341-351, 1988. 
[RDVG88] J. Rabaey, H. De Man, J. Vanhoof, G. Goossens and F. Catthoor, 
"Cathedral-II: A Synthesis System for Multiprocessor DSP Systems," 
in Silicon Compilation (D. D. Gajski, editor), Addison-Wesley 
Publishing Co., pp. 311-360, 1988. 
[SiBN82] D. P. Siewiorek, C. G. Bell and A. Newell, Computer Structures: 
Principles and Examples, McGraw-Hill, 1982. 
[Sout83] J. R. Southard, "MacPitts: An Approach to Silicon Compilation," 
Computer, vol. 16, no. 12, pp. 74-82, 1983. 
[Shun91] C. B. Shung et al., "An Integrated CAD System for Algorithm-
Specific IC Design," IEEE Transactions on Computer-Aided Design 
of Integrated Circuits and Systems, vol. 10, no. 4, pp. 447-463, 1991. 
[TrDi89] M. T. Trick and S. W. Director, "LASSIE: Structure to Layout 
for Behavioral Synthesis Tool," Proceedings of the 26th Design 
Automation Conference, pp. 104-109, 1989. 
[TsSi86] C. J. Tseng and D. P. Siewiorek, "Automated Synthesis of Data Path 
in Digital Systems," IEEE Transactions on Computer-Aided Design 
of Integrated Circuits and Systems, vol. CAD-5, no. 3, pp. 379-395, 
1986. 
[TLWN90] D.E. Thomas, E.D. Lagnese, R.A. Walker, J.A. Nestor, J.V. 
Rajan and R.L. Blackburn, Algorithmic and Register-Transfer Level 
Synthesis: The System Architect's Workbench, Kluwer Academic 
Publishers, Boston, 1990. 
[VTI88] "Data Path Library," VLSI Technology, Inc., 1988. 
[WuGa89] Allen C-H Wu and D. D. Gajski, "SLAM: An Automated Structure 
to Layout Synthesis System," Info. & Computer Science Dept., UCI, 
Tech. Rep. 89-40, 1990. 
[WuCG90] Allen C-H Wu, G. D. Chen and D. D. Gajski, "Silicon Compilation 
from Register-Transfer Schematics," Proceedings of the International 
Symposium on Circuits and Systems, pp.2576-2579, 1990. 
[WuGa90] Allen C-H. Wu and D.D. Gajski, "Partitioning Algorithms for 
Layout Synthesis from Register-Transfer Netlists," Proceedings of the 
International Conference on Computer-Aided Design, pp. 144-147, 
1990. 
[WuCG91] A. C-H Wu, V. Chaiyakul and D. D. Gajski, "Layout Area Models for 
High-Level Synthesis," Proceedings of the International Conference on 
Computer-Aided Design, 1991. 
[WuGa91] Allen C-H Wu and D. D. Gajski, "Layout-Driven Allocation for High 
Level Synthesis," Info. & Computer Science Dept., UCI, Tech. Rep. 
91-30, 1991. 
[Zimm88] G. Zimmermann, "A New Area and Shape Function Estimation 
Technique for VLSI Layouts," Proceedings of the 25th Design 
Automation Conference, pp. 60-65, 1988. 
