A system for microarchitecture and logic optimization by Zanden, Nels Vander
UC Irvine
ICS Technical Reports
Title
A system for microarchitecture and logic optimization
Permalink
https://escholarship.org/uc/item/0qg124vr
Author
Zanden, Nels Vander
Publication Date
1991
 
Peer reviewed
A SYSTEM FOR MICROARCHITECTURE AND 
LOGIC OPTIMIZATION 
by 
Nels Vander Zanden
Information and Computer Science Department 
University of California, Irvine 
Irvine, CA. 92717 
Technical Report 91-39 
ABSTRACT 
This thesis spans two levels of the design process by examining optimization at 
both the register-transfer level and at the logic level. More specifically, this thesis 
addresses the following two problems: 1) performing logic synthesis for custom layout 
rather than the traditional approach that focuses on synthesis for standard cells, and 2) 
performing optimization for custom layout from register-transfer level netlists. Thus 
optimization is performed on the microarchitecture design and at a lower level for individual
microarchitecture components.
First, techniques are introduced for generating gate-level netlists that take advan-
tage of custom layout capabilities. Such techniques include limiting serial/parallel 
transistor chains, transistor sizes, and capacitive loads in forming complex gates. These 
considerations have not been incorporated in previous logic synthesis systems.
Second, techniques are introduced for improving the microarchitecture structure 
and using estimates from lower-level optimization tools to guide microarchitecture
design optimizations that attempt to meet user specified area and time constraints. 
These techniques include the capability for mixing layout styles such as custom layout 
for random-logic components and bit-slicing for regularly structured components. In 
this manner the entire design, control logic and datapath, can be optimized at the same 
time. Further, this paper presents a new methodology for microarchitecture-level 
optimization that greatly reduces the amount of technology-specific knowledge neces-
sary to perform the optimizations. 

A SYSTEM FOR MICROARCHITECTURE
AND LOGIC OPTIMIZATION 
Nels B. Vander Zanden, Ph.D. 
Department of Computer Science 
University of Illinois at Urbana-Champaign, 1991 
Daniel D. Gajski, Advisor 
In recent years the drive to produce more complex integrated circuits while 
spending less design time has driven the demand for design automation tools. The 
search for design automation methods has resulted in the design of numerous 
behavioral synthesis and logic synthesis tools. This thesis spans two levels of the
design process by examining optimization at both the register-transfer level and at 
the logic level. More specifically, this thesis addresses the following two problems: 
1) performing logic synthesis for custom layout rather than the traditional approach 
that focuses on synthesis for standard cells, and 2) performing optimization for custom
layout from register-transfer level netlists. Thus optimization is performed on
the microarchitecture design and at a lower level for individual microarchitecture 
components. 
First, techniques are introduced for generating gate-level netlists that take ad-
vantage of custom layout capabilities. Such techniques include limiting 
serial/parallel transistor chains, transistor sizes, and capacitive loads in forming 
complex gates. These considerations have not been incorporated in previous logic 
synthesis systems. 
Second, techniques are introduced for improving the microarchitecture struc-
ture and using estimates from lower-level optimization tools to guide microarchitec-
ture design optimizations that attempt to meet user specified area and time con-
straints. These techniques include the capability for mixing layout styles such as 
custom layout for random-logic components and bit-slicing for regularly structured 
components. In this manner the entire design, control logic and datapath, can be 
optimized at the same time. Further, this paper presents a new methodology for 
microarchitecture-level optimization that greatly reduces the amount of 
technology-specific knowledge necessary to perform the optimizations. 

A SYSTEM FOR MICROARCHITECTURE 
AND LOGIC OPTIMIZATION 
BY 
NELS BLAKE VANDER ZANDEN
B.S.C.I.S., Ohio State University, 1984 
M.S., University of Illinois, 1986 
THESIS 
Submitted in partial fulfillment of the requirements 
for the degree of Doctor of Philosophy in Computer Science 
in the Graduate College of the 
University of Illinois at Urbana-Champaign, 1991 
Urbana, Illinois 

© Copyright by
Nels B. Vander Zanden 
1991 

To my father, 
James Vander Zanden
ACKNOWLEDGEMENTS
I had the good fortune of working with a top-notch advisor, Dr. Daniel Gajski, 
while pursuing my Ph.D. He contributed to this work in many ways not only in 
terms of technical discussions, ideas, and advice, but also in terms of a great deal of 
encouragement and inspiration. I have learned much from him and am greatly 
honored to have worked with him. 
My father also deserves credit for ensuring that I always had the best when it 
came to education. His support and confidence in me and my abilities have always 
given me great strength and reassurance. Together with my brother Brad, I have 
received much love and friendship and known that they are always there in times of
need. 
I'm also extremely thankful for the exceptional work environment. My colleagues
in CADLAB were more than just officemates, they were a great set of
friends. They include Dr. Nikil Dutt, Joe Lis, Tedd Hadley, Frank Vahid (and his 
wife Amy, an honorary member of CADLAB), Allen Wu, Jim Fradkin, Sanjiv 
Narayan, Viraphol Chaiyakul, Young Kim, Jim Kipps, Gwo-Dong Chen, Elke Run-
densteiner, Rajesh Gupta, Loganath Ramachandran, Salman Iqbal, Adil Haque, and 
Kenichi Kanehara. They have all been of assistance in some manner on this disser-
tation and hence they deserve much credit as well. In addition I shared many ex-
tracurricular activities with them including games of tennis and volleyball, movies, 
beach outings, and dinners. They also helped me get through numerous late night 
hours at the lab with entertaining discussions and the occasional midnight snack excursion.
I greatly value each friendship and hope to have the opportunity to
work with them on future projects. 
I would also like to thank my good friends Tim and Mary Kraus for the many 
fun weekends I shared with them in San Diego. In particular, Tim and I have en-
gaged in many activities, discussions, and challenging tennis matches since my early 
days at the University of Illinois. With another good friend, Gary Bolotin, I explored
some of the natural wonders of Southern California with many hikes and
outdoor outings. Such weekends and outings helped me to relax and reapproach 
my work with a fresh perspective. 
Bob Larsen, a Fellow at Rockwell International, also deserves credit. He serves 
as an important link to the industrial world for academic researchers and gave me 
many excellent pointers and suggestions that enhanced the quality of my work. In 
addition he supplied a number of the benchmark examples presented in this disser-
tation. 
Barb Cicone was of great assistance to me in the administrative arena during 
my affiliation with the University of Illinois. She is one of the friendliest and most helpful
administrators I have known. I took great comfort in her ability to cut through red 
tape and her constant willingness to lend a helping hand. 
Finally I would like to thank my committee members -- Professors Michael Fai-
man, Larry Jones, Kent Fuchs, and Alex Veidenbaum. In particular I appreciate 
the help and friendship provided by Dr. Faiman. 
TABLE OF CONTENTS
CHAPTER
1. INTRODUCTION
1.1. Design Synthesis Overview
1.2. Contributions
1.3. Thesis Overview
2. DESIGN PROCESS
2.1. Introduction
2.2. Design Synthesis Process
3. PREVIOUS WORK IN LOGIC SYNTHESIS
3.1. Introduction
3.2. Refinement Techniques
3.3. Optimization Techniques
3.4. Optimization Strategies
4. THE MILO SYSTEM
4.1. Introduction
4.2. Previous Work
4.3. System Architecture
5. LOGIC SYNTHESIS FOR CUSTOM LAYOUT
5.1. Introduction
5.2. Parameters for Custom Layout
5.3. Strategies for Layout Driven Synthesis
5.4. Algorithm for Layout Driven Synthesis
5.5. Results
6. MICROARCHITECTURE OPTIMIZATION
6.1. Introduction
6.2. Types of Microarchitecture Optimization
6.3. Strategies for Microarchitecture Optimization
6.4. Experimental Results
7. CONCLUSION
7.1. Summary
7.2. Future Research
APPENDIX A. LOPT Program Description
A.1. Tutorial
A.2. Input Files
A.3. LOPT Parameter File Format
A.4. LOPT Report File Output Format
A.5. LOPT Usage
APPENDIX B. IIF
B.1. Introduction
B.2. IIF declarations
B.3. IIF expression
B.4. Examples of IIF Usage
B.5. Operator Precedence
APPENDIX C. MILO Program Description
C.1. Introduction
C.2. Milo Usage
C.3. Milo Parameter File Format
C.4. MILO Report File Output Format
BIBLIOGRAPHY
VITA
LIST OF FIGURES
Figure 1. Synthesis Overview
Figure 2. SOCRATES System Architecture
Figure 3. Logic Consultant System Architecture
Figure 4. Factorization for Timing Improvements
Figure 5. LSS System Architecture
Figure 6. Expert Systems
Figure 7. LSS Level 1 Optimizations
Figure 8. Techniques for Reducing Delay Along Critical Paths
Figure 9. Use of a Hash Table to Reduce the Number of Rules
Figure 10. MILO Environment
Figure 11. Tool Interface with the Component Database
Figure 12. Standard Cell Layout vs. Custom Layout
Figure 13. Custom Layout Results Using Synthesis for Standard Cells
Figure 14. Complex Gate Formation
Figure 15. Effect of Transistor Sizing on Path Delay
Figure 16. Effect of Large Transistor Sizes in Complex Gates
Figure 17. Algorithm for Layout Driven Synthesis
Figure 18. Critical Path Reduction Operations
Figure 19. Example of Shortening Critical Path
Figure 20. Area/Delay Considerations for 2-Level Complex Gates
Figure 21. Area/Delay Considerations for 3-Level Complex Gates
Figure 22. Composite Graph of Experimental Results
Figure 23. Minimization Rules
Figure 24. Factorization of Microarchitecture Components
Figure 25. Example of Factorization
Figure 26. Signal Swapping
Figure 27. Merge Similar Units
Figure 28. Three Possible Merging Cases
Figure 29. Results of Merging
Figure 30. Rule for Merging
Figure 31. Merging Unsimilar Units
Figure 32. Component Duplication
Figure 33. Common Subexpression Extraction
Figure 34. Common Subexpression Elimination
Figure 35. Common Subexpression Elimination Example
Figure 36. Overview of Microarchitecture Optimization
Figure 37. Random Logic Grouping
Figure 38. Option for performing ALU functions
Figure 39. Block Diagram of the Rockwell Counter
Figure 40. Three Optimization Approaches for the Rockwell Counter
Figure 41. Block Diagram of Armstrong Counter
Figure 42. Three Optimization Approaches for Armstrong Counter
Figure 43. Block Diagram of DRACO
Figure 44. Three Optimization Approaches for Draco2
Figure 45. Three Optimization Approaches for Draco3
Figure 46. Three Optimization Approaches for Draco Schematic
Figure 47. Layout of Module Generator Design for Draco2
Figure 48. Layout of MISII Design for Draco2
Figure 49. Register with Parallel Load
Figure 50. Four-bit Adder/Subtracter
Figure 51. Tristate Connection
LIST OF TABLES
Table 1. Custom Layout results for MISII and the LDS Algorithm
Table 2. Custom Layout Results Using MCNC Benchmarks
Table 3. Standard Cell Layout Results Using MCNC Benchmarks
Table 4. Time Optimization Results for the Rockwell Counter
Table 5. Area Optimization Results for the Rockwell Counter
Table 6. Time/Area Tradeoffs for Rockwell Counter
Table 7. Time Optimization Results for the Armstrong Counter
Table 8. Area Optimization Results for the Armstrong Counter
Table 9. Area/Time Tradeoffs for the Armstrong Counter
Table 10. Time Optimization Results for Draco2
Table 11. Area Optimization Results for Draco2
Table 12. Time Optimization Results for Draco3
Table 13. Area Optimization Results for Draco3
Table 14. Time Optimization Results for Draco Schematic
Table 15. Area Optimization Results for Draco Schematic
Table 16. Time/Area Tradeoffs for Draco2
Table 17. Time/Area Tradeoffs for Draco3
Table 18. Time/Area Tradeoffs for Draco Schematic
Table 19. Comparison of MILO and MISII Timing Optimization
Table 20. Comparison of MILO and MISII Area Optimization
Table 21. MILO gate vs. MILO module generator designs (Time)
Table 22. MILO gate vs. MILO module generator designs (Area)
Table 23. MISII vs. MILO module generator designs (Time)
Table 24. MISII vs. MILO module generator designs (Area)
CHAPTER 1. 
INTRODUCTION 
Over the past decade tremendous advances have been made in VLSI design. 
Greatly increased circuit complexity, however, continues to outpace engineers' abili-
ties to develop chips with up to a million transistors. Conventional design methods 
have entailed teams of highly skilled and experienced engineers providing many 
months of effort. More recently, companies have been turning to logic synthesis 
tools to help alleviate market pressures that demand lower-cost implementations 
with shorter development times. These tools allow less experienced designers to 
develop application-specific ICs (ASICs), typically gate arrays or standard cells.
Further, the automated systems free a designer from the exploding number of 
details and provide greater time to examine high-level issues and experiment with 
various architectures. Thus such tools increase productivity and provide for better 
complexity management. 
1.1. Design Synthesis Overview 
Synthesis tools employ two basic techniques in producing a final, workable
design. These are refinement and optimization. Refinement is the process of
transforming a behavioral description into an initial design. Usually this design
consists of components such as multiplexors, decoders, and gates such as AND and 
OR. Optimization is the process of transforming the initial design into one that 
meets some set of objectives relating to time, area, etc. Often a design proceeds
through a number of levels of abstraction. At each level, these two stages are performed.
For example, typically refinement and optimization can be used on a generic
representation and later on a technology-specific one.
The synthesis process is shown in Figure 1. Refinement begins with the user's 
behavioral description. The input may take a textual or graphical form. Textual 
representations would be a hardware description language, such as VHDL, a set of 
boolean equations, or a table, such as PLA format. Graphical representations 
include a generic or technology-specific schematic, or a menu-driven interactive pro-
cess that extracts a set of parameters from the user. Behavioral descriptions in 
language form require a compiler to generate an initial netlist consisting of high-
level components such as decoders and registers. Diagrams entered through 
schematic capture may also consist of high-level components. Boolean equations,
PLA format, or low-level schematics are representations for the logic level.
The behavioral description represents a black box whose inputs, outputs, and 
functionality are described. The representation generally does not convey any information
as to the technology type or necessarily any style of implementation -- be it
parallel, pipelined, etc. Since many systems accept different behavioral
specifications, they must first convert them all to one central format, such as a generic
schematic or set of boolean equations. This central format is then used to
create a detailed representation of the design.
[Figure 1. Synthesis Overview. Blocks: Schematic Capture, Behavioral Compiler, Microarchitecture Optimizations, Logic Optimizer, Layout System.]
The design created in the refinement stage is usually by no means optimal and 
thus the optimization phase is called upon to improve performance. Figure 1 
identifies two levels at which optimization can take place -- at the microarchitecture 
level and the logic level. One could add other intermediate levels. A logic generator 
is required to decompose the design into the lower level.
Optimization can be guided by two different strategies. The first is to optimize 
everything until no further improvements can be made. When optimizing for multi-
ple constraints there may be a weighting or priority scheme to determine which 
design improvements should be performed. Essentially, this is a "brute-force" 
method as optimizations tend to be made in random parts of the circuit in no 
directed fashion. The second strategy is constraint driven. It relies on a set of 
user-entered parameters to direct the flow of optimization. For example, optimiza-
tions may be applied initially only along a critical path when attempting to meet 
timing constraints. Once one constraint is met, the optimizer will turn to another 
and work toward meeting the user's objectives. Once these are satisfied, the optimi-
zation stage may end or may apply the first strategy. 
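The constraint-driven strategy described above can be sketched as a simple loop. This is only an illustration under stated assumptions, not the algorithm used in this thesis: the `Design` record, the `speed_up` and `shrink` transformations, and the choice to address timing before area are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Design:
    delay: float  # critical-path delay, arbitrary units
    area: float   # total area, arbitrary units


def optimize(design, max_delay, max_area, speed_up, shrink, max_steps=100):
    """Constraint-driven loop: work on timing first, then on area.

    speed_up and shrink are hypothetical caller-supplied transformations.
    Each returns a new Design, or None when it has nothing left to apply.
    """
    for _ in range(max_steps):
        if design.delay > max_delay:
            candidate = speed_up(design)   # e.g. restructure the critical path
        elif design.area > max_area:
            candidate = shrink(design)     # e.g. merge or share components
        else:
            break                          # every constraint is met
        if candidate is None:
            break                          # no applicable improvement remains
        design = candidate
    return design
```

The step limit guards against the case where an area reduction lengthens the critical path again and the loop would otherwise alternate between the two goals.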
1.2. Contributions 
This thesis discusses three aspects of the design synthesis process: 1) logic
optimization, 2) microarchitecture optimization, and 3) the integration of logic and
microarchitecture optimization tools into a larger behavior to layout synthesis tool.
In particular, it describes MILO, a system that performs both microarchitecture and
logic optimization.
The following novel features are introduced in this thesis: 
1. Control algorithms and rules for optimization of the microarchitecture. In
previous systems, the microarchitecture design is not fully optimized before 
decomposing the design into logic gates. Techniques are described in this thesis 
for optimizing components at a microarchitecture level before they are decom-
posed into gates. Such optimizations are nearly impossible to perform after 
decomposition. 
2. Development of algorithms and a tool that performs logic synthesis for custom
layout. Traditional logic synthesis tools optimize for standard cell and gate
array implementations. Additional parameters need to be considered when syn-
thesizing for custom layouts. 
3. Use of evaluation in the design process. Feedback to the microarchitecture
optimizer from lower level optimization tools permits reexamination of the 
high-level design. By using logic optimization and technology mapping tools on 
each microarchitecture component, the microarchitecture optimizer can obtain 
accurate estimates of each component's delay and area. Modifications can then 
be made to bring the microarchitecture design closer to the user-specified con-
straints. 
4. Incorporation of layout styles into microarchitecture optimization. Microarchitecture
components can be laid out in various styles including standard cell,
bit-sliced, and custom layout. The output of the microarchitecture optimizer is 
passed to a placement tool that forms a bit-sliced datapath section and then 
places random logic around the datapath. Giving the microarchitecture optim-
izer the capability to select different layout styles increases its ability to meet 
the user constraints for time and area. 
5. New methodology for tool integration. In traditional optimization systems, a
logic synthesis tool is tied in with higher level synthesis tools. Using the new
methodology, logic and layout optimization tools are hidden from the microar-
chitecture optimizer by a component database. The microarchitecture optim-
izer accesses the database to perform low-level optimizations. In this manner, 
logic optimization tools can be added or removed independently of the microar-
chitecture optimizer. 
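The tool-integration idea in item 5 can be illustrated with a small facade. This is a minimal sketch under assumptions: the class name, method names, and the `(delay, area)` return shape are hypothetical and do not describe MILO's actual database interface.

```python
from typing import Callable, Dict, Tuple

Estimate = Tuple[float, float]  # (delay, area), arbitrary units


class ComponentDatabase:
    """Hypothetical facade that hides low-level tools from the optimizer."""

    def __init__(self) -> None:
        # Maps a layout style name to the tool that evaluates components in it.
        self._tools: Dict[str, Callable[[str], Estimate]] = {}

    def register_tool(self, style: str, tool: Callable[[str], Estimate]) -> None:
        """Add a low-level tool without touching the optimizer."""
        self._tools[style] = tool

    def remove_tool(self, style: str) -> None:
        """Remove a tool; callers of estimate() are unaffected until they ask for it."""
        self._tools.pop(style, None)

    def estimate(self, component: str, style: str) -> Estimate:
        """The only call the microarchitecture optimizer needs to know about."""
        return self._tools[style](component)
```

Because the optimizer only calls `estimate`, a logic optimizer can be swapped by registering a different callable under the same style name.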
1.3. Thesis Overview 
The thesis is organized as follows. Chapter two outlines the necessary parts of 
a behavior to layout synthesis system. It discusses the responsibilities of each part 
and shows how logic and microarchitecture optimization can be applied. Chapter 
three surveys previous work in the logic synthesis area. Chapter four provides an 
overview of the microarchitecture optimizer and shows how it incorporates the logic
optimizer to get low-level design statistics. The chapter also shows how the
microarchitecture optimizer fits into a larger system that produces layout from a
behavioral description. Further, it illustrates the types of commands used to
retrieve information from the database and how the database is used to hide logic-level
details from the microarchitecture optimizer. Chapter five describes a new
approach to logic synthesis for custom layout. Chapter six discusses issues in
microarchitecture optimization and presents an algorithm for optimization of
microarchitectural designs. Finally, chapter seven summarizes the contributions
made by this work and examines some future work that can be performed. 
CHAPTER 2. 
DESIGN PROCESS 
2.1. Introduction 
Generation of digital hardware generally passes through four stages of develop-
ment: behavior, microarchitecture, logic, and layout. Behavior describes the func-
tionality of the hardware and has often been written using VHDL, Pascal-like 
languages or C. Behavioral synthesis tools convert these descriptions into a 
microarchitecture structure called a Register-Transfer-Level design. This structure 
consists of components such as ALUs, registers, counters, and multiplexors. Each of
these components can in turn be expanded into a logic-level design consisting of 
gates and flip-flops. Finally a layout can be generated from transistors that com-
pose each gate. 
Today's designers are increasingly able to enter their designs at higher levels of 
abstraction. Recently a number of tools that can translate a behavioral description 
to structure have been developed. Some of these tools are tuned to a particular style 
of architecture and hence little further optimization is required on the microarchi-
tecture level. Other tools produce varying styles of architecture usually involving 
control and datapath sections. As these architectures are more general, they tend 
to be less polished and more optimization of their microarchitecture structure is 
required. 
2.2. Design Synthesis Process 
Transforming a behavioral description into layout requires the work of a
number of stages: behavioral synthesis, microarchitecture optimization, logic optimi-
zation, floorplanning, and layout. This section describes the goals and interactions 
of these tools. 
2.2.1. Behavioral Synthesis
Behavioral synthesis tools convert a behavioral description into a dataflow
graph with each node representing a functional operator (such as add or compare)
[CaRo85] [OrGa86] [McPa88]. These operations must be assigned to a control step,
through the process of scheduling [PaGa87] [PaKn87], that chooses a point in time
at which the operation will be performed. In addition, the operator is assigned to a 
particular hardware module, through the process of binding [TsSi86] [PaPM86]. 
During this process, the synthesis tool explores multiple designs and attempts to 
determine which designs appear most likely to meet the set of user constraints. Esti-
mators are employed to guide the synthesis tool towards one or more such designs. 
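The scheduling and binding steps described above can be illustrated with a minimal Python sketch. It uses as-soon-as-possible (ASAP) scheduling and a naive binding scheme purely for illustration; these are not necessarily the algorithms of the cited tools. The function names, the one-control-step-per-operation assumption, and the example graph are all hypothetical.

```python
from collections import defaultdict


def asap_schedule(ops, deps):
    """Assign every operation the earliest control step its inputs allow.

    ops  : list of operation names (nodes of the dataflow graph)
    deps : dict mapping an operation to the operations whose results it uses
    Assumes an acyclic graph and one control step per operation.
    """
    step = {}

    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(p) for p in deps.get(op, ())), default=0)
        return step[op]

    for op in ops:
        visit(op)
    return step


def bind_modules(step, op_type):
    """Naive binding: operations of one type share modules across control steps."""
    taken = defaultdict(int)  # (type, control step) -> modules already in use
    binding = {}
    for op in sorted(step, key=lambda o: (step[o], o)):
        key = (op_type[op], step[op])
        binding[op] = f"{op_type[op]}_unit{taken[key]}"
        taken[key] += 1
    return binding


# Hypothetical dataflow graph: two additions feed a comparison,
# whose result feeds a third addition.
ops = ["add1", "add2", "cmp1", "add3"]
deps = {"cmp1": ["add1", "add2"], "add3": ["cmp1"]}
op_type = {"add1": "add", "add2": "add", "cmp1": "cmp", "add3": "add"}

schedule = asap_schedule(ops, deps)
binding = bind_modules(schedule, op_type)
```

In this example the third addition lands in a later control step than the first two, so it can reuse one of their adders; two adders and one comparator suffice.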
Estimates for behavioral synthesis tools are usually obtained in one of two 
fashions. The first technique uses a set of formulas that when given a component 
type (ALU, register, etc.) and its set of parameters (e.g., number of inputs,
architecture style, technology type) produces a rough estimate of the time and area.
Such estimates are not finely tuned but help to weed out unacceptable designs. A 
second technique is to expand the design into a lower level design consisting of 
gates, possibly even mapping the gates into a technology-specific library. Alterna-
tively, a high-level floorplan of the microarchitecture components can be generated 
to obtain a feel for design characteristics. These methods require more time to pro-
duce the estimates and will usually be reserved for use when the number of possible 
designs has been greatly narrowed. 
The use of estimators allows the behavioral synthesis tool to select an overall
architecture by making decisions on the number of busses, use of pipelining, etc. In 
addition, the synthesis tool attempts to minimize the number of connections 
between modules and reduce the total number of modules. An appropriate archi-
tectural style must also be chosen for each microarchitecture component, such as 
ripple-carry or carry lookahead when using an adder. Because the estimates are 
only a rough predictor of the final design after layout, more rigorous analysis and 
optimization is required of the microarchitecture design. Thus there is a need for a 
microarchitecture optimization tool. 
2.2.2. Microarchitecture Optimization
The major goals of microarchitecture optimization are to: (a) remove inefficient
constructs, (b) select a style of architecture for each component that suits the
area/time requirements, (c) insert buffers on outputs that have a high fanout, (d)
select which microarchitecture components to combine and perform logic optimization
on as a single unit, and (e) select a layout style for each microarchitecture component
such as PLA, random logic, bit-sliced, etc. Once the initial microarchitecture
structure has been cleaned up, the optimizer has two options in producing the
final design: (a) completely expand and optimize or (b) only partially expand and 
optimize. 
The first approach is to combine all components into a single combinational 
block and optimize. Logic optimization tools have been shown to be very effective 
for reducing the area of a design or restructuring logic to meet timing constraints. 
This approach may not be the best, however. First, logic optimization of large 
designs may require large amounts of CPU time and memory. The same will be 
true in the layout phase when floorplanning is performed. Second, some optimiza-
tions can be made at the microarchitecture level that cannot be made at the logic 
level. 
The second approach involves only a partial expansion of the design. Various 
groups of the components can be combined into a single component and optimized. 
For example, random logic gates can be grouped together and passed to a logic 
optimization tool while more regularly structured components such as ALUs are 
optimized separately and not combined with the surrounding logic. Sometimes logic 
optimization can be avoided if the random logic is implemented as a PLA. Higher 
level components such as ALUs and registers can be implemented from libraries of
prefabricated integrated circuits, for example, components from the TTL 7400
series library. Another alternative is the use of layout module generators.
Module generators can be used for components with a regular style of architec-
ture such as ALUs and registers. For these components a one-bit layout slice is 
generated and then replicated based on the component's bit width. The bit-sliced 
layout will typically be more compact than what could be generated using standard 
cell or custom layout generators to lay out the same logic from a random logic
description. Thus using module generators for components with regular structures 
will usually result in denser layouts. In some circumstances, however, a component 
such as an ALU can be combined with surrounding logic to reduce the number of 
gates. This saving of gates may produce a smaller layout than if module generators
had been used. Thus the microarchitecture optimizer must be able to discover such
conditions. 
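Goal (c) of microarchitecture optimization, buffering outputs with a high fanout, can be sketched as follows. This is a minimal illustration, not the optimizer's actual procedure: the function name, the single level of buffering, the fanout limit, and the example nets are all assumptions.

```python
import math


def insert_buffers(fanout, max_fanout):
    """Return a net -> loads map in which no net drives more than max_fanout loads.

    fanout     : dict mapping a driving net to the list of loads it feeds
    max_fanout : hypothetical fanout limit per output
    Inserts a single level of buffers, so it requires len(loads) <= max_fanout ** 2.
    """
    buffered = {}
    for net, loads in fanout.items():
        if len(loads) <= max_fanout:
            buffered[net] = list(loads)
            continue
        n_buffers = math.ceil(len(loads) / max_fanout)
        if n_buffers > max_fanout:
            raise ValueError(f"{net}: one buffer level is not enough")
        names = [f"{net}_buf{i}" for i in range(n_buffers)]
        buffered[net] = names  # the original net now drives only the buffers
        for i, name in enumerate(names):
            buffered[name] = loads[i * max_fanout:(i + 1) * max_fanout]
    return buffered


# Hypothetical netlist fragment: a select line feeding five multiplexor inputs.
netlist = {"sel": ["m0", "m1", "m2", "m3", "m4"], "carry": ["alu0"]}
buffered = insert_buffers(netlist, max_fanout=3)
```

A real optimizer would also weigh the delay added by each buffer against the delay saved by the reduced load; that trade-off is omitted here.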
2.2.3. Floorplanning/Layout 
As mentioned earlier, different layout styles can be used for each component. 
The floorplanner employs a set of layout tools to produce the layout for each
microarchitecture component. Then it needs to determine how to arrange these 
different layout styles on the same integrated circuit. For example, bit-sliced com-
ponents are grouped into one or more datapaths, each datapath containing com-
ponents of approximately the same bit-width. Components in each datapath are 
arranged in a vertical fashion, forming a bit-sliced stack. Random logic components 
can then be placed around the border of the bit-sliced stack and into unfilled spaces 
that result from bit-slice mismatches in the stack. 
The floorplanning tool must also take into account routing concerns. For
example, components along the critical path should be placed close together to 
reduce the amount of delay caused by routing. Timing information can be supplied 
by the microarchitecture optimizer for this purpose. 
CHAPTER 3. 
PREVIOUS WORK IN LOGIC SYNTHESIS 
3.1. Introduction 
A number of automated logic synthesis systems have been previously reported: 
LSS [JoTr86], SOCRATES [GrBa86], MIS [BrRu87], BOLD [BoHa87], DAGON
[Ke87]. Commercially available tools include the Logic Consultant [Kim87], the
Synopsys Design Optimizer, and Silc Technologies Silcsyn. These systems have used
algebraic, language compiler, and expert system techniques to reduce circuit delay 
and gate count in combinational logic. In general, such systems are best suited for 
optimizing random logic that can easily be represented by boolean equations or 
PLA format. An exception is LSS which accepts a register-transfer language as its 
input, allowing entry of high-level operators. However, only limited types of optimi-
zation are performed on the high level operators before they are decomposed into 
gates. Once a design is at gate level it is impossible to recover high-level informa-
tion that may be necessary for restructuring the design in order to meet timing or 
area constraints. 
3.2. Refinement Techniques
3.2.1. SOCRATES 
Figure 2 shows the SOCRATES system. SOCRATES accepts boolean equa-
tions, PLA format, or a netlist as a behavioral description. Two-level boolean equa-
tions are used as the central format. Multi-level boolean equations can be extracted 
from the netlist, then run through an expander to generate the two-level format. 
Likewise, an extractor can be used to generate two-level boolean equations from 
PLA format. An algebraic minimizer, ESPRESSO IIC, minimizes the two-level equations,
taking advantage of don't care conditions. The next step uses weak-division
to find common subterms in the design and form multi-level equations, thereby
reducing the amount of logic required to implement the function. This minimized 
design is written back into netlist form from which the logic optimizer will operate. 
The optimizer is discussed in a later section. 
Figure 2. SOCRATES System Architecture (blocks shown: Extractor, Expander,
Espresso Minimizer, Weak Division, Synthesis, Logic Optimizer)
3.2.2. Logic Consultant
Figure 3. Logic Consultant System Architecture (blocks shown: Extractor,
Decomposition, Minimization, Factorization, Cell Selection, Optimization)
The modules comprising the Logic Consultant system are displayed in Figure 3.
Like SOCRATES, the Logic Consultant accepts boolean equations, PLA format,
and netlists generated from schematics as input to its system. Schematics can be
created on a Mentor workstation using components from Mentor's generic library
(GENLIB) or components from a technology-specific library. A decomposition
module converts the input description into components from the Trimeter Generic
Library (TGL) and isolates the combinational logic for processing by the minimization
module. Technology-specific components are replaced with TGL components
through user-entered rules in the knowledge base. For MSI components (i.e., multiplexors,
decoders, ALUs, etc.) this replacement is optional. The user specifies
whether the MSI elements should be decomposed into TGL gates (via the rule base) 
or should be left "as is". Those components that are not decomposed are treated 
like sequential logic elements and removed from the design that is passed to the 
minimizer. Thus if an MSI component is not decomposed, it will not be optimized 
with the surrounding logic. This strategy poses a problem since in decomposing the 
MSI components, one loses the advantage of using them. Typically high-level com-
ponents are added to a technology library because the designer of that component 
constructed it in such a way that it takes up less area and has greater speed than a 
corresponding gate implementation. But once a component is decomposed for 
optimization, it may be difficult or impossible to find after flattening and refactoring 
the circuit. However, if the components are not decomposed, no optimization is 
performed. As will be seen later in this thesis, a better solution is to perform some 
optimizations at the microarchitectural level. 
The minimizer module accepts combinational logic consisting entirely of TGL 
components from the decomposition module. The minimizer then develops a two-
level SOP design and removes redundant terms. This new design will be passed to 
the factorization module. For some circuits it may be desirable to bypass the 
minimization module and go directly to factorization. This is the case when the 
generated two-level SOP form contains an explosive number of terms. These cir-
cuits require extensive CPU time and after factorization typically contain more com-
ponents than the original design. For example, certain multi-level circuits may 
require many times their number of circuit elements to be represented in a two-level 
format. As factorization is performed on a local basis (and not globally), one cannot 
guarantee that the factorization will be optimal. Hence factorization may be unable 
to reduce the gate count to the prior number of elements. Designs with a large
number of XOR and XNOR gates are one example of this type of circuit. Thus the 
Logic Consultant's minimizer does not produce a better design in all cases. 
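The blow-up for XOR-rich designs can be made concrete with a short Python sketch (the function names are illustrative, not part of any tool described here). An n-input XOR is true for every odd-parity input combination, and no two of those combinations differ in exactly one input, so none of them can be merged into a smaller product term.

```python
from itertools import product


def parity_minterms(n):
    """Input combinations for which an n-input XOR evaluates to 1."""
    return [bits for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1]


def mergeable(m1, m2):
    """Two minterms combine into one product term only if they differ in one input."""
    return sum(a != b for a, b in zip(m1, m2)) == 1


def sop_terms_for_xor(n):
    """Number of product terms in the two-level SOP form of an n-input XOR."""
    terms = parity_minterms(n)
    # Any two odd-parity minterms differ in at least two inputs, so none merge.
    assert not any(mergeable(a, b) for a in terms for b in terms)
    return len(terms)


# Two-level term count for XORs of 2 to 8 inputs.
growth = {n: sop_terms_for_xor(n) for n in range(2, 9)}
```

The two-level form therefore needs 2^(n-1) product terms of n literals each, while a multi-level chain of two-input XOR gates needs only n - 1 gates.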
The factorization module attempts to factor out common terms to produce a 
multi-level design. It also factors in such a way to reduce the delay along the long-
est path. For example, consider Figure 4. The bottom input of the AND gate is 
part of the longest path. This path can be shortened by factoring the 3-input AND 
into two 2-input AND gates. Thus some timing considerations are taken into 
account. Note however that the longest path may not always be a critical path.
When this is the case, the assumption that it is a critical path will prevent optimi-
zation for area along that path. Hence, such a factorization strategy is not truly 
constraint driven. 
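The timing effect of this factorization can be sketched with a small arrival-time calculation. The linear delay model below (delay grows with the number of gate inputs) and its coefficients are assumptions made only for illustration; they are not taken from the Logic Consultant.

```python
def gate_delay(fanin):
    """Hypothetical delay model: each extra input makes the gate slower."""
    return 1.0 + 0.5 * fanin


def and_arrival(input_arrivals):
    """Output arrival time of a single AND gate."""
    return max(input_arrivals) + gate_delay(len(input_arrivals))


def factored_and3_arrival(early_a, early_b, late):
    """3-input AND split into two 2-input ANDs; the late input enters last."""
    first_stage = and_arrival([early_a, early_b])
    return and_arrival([first_stage, late])


# The late input lies on the longest path and arrives at time 10.
single_gate = and_arrival([0.0, 0.0, 10.0])          # 12.5
factored = factored_and3_arrival(0.0, 0.0, 10.0)     # 12.0

# When no input is late, the same factorization only adds delay (and a gate).
single_gate_balanced = and_arrival([0.0, 0.0, 0.0])        # 2.5
factored_balanced = factored_and3_arrival(0.0, 0.0, 0.0)   # 4.0
```

The second pair of values illustrates the point made above: factoring for speed along a path that is long but not critical spends area, and here even delay, without helping to meet any constraint.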
Factorization continues until no further common terms or timing improvements 
can be found. The Logic Consultant factors all paths as completely as possible. 
Since this includes critical paths, certain factorizations must be undone at a later
time.
Figure 4. Factorization for Timing Improvements (gate inputs A, B, C, D, with the
longest path marked)
3.2.3. LSS
The LSS system architecture is shown in Figure 5. LSS proceeds through four
different description levels to produce an optimized technology-specific circuit:
high-level, AND/OR, NAND/NOR, and technology specific. Each level requires a
translator to produce its description format from a higher-level description. Likewise,
each level has an optimizer to apply simplifying transformations. By introducing
multiple levels, the designers of LSS followed the example of programming
language optimizing compilers that produce several intermediate descriptions before
generating assembly-language code [DaJo80]. Changes can be made through a
number of levels, thereby simplifying analysis and optimization at each level. For
example, simplification can be made in terms of generic components, then converted
to technology-specific components for further optimization. Attempting to transform
a high-level description directly into a low-level one is a much tougher task. Also,
by using generic levels, only the technology-specific optimizations need be rewritten
when the technology changes.
Figure 5. LSS System Architecture (Level 1: Logic Optimization; Level 2: AND/OR
Optimization; Level 3: NAND/NOR Optimization; Level 4: Technology-Specific
Optimization)
LSS begins with an algorithmic representation using a register-transfer
language. Using a simple translator, LSS produces a logic graph representing the 
design. Translation is straightforward as operators in the behavioral description are 
assigned to nodes in the graph. Each node is a generic component (such as an AND 
gate or decoder) and is connected via the graph's edges to other components. 
Optimization at this level is discussed in a later section. 
Another translator is called to produce the second level AND/OR description.
It decomposes high-level components, such as decoders, into AND/OR/NOT gates.
Optimizations are then applied at this level. The third description level,
NAND/NOR, consists of only NAND and NOR gates. Once again, the translator
that produces this description uses naive transformations that may
produce unnecessary NANDs and NORs. These "extra" gates are removed by the
optimizer at this level. Finally, LSS provides a translator to produce a technology-
specific design using components from a technology library that consists of gate-
array macros (such as NAND/NOR gates, complex AND/OR gates, etc). This tech-
nology description may then be optimized. 
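The translate-then-clean-up pattern at the NAND/NOR level can be sketched as follows. This is an illustrative Python model, not LSS code (LSS manipulates a logic graph through PL/I procedures); the tuple representation, the function names, and the choice to model an inverter as a one-input NAND are all assumptions.

```python
def to_nand_nor(expr):
    """Naive gate-by-gate translation of an AND/OR/NOT expression.

    Variables are strings; gates are tuples such as ("AND", x, y).
    An inverter is modelled as a one-input NAND (an assumption of this sketch).
    """
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [to_nand_nor(a) for a in args]
    if op == "NOT":
        return ("NAND", args[0])
    if op == "AND":
        return ("NAND", ("NAND", *args))   # AND = inverted NAND
    if op == "OR":
        return ("NAND", ("NOR", *args))    # OR = inverted NOR
    raise ValueError(f"unknown operator {op!r}")


def is_inverter(expr):
    return isinstance(expr, tuple) and expr[0] == "NAND" and len(expr) == 2


def remove_double_inverters(expr):
    """Clean-up pass: an inverter feeding an inverter is replaced by a wire."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    expr = (op, *[remove_double_inverters(a) for a in args])
    if is_inverter(expr) and is_inverter(expr[1]):
        return expr[1][1]
    return expr


def count_gates(expr):
    if isinstance(expr, str):
        return 0
    return 1 + sum(count_gates(a) for a in expr[1:])


source = ("NOT", ("AND", "a", ("OR", "b", "c")))
naive = to_nand_nor(source)
cleaned = remove_double_inverters(naive)
```

For this example the naive translation produces five gates, two of which form a back-to-back inverter pair; the clean-up pass leaves three.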
3.3. Optimization Techniques 
After refinement, synthesis tools call upon optimizers to improve the design. 
Their optimization techniques fall into one of the three expert system types shown 
in Figure 6. Each consists of some type of blackboard and knowledge base. The 
blackboard is where the design resides and can be examined by the knowledge base. 
Included in the blackboard is usually a netlist describing the design, statistics on 
area and path delays, a set of user constraints, and work space for the system to 
evaluate various attempts at refinement or optimization. The knowledge base con-
sists of a set of rules and algorithmic techniques that utilize the blackboard data 
structures and make modifications to them. Generally algorithms are assigned 
structured and well-defined tasks while rules handle unusual or loosely defined 
problems. 
3.3.1. Rules Only Strategy 
The first type of expert system is entirely rule based. This approach generally 
lacks structure as rules are entered essentially independently of one another. The 
control unit selecting a rule to fire must examine all rules to determine which rules 
are applicable. Any of the rules whose conditions are satisfied can be fired. It is up 
to some rule selection mechanism to choose one rule from that set. An early implementation
of this type of system was R1 [McDe82], a rule-based system for
configuring VAX-11/780 computers. A more recent version of a strictly rule-based
system is the Logic Consultant. Both of these systems use the OPS production
system language to write rules for their knowledge base. R1 was written in OPS4;
the Logic Consultant uses OPS83, the latest OPS version.
Figure 6. Expert Systems (three panels: Rules; Rules/Algorithms (Mixed Strategy);
Algorithms. Each panel shows a Blackboard and a Knowledge Base; Control Strategy
blocks also appear.)
Each rule consists of a
set of conditions and a set of actions to be performed when all of the conditions are 
met. Control in OPS is exercised through a recognize-act cycle. In this cycle all 
rules whose If-conditions are satisfied are found using the Rete match algorithm 
[BrFa85], then one of these rules is selected to be applied. Briefly, the Rete algo-
rithm compiles all of the rule conditions in an OPS program into a set of attributes 
whose values can be tested. These attributes are placed in a tree-structured sorting 
network. In doing so, it allows similar tests on attribute values to be shared among 
different rules. Further, once a test has been performed on a tree node, it is not 
redone until a change in data occurs upon which the attribute is dependent. These 
features make the Rete algorithm much more efficient than a simple pattern
matcher that examines each rule's conditions on every recognize-act cycle to deter-
mine which rules apply. 
From this set of applicable rules, called the conflict set, a single rule must be 
selected. OPS has a conflict resolution scheme to choose this rule based on a 
number of tests through which rules are eliminated from the conflict set. Priority is 
given to rules in the following order [Fo85]: rules that have not been executed, rules 
whose first attribute (value of the first condition) has been most recently altered, 
rules with the most conditions, rules that entered the conflict set most recently. 
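A recognize-act cycle with conflict resolution can be sketched in a few lines of Python. This is a deliberately simplified model: it re-matches every rule on every cycle (the naive matcher that Rete improves upon) and applies only two of the tests listed above, preferring rules that have not yet fired and then rules with the most conditions. The `Rule` class, the fact names, and the example rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    conditions: frozenset   # facts that must all be present in working memory
    actions: frozenset      # facts added to working memory when the rule fires


def recognize_act(rules, memory, max_cycles=100):
    """Run recognize-act cycles until no unfired rule matches."""
    memory = set(memory)
    fired = []
    for _ in range(max_cycles):
        # Recognize: build the conflict set of applicable, not-yet-fired rules.
        conflict_set = [r for r in rules
                        if r.conditions <= memory and r.name not in fired]
        if not conflict_set:
            break
        # Resolve: prefer the most specific rule (the one with the most conditions).
        chosen = max(conflict_set, key=lambda r: len(r.conditions))
        # Act: apply the chosen rule's actions to working memory.
        memory |= chosen.actions
        fired.append(chosen.name)
    return fired, memory


rules = [
    Rule("general", frozenset({"and_gate"}), frozenset({"nand_inv"})),
    Rule("specific", frozenset({"and_gate", "critical"}), frozenset({"sized_up"})),
    Rule("finish", frozenset({"nand_inv", "sized_up"}), frozenset({"done"})),
]
fired, memory = recognize_act(rules, {"and_gate", "critical"})
```

Here "specific" fires before "general" because it has more conditions, and "finish" becomes applicable only after both have fired.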
In a rule-based system incorporating such schemes, control over which rules are 
applied can only be achieved by adding more conditions to each rule. In R1, an
additional condition was added to each rule corresponding to the stage of the design 
task. The Logic Consultant system builds in limited structure by examining its set 
of applicable rules and making an evaluation of the gain produced by each rule. 
The rule with the largest gain can then be applied on the circuit. Another characteristic
of strictly rule-based systems is the lack of backtracking. Neither R1 nor the
Logic Consultant permits backtracking: once a rule is applied, it will not be
undone.
The Logic Consultant's first optimizer module is the cell selection module. It 
converts the design consisting of TGL components to technology-specific components.
This module uses rules from the knowledge base that specify how one or
more TGL components map into a technology-specific component.
Once a technology-specific design is derived, the design is further optimized by 
a technology-specific optimization module. This module utilizes rules that make 
equivalent circuit transformations to improve area or time. The optimizer chooses
which rules to apply based upon some formula that examines time/area tradeoffs.
For critical paths and the longest path, rules are selected that decrease delay but 
that may increase area. Along non-critical and short paths, the Logic Consultant 
applies rules that decrease area at the expense of time. 
Certain rules may appear to increase area or time but can actually result in a 
decrease by applying "clean up" rules. The Logic Consultant has a classification of 
rules to do this. For example, if a rule adds inverters to the circuit, the optimiza-
tion module will examine a set of high-priority or "clean up" rules to determine if 
any of the high-priority rules can eliminate the inverters from the design. Before 
implementing any rule, the optimizer calculates the result of the rule applied with 
any high-priority rules to determine how the rule will affect area and time. 
When entering rules into the knowledge base, the user indicates whether a rule
is high-priority. The rule classification indicates only that the rule should be exam-
ined to determine whether it can "clean up" after a regular rule application. It does 
not give the high-priority rule preference over a "regular" rule. This high-priority 
rule classification provides a limited lookahead feature of one rule. 
3.3.2. Rules/Algorithms (Mixed Strategy)
The second expert system type makes use of a rule base but employs structuring
through algorithmic control. The algorithm establishes a hierarchy that determines
which set of rules should be examined at any point in time. This added structuring
allows greater control over the type of rules that are eligible to fire. The top
level of the hierarchy examines the blackboard to determine the current stage of 
optimization. Depending on the set of constraints that must be satisfied, various 
strategies may be selected for further investigation. At this stage, more information 
is needed to evaluate the remaining strategies. Lower levels in the hierarchy are
called upon to produce it. From this information, the high-level decisions can be 
reached. 
Lower levels in the hierarchy inspect the blackboard in greater detail and can
choose a single strategy for optimization. Inside each strategy is further hierarchy. 
Similar to the strategy selection, one technique in the strategy must be chosen from 
a number of possibilities. The chosen technique in turn calls upon still lower levels 
of the hierarchy to carry out a design transformation. These lowest levels include 
rules for manipulating data structures in the blackboard, keeping data in the black-
board up to date, and removing old or useless information. 
SOCRATES is an expert system that uses a mixed strategy for optimization. 
The SOCRATES optimizer is written in C and runs on the VAX 11/780. Like the 
Logic Consultant, it consists of rules that make local transformations on the circuit. 
SOCRATES also has procedures to provide feedback on the time and area savings 
produced by a rule. Through these measurements, SOCRATES can choose which 
rule to apply. In addition though, the SOCRATES rule base employs a limited 
hierarchy to reduce the number of rules that must be considered at any given time. 
It organizes the knowledge base into a number of classes, such as timing or area 
optimizing rules. Further, each class of rules is divided into subclasses of related 
rules. For example, one subclass might replace partially redundant multiplexors 
with NAND and NOR gates, and another might combine and synthesize those gates 
[GeCo85]. SOCRATES examines only rules in a particular subclass at any one
time. Each class and subclass of rules is prespecified in a certain order in the
knowledge base. SOCRATES then examines each rule class and subclass in this
order. Optimization in SOCRATES begins with rules that improve both time and 
area. Then rules are applied that optimize time, possibly at the expense of area, 
until all timing constraints are satisfied. Finally, area optimizations are made on 
noncritical paths, possibly at the expense of time, until no other area improvements 
can be made. 
A rule in SOCRATES consists of a target configuration and a replacement 
configuration. These configurations are patterns of components, pins, and nets that 
identify a particular circuit structure. SOCRATES has its own pattern matcher 
that determines which target configurations are present in a design. Like OPS83, it 
contains a recognize-act cycle in which possible rule applications are examined, one 
rule is selected and then applied. Because of the hierarchical rule-base, SOCRATES 
only needs to consider rules in the currently activated rule subclass. To eliminate 
rules in the conflict set, the recognize-act cycle may utilize lookahead. Each rule in 
the conflict set is evaluated by implementing its set of actions, then measuring the 
resulting effect. The rule's future effect (i.e., an examination of the rules that may
be applied after the initial rule application) can be observed via a state search. 
This involves setting up a search tree consisting of future design states. The search 
tree is a graph with the nodes representing possible circuit implementations and the 
arcs representing rule applications. The root of the tree is the current circuit implementation.
Children of nodes in the graph are ordered by desirability, the
leftmost children being derived from "better" rules [CoBa85]. The path through
this state graph producing the lowest cost function is the optimal sequence of rules. 
Construction of the graph is performed in a depth-first manner. After each 
transformation, the results are evaluated by a cost function. If the resulting circuit 
is acceptable the next set of rule applications will be examined, the "best" rule 
selected and its effects determined. The process is repeated until some maximum 
depth is reached in the search tree. If the resulting circuit is deemed "unaccept-
able", SOCRATES backtracks to the node's father and examines alternative circuit 
transformations [GaGr84]. In constructing the search tree, SOCRATES keeps a log 
of changes made to the circuit by each rule application. When backtracking is 
required, the changes to the circuit can be quickly undone by referring to this log. 
Each node of the tree will contain the cost function estimate of the circuit 
implementation. The lower the cost function, the better the circuit. Once the 
search tree has been completely built, it can be traversed to determine which 
sequence of rules is best. Those rules will then be applied. 
Since search trees can become massive and require large amounts of CPU time, 
the designers of SOCRATES introduced metarules that control the size of the 
search tree. Metarules are rules that contain control knowledge, as opposed to 
design knowledge that suggests circuit alterations. The SOCRATES metarules are 
based on the parameters presented in [CoBa85]. The first parameter, B, restricts
the number of sons any node may have. Thus it limits the breadth of the search
tree. The parameter D_max limits the depth of the tree by restricting the number of
consecutive rule applications. The neighborhood control parameter, N, restricts
rule applications to gates within path distance N of some center gate. This
prevents rules that apply to different circuit regions from being considered. The
parameter D_app restricts the rule application depth. Although the search tree
extends to depth D_max, only a portion of some sequence of rules will be executed. A
parameter Δ_class was introduced to limit the number of rule classes to be examined
beyond the current subclass. This variable decreases the number of rules that can
be applied at any given time. Finally, the parameters Δ_cost, R, and S relate to the
cost function. Δ_cost limits the increase to the cost function by a single rule application.
Parameters R and S are used in the cost function to determine the desirability
of applying a particular rule. The cost function takes into account the area
saved and the number of rules that can be applied after a transformation is made.
The weighting of terms in the cost function affects the size of the tree by changing
the desirability of various rules.
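The interplay of the breadth limit, the search depth, and the application depth can be sketched as a bounded depth-first lookahead. This is a minimal Python illustration, not the SOCRATES implementation: circuit states are reduced to plain integers standing for a gate count, the two rules are invented, and because states are immutable values, backtracking is implicit rather than driven by a change log.

```python
def lookahead(state, rules, cost, breadth, d_max):
    """Depth-first search of future states; returns (best cost, rule sequence)."""
    best = (cost(state), [])
    if d_max == 0:
        return best
    children = []
    for name, apply_rule in rules:
        nxt = apply_rule(state)
        if nxt is not None:                      # rule is applicable
            children.append((cost(nxt), name, nxt))
    children.sort(key=lambda child: child[0])    # most desirable child first
    for _, name, nxt in children[:breadth]:      # breadth limit on sons per node
        sub_cost, sub_seq = lookahead(nxt, rules, cost, breadth, d_max - 1)
        if sub_cost < best[0]:
            best = (sub_cost, [name] + sub_seq)
    return best


def optimize_step(state, rules, cost, breadth, d_max, d_app):
    """Search to depth d_max but apply only the first d_app rules found."""
    _, sequence = lookahead(state, rules, cost, breadth, d_max)
    by_name = dict(rules)
    for name in sequence[:d_app]:
        state = by_name[name](state)
    return state, sequence[:d_app]


# Hypothetical rules on a gate count: adding an inverter costs one gate,
# but may enable a merge that saves four.
rules = [
    ("add_inverter", lambda x: x + 1),
    ("merge", lambda x: x - 4 if x % 2 == 0 and x >= 4 else None),
]
gate_count = lambda x: x

shallow = lookahead(7, rules, gate_count, breadth=2, d_max=1)
deep = lookahead(7, rules, gate_count, breadth=2, d_max=2)
new_state, applied = optimize_step(7, rules, gate_count, breadth=2, d_max=2, d_app=1)
```

With a depth of one the search finds nothing worth applying, since the only applicable rule raises the cost. With a depth of two it finds that the cost-increasing rule enables a larger saving, which is the kind of sequence that lookahead is meant to uncover.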
Initially SOCRATES used fixed values for these parameters, regardless of the 
optimization phase. However, the ideal parameters vary greatly over the course of 
optimization. For example, greater lookahead is required for area-saving rules than 
general rules. Also, little or no lookahead is required for the most powerful rules 
[GeCo85]. Thus based on the state of the optimization, metarules determine what 
values the control parameters should have. These rules supervise the control 
module and dynamically vary the parameters. Such a technique permits more 
selective use of lookahead. Control parameters can be changed depending on the 
rule class, rule subclass, or even an individual rule. 
Through this process of lookahead, the best sequence of rule applications can 
be determined. By examining the future effects of a rule, SOCRATES can determine 
whether the decision to apply a rule was good. Poor decisions can be undone
through backtracking and other rules can be considered. The designers of 
SOCRATES report this approach produces superior results to those where only one 
rule is examined and then applied [CoBa85]. Their examples indicate that the use 
of lookahead without metarules required roughly four times longer on average to 
run, producing designs with 12 percent less area on average. Adding metarules only 
doubled the run time and still provided the same decrease in area. 
LSS is a system that also employs limited hierarchy as it performs optimization 
at each of its four levels of representation. The LSS refinement stage produces an 
initial graph with AND/OR gates, registers, decoders, etc. The first level of optimization
is performed on this network. It uses transformations that reduce the gate-level
logic and transformations that use information about a high-level component
to reduce the surrounding gates. Figure 7 demonstrates transformations of this type
from [JoTr86]. The first rule uses knowledge about a decoder to eliminate the OR
gate. The second rule in Figure 7 is a simple logic reduction transformation.
Figure 7. LSS Level 1 Optimizations (A: decoder transformation, decoders with
inputs I0, I1 and outputs A0 to A3; B: logic reduction)
Once all level one transformations have been performed, the high-level components
can be expanded into AND/OR/NOT gates. This forms the second
description type, the AND/OR level. At the AND/OR level, transformations perform
AND/OR simplification, common subexpression elimination, and constant propagation
(i.e., OR(a,1) = 1, AND(a,1) = a) [DaJo81].
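Constant propagation of the kind quoted above can be sketched as a bottom-up rewrite of an expression tree. The Python below is an illustration only, not LSS code; the tuple representation (variables as strings, constants as the integers 0 and 1) and the function names are assumptions.

```python
def is_const(value, bit):
    return isinstance(value, int) and value == bit


def simplify(expr):
    """Bottom-up constant propagation over AND/OR/NOT expressions."""
    if not isinstance(expr, tuple):
        return expr                      # a variable name or a constant
    op, *args = expr
    args = [simplify(a) for a in args]
    if op == "NOT":
        (arg,) = args
        return 1 - arg if isinstance(arg, int) else ("NOT", arg)
    if op not in ("AND", "OR"):
        raise ValueError(f"unknown operator {op!r}")
    # AND: 0 forces the output, 1 can be dropped. OR: the reverse.
    dominant, identity = (0, 1) if op == "AND" else (1, 0)
    if any(is_const(a, dominant) for a in args):
        return dominant
    rest = [a for a in args if not is_const(a, identity)]
    if not rest:
        return identity
    return rest[0] if len(rest) == 1 else (op, *rest)


examples = [
    simplify(("OR", "a", 1)),                    # OR(a,1) = 1
    simplify(("AND", "a", 1)),                   # AND(a,1) = a
    simplify(("AND", ("OR", "a", 1), "b")),      # constant propagates upward
    simplify(("OR", ("AND", "a", 0), "c")),
]
```

Because the rewrite works from the leaves upward, a constant produced by one gate can remove further gates above it.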
The third level, NAND/NOR, introduces some technology considerations.
Depending on the technology, the design will be converted to one consisting entirely 
of generic NAND and NOR gates. The same type of transformations that were per-
formed at the AND/OR level are applied at this level, only with NAND/NOR 
simplifications in mind this time. Transformations at this level attempt to 
incorporate technology tables that supply information on generic NAND/NOR
gates. For example, information on each generic primitive (such as a NAND3) is
maintained on its size, driving capability, delay, fan-in, etc. [JoTr86]. Hence at the
NAND/NOR level it is possible to make decisions about what type of NAND/NOR
gates should be used. For example, LSS can tell whether to use three-input NANDs
or two-input NANDs to reduce area.
The final level of description is technology specific. Transformations convert 
generic components to components from a technology library that may include complex
gates such as AND/OR, multiplexors, etc. The transformations also enforce
technology constraints such as fan-in and fan-out. To deal with complex gates, LSS
makes use of tables that list the components available in each complex gate type
(i.e., AND/OR, decoder, OR/AND). In each category the components are ordered
according to the savings created. When conflicts arise as to which transformation 
to apply, the one with the largest gain is selected based on this ordering. 
The transformations performed at each level in LSS are executed through PL/I 
procedures that manipulate the logic graph. Transformations at the final two levels 
make use of a number of technology tables that can be readily updated for a new 
technology. It has been reported [JoTr86] that the use of local transformations as
opposed to two-level boolean minimization tends to keep synthesis times linear for 
increasing design sizes. For circuits of 200 to 2000 two-input equivalent gates, a 
time of one second for roughly nine gates was reported (based on IBM 3081 CPU 
time). 
3.3.3. Algorithms Only Strategy 
The final type of expert system is entirely algorithmic. It uses an algorithmic 
controller to determine how to apply algorithms in the knowledge base. Such a system
always performs the same operations in the same order.
An example of an algorithmic system is MIS [BrRu87]. MIS is a set of heuristic
algorithms that perform minimization and factorization. MIS optimizes boolean 
equations through a sequence of commands that can be entered interactively or con-
trolled automatically via a user-written script that consists of a series of MIS com-
mands. 
Area optimization in MIS is carried out using three techniques: 1) extraction,
2) resubstitution, and 3) phase assignment. Extraction is the process of generating 
common factors from a set of logic equations. For example, consider the equation: 
f = ad + bcd + e.
It has common divisor g = a + bc that can be extracted to produce f = gd + e, and common
divisor h = a + b that can be extracted to produce f = hd(a + c) + e. Possible common
factors are generated by a set of heuristics. Area is estimated by counting the
number of literals in all of the boolean equations. The area improvement obtained
by extracting a particular common factor is then the number of literals saved. 
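The literal-count estimate can be illustrated with a small sketch. It is not MIS code: the network representation (each equation a list of cubes, each cube a string of one-letter literals), the function names, and the second function f2 = ap + bcp are assumptions introduced only so that the divisor g = a + bc is shared by two equations.

```python
from itertools import product

def literal_count(network):
    """Area estimate: total number of literals over all equations."""
    return sum(len(cube) for cubes in network.values() for cube in cubes)

def evaluate(network, order, inputs):
    """Evaluate the equations in the given order and return all signal values."""
    env = dict(inputs)
    for name in order:
        env[name] = any(all(env[lit] for lit in cube) for cube in network[name])
    return env

# f1 = ad + bcd + e (from the text); f2 = ap + bcp (hypothetical).
BEFORE = {"f1": ["ad", "bcd", "e"], "f2": ["ap", "bcp"]}
# After extracting the common divisor g = a + bc.
AFTER = {"g": ["a", "bc"], "f1": ["gd", "e"], "f2": ["gp"]}

def same_outputs():
    """Exhaustively check that extraction did not change f1 or f2."""
    for bits in product([False, True], repeat=6):
        inputs = dict(zip("abcdep", bits))
        old = evaluate(BEFORE, ["f1", "f2"], inputs)
        new = evaluate(AFTER, ["g", "f1", "f2"], inputs)
        if (old["f1"], old["f2"]) != (new["f1"], new["f2"]):
            return False
    return True

# Literals saved by the extraction.
saved = literal_count(BEFORE) - literal_count(AFTER)
```

Under these assumptions the flat network has 11 literals and the network with g extracted has 8, so the estimated improvement is 3 literals.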
Because MIS uses heuristics to identify common factors, it may not find all factors
that can be extracted. For this reason, resubstitution is used to ensure that all
common factors of a special type are identified. Resubstitution is the process of 
checking whether an existing function is a factor of another function in the network. 
For example, consider the two functions: 
x = ac + ad + bc + bd + e
y = a + b.
Function y is a divisor of x and x can be rewritten as:
x = y(c + d) + e.
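The divisor check behind resubstitution can be sketched with algebraic (weak) division. This is an illustrative sketch rather than the MIS implementation; the cube representation (frozensets of one-letter literals) and the helper names are assumptions.

```python
def sop(text):
    """Parse 'ac + ad + e' into a set of cubes (frozensets of literals)."""
    return {frozenset(term.strip()) for term in text.split("+")}

def weak_divide(f, d):
    """Algebraic division of f by d.

    Returns (quotient, remainder) such that f = quotient * d + remainder,
    where the product is the algebraic (cube-by-cube) product.
    """
    quotient = None
    for d_cube in d:
        # Cubes of f that contain d_cube, with d_cube's literals removed.
        partial = {cube - d_cube for cube in f if d_cube <= cube}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    covered = {q | d_cube for q in quotient for d_cube in d}
    remainder = set(f) - covered
    return quotient, remainder

x = sop("ac + ad + bc + bd + e")
y = sop("a + b")
quotient, remainder = weak_divide(x, y)
```

For the two functions above the quotient is c + d and the remainder is e, so y is a divisor of x and x can be rewritten as y(c + d) + e. An empty quotient would indicate that the existing function is not an algebraic divisor.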
Phase assignment attempts to reduce the total number of inverters in the 
design without increasing the size of other functions. Essentially, this involves 
determining whether a function or its complement requires fewer inverters. This 
includes checking intermediate functions (i.e., subfunctions) for inverter reductions.
Another algorithm-based system is DAGON [Ke87]. DAGON performs technology
binding and can optimize for time, area, or some function of the two. It utilizes
programming-language compiler techniques that are strictly algorithmic. In
doing so, DAGON can guarantee locally optimal matches over several thousand pat-
terns. 
Similar to language compilers, DAGON matches a graph description of a 
technology-independent circuit against a technology library consisting of numerous 
patterns. The problem is viewed as finding the best technology patterns to cover a
directed acyclic graph (commonly termed DAG). A DAG is a graph containing no 
cycles that consists of nodes and directed edges. DAGON builds a DAG for boolean 
functions using nodes to represent AND and OR operators and edges labeled 0 or 1 
to indicate a true or inverted output (thereby providing NAND and NOR operators
as well). 
A globally optimal solution to the DAG covering problem could be generated 
by comparing all possible technology implementations (each a collection of patterns 
from a technology library) for time and area. However, such a problem is NP-
complete. Therefore, DAGON's first step is to partition the graph into trees. This 
is accomplished by making every component in the graph whose fanout is greater 
than one, the root of a new tree. DAGON may then find the minimal cost technol-
ogy pattern for each tree, producing a locally optimal solution. 
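The partitioning step can be sketched as follows. This is a minimal illustration, not DAGON's code: the netlist representation (a dictionary from each gate to its fanin signals), the gate names, and the treatment of primary outputs as additional roots are assumptions.

```python
from collections import defaultdict

def partition_into_trees(fanins, outputs):
    """Split a DAG into trees rooted at multi-fanout gates and primary outputs.

    fanins:  dict mapping each gate to the list of signals feeding it;
             primary inputs do not appear as keys.
    outputs: set of gates that drive primary outputs.
    Returns a dict mapping each root to the set of gates in its tree.
    """
    fanout = defaultdict(int)
    for sources in fanins.values():
        for src in sources:
            fanout[src] += 1
    roots = {g for g in fanins if g in outputs or fanout[g] > 1}
    trees = {}
    for root in roots:
        members, stack = {root}, list(fanins[root])
        while stack:
            node = stack.pop()
            if node in roots or node not in fanins:
                continue  # another tree's root or a primary input: a leaf here
            members.add(node)
            stack.extend(fanins[node])
        trees[root] = members
    return trees

# Hypothetical netlist: n1 feeds both n2 and n3, so it becomes its own root.
NETLIST = {"n1": ["a", "b"], "n2": ["n1", "c"], "n3": ["n1", "d"], "n4": ["n2", "e"]}
TREES = partition_into_trees(NETLIST, outputs={"n3", "n4"})
```

In this example n1 has a fanout of two and therefore roots its own tree, while n2 is absorbed into the tree rooted at n4. Each tree can then be covered independently.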
Finding a minimal cost match for a tree consists of two major tasks: recognizing 
the set of possible matches and determining the pattern from that set providing the
minimal cost (in terms of time and area constraints). Finding applicable pattern 
matches is performed by twig [Tj86], a program designed to construct code genera-
tors for programming language compilers. It generates the set of matches in 
O(TREE SIZE) time. From this set the minimal cost match can be found using a 
recursive algorithm that determines the least cost match for each subtree. 
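The recursive least-cost selection can be sketched as below. The sketch does not reproduce twig's linear-time matcher; it substitutes a naive recursive pattern matcher, and the four-cell library with its area costs is hypothetical.

```python
from functools import lru_cache

# Trees and patterns are nested tuples: ("INV", x) or ("NAND", x, y).
# A string in a pattern is a variable matching any subtree; a string in a
# subject tree is a leaf signal with zero cost.
LIBRARY = [  # (cell name, hypothetical area cost, pattern)
    ("INV", 2, ("INV", "x")),
    ("NAND2", 3, ("NAND", "x", "y")),
    ("AND2", 4, ("INV", ("NAND", "x", "y"))),
    ("OR2", 4, ("NAND", ("INV", "x"), ("INV", "y"))),
]

def match(pattern, tree):
    """Return the subtrees bound to the pattern's variables, or None."""
    if isinstance(pattern, str):
        return [tree]
    if isinstance(tree, str) or tree[0] != pattern[0] or len(tree) != len(pattern):
        return None
    bound = []
    for sub_pattern, sub_tree in zip(pattern[1:], tree[1:]):
        sub = match(sub_pattern, sub_tree)
        if sub is None:
            return None
        bound += sub
    return bound

@lru_cache(maxsize=None)
def best_cover(tree):
    """Least total cost of covering tree, and the cells chosen (root first)."""
    if isinstance(tree, str):
        return 0, ()
    best = None
    for name, cost, pattern in LIBRARY:
        inputs = match(pattern, tree)
        if inputs is None:
            continue
        total, cells = cost, (name,)
        for sub in inputs:  # cover each subtree left under the matched pattern
            sub_cost, sub_cells = best_cover(sub)
            total += sub_cost
            cells += sub_cells
        if best is None or total < best[0]:
            best = (total, cells)
    return best

# (a OR b) AND c expressed with NAND and INV nodes.
SUBJECT = ("INV", ("NAND", ("NAND", ("INV", "a"), ("INV", "b")), "c"))
```

With these costs the cover found for SUBJECT is an AND2 fed by an OR2 at a total cost of 8, compared with 9 for covering the root with a separate inverter.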
3.4. Optimization Strategies 
The quality of a synthesized design can vary dramatically depending upon the 
strategies used to optimize it. Strategies can be examined at both a high and a low 
level. At a high level there are essentially two types of decisions for which a stra-
tegy must be chosen. They are: what to optimize for and what subsection of the 
overall design to optimize at any given time. At a low level there are also two types 
of decisions: what types of transformations or algorithms to apply and what order to 
apply them to the specified subcircuit. 
The first decision depends heavily upon the user-entered constraints. Typically 
three types of constraints are entered: time, area, and power. The optimizer must 
choose the order (if any) in which to optimize for the constraints and which constraint(s)
should carry the greatest weight in directing the optimization. 
The second decision concerns the portion of the design toward which optimiza-
tions should be directed. One could simply search the entire design for all the rules 
that are applicable and select one to apply. This has the effect of applying transfor-
mations randomly throughout the design. Generally the technique is time consum-
ing and produces less than optimal results. Human designers break up large circuits 
into smaller ones, making optimization more manageable and allowing more control 
over the course of the optimization. This approach is best for optimization pro-
grams as well. By focusing on only a small section of the design, one not only 
reduces the number of rules that must be considered but also allows a rule's effect 
to be more closely examined -- one need only consider its effect in the subcircuit to 
know what happens in the larger design. 
When performing timing optimizations one tends to select a path-oriented 
approach. Thus the subcircuit might be a critical path. Area optimizations are not 
path directed except for avoiding critical or near-critical paths. In this case, critical 
paths could be removed to eliminate the time wasted by considering rules that 
affect some component that is part of a critical path. 
The third decision involves the specific rules or algorithms that can be applied. 
This decision is in part based upon the earlier issue of what to optimize for. Cer-
tain techniques work best when optimizing for time, others when optimizing for 
area. As an example, consider a rule that greatly decreases delay but improves area 
only incrementally. Clearly such a rule is best suited for use in timing directed 
optimizations. An alternative rule that produces greater area improvements can be 
used when optimizing for area. 
The final decision involves the order in which optimization techniques should 
be applied. The use of lookahead can aid greatly in assuring that the "best" rules 
will be used. Lookahead, however, can be quite time consuming and simply consid-
ering certain rules before others may be a better solution in some cases. Another 
example of the use of ordering can be seen in the use of algorithms. For instance, 
some algorithmic techniques, such as collapsing a network into two levels to reduce 
delay, require a great deal of time and effort. Yet if only a small improvement is 
required, the effort is, in effect, wasted. Another technique, perhaps a rule-based
approach producing smaller gain, could have been used.
3.4.1. Control Strategies for Timing Optimization 
Different control strategies are required for timing optimization. For example,
critical paths whose delays differ greatly from user specifications tend to require
some type of circuit restructuring. In contrast, those critical paths that are close to
the specifications may require that only a few gates be replaced or only a small portion
of the critical path be modified.
3.4.2. Types of Strategies 
There are a number of different strategies that can be used to improve speed. 
These are illustrated in Figure 8. Strategy 1 swaps equivalent signals on the same 
component. For example, the 2-input AND gate in Figure 8a has a different delay 
from each input to the output. Thus the critical path should be connected to the 
input with the shortest delay. Strategy 1 can be implemented by examining a tech-
nology file that contains timing information on each component. This strategy has 
no cost as it does not change any circuit elements. However, the gain produced is 
small. Thus strategy 1 is useful when the slack (difference between the actual and 
required times) is extremely small or when strategies that produce larger gains have 
been exhausted. 
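Strategy 1 can be sketched as a simple pairing of signals with pins. The function name, the dictionaries standing in for the technology file, and the arrival times are assumptions; the pin delays of 0.5 ns and 0.7 ns are illustrative values for a two-input gate. The pairing is valid only for logically equivalent (commutative) inputs.

```python
def assign_pins(arrival, pin_delay):
    """Connect the latest-arriving signal to the fastest pin.

    arrival:   dict signal -> arrival time at the gate inputs (ns)
    pin_delay: dict pin -> pin-to-output delay from the technology file (ns)
    Returns (assignment signal -> pin, arrival time at the gate output).
    """
    signals = sorted(arrival, key=arrival.get, reverse=True)  # latest first
    pins = sorted(pin_delay, key=pin_delay.get)               # fastest first
    assignment = dict(zip(signals, pins))
    out_time = max(arrival[s] + pin_delay[p] for s, p in assignment.items())
    return assignment, out_time

PIN_DELAY = {"A": 0.5, "B": 0.7}               # illustrative pin-to-output delays
ARRIVAL = {"critical": 3.0, "other": 1.0}      # hypothetical arrival times
ASSIGNMENT, OUT_TIME = assign_pins(ARRIVAL, PIN_DELAY)
```

Here the critical signal is connected to pin A, giving an output arrival time of 3.5 ns instead of the 3.7 ns obtained with the opposite connection. No circuit element changes, which is why the strategy has no cost.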
[Figure 8. Techniques for Reducing Delay Along Critical Paths
(a) Swap equivalent signals on the same component (A -> Out .5 ns, B -> Out .7 ns)
(b) Replace macro with one of higher speed (.7 ns at power .65 mW replaced by .5 ns at power .75 mW)
(c) Factor
(d) Better macro selection that does not increase area or power
(e) Duplicate logic
(f) Better macro selection at the expense of area/power
(g) Minimize
(h) Duplicate logic with multiplexor]
Strategy 2 replaces a low-power or standard-power macro with a high-power
macro of greater speed. This type of strategy replaces a macro with another that
uses larger transistors to provide faster switching times. Strategy 2 can be implemented
by examining the technology file to determine if the same macro exists with
higher power and higher speed. The strategy is similar to strategy 1 as it produces only
a small gain. Thus it would be used in the same situations as strategy 1. However, 
since strategy 2 increases the power, strategy 1 is preferred. 
Strategy 3 employs factorization to reduce the delay of a critical path. Figure
8(c) shows how a four-input AND gate can be factored to speed up path D. It can be
implemented with a factorization program such as that found in MIS [BrRu87] or
ESPRESSO [Br84]. The gain is typically small, though greater gains are achieved
for larger gates. In addition, the gains from applying factorization along the entire
critical path can add up. Power generally increases but the effect on area may vary.
Strategy 3 tends to produce a slightly larger gain than strategies 1 or 2. Hence, it will
be used when the slack is small.
Strategy 4 attempts to make a better macro selection that reduces time 
without an increase in area or power. This strategy may utilize a rule base contain-
ing only rules that will produce a gain for no cost. For instance, a rule could be 
written in OPS83 to perform the transformation of Figure 8(d). An alternative
method is to use a hash table. Lookup in the hash table is accomplished through a
key that is the truth table entry for a particular function. The hash table is typically
limited to entries of up to five variables, making each hash table key a maximum
of 32 bits -- a common computer word. Certain transformations cannot be
represented by a hash table. Examples include functions with multiple outputs, such as
a decoder, and circuits with sequential logic elements, such as a counter. In these
cases it is necessary to use the rule-based approach. A hash table has an advantage
over the rule-based approach in that fewer transformations need to be entered. For 
example, Figure 9 shows two different implementations of a multiplexor that can be 
represented by the same hash table entry, but require two separate rules. Another 
advantage of hash table lookup is speed. It requires only one table lookup per func-
tion. Of course time is required to find a circuit's function but this may require less 
time than having to search through a set of rules to determine which are applicable.
Since Strategy 4 has no cost, it will be used before strategies 2 and 3, and will be 
the first strategy examined for moderate gain. 
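The hash table lookup can be sketched as follows. The key packing order, the function names, and the macro name MUX2 are assumptions made for illustration; only the idea of using the truth table as a key of at most 32 bits comes from the text.

```python
from itertools import product

def truth_table_key(func, n_vars):
    """Pack the truth table of func into an integer key.

    Bit i of the key holds the output for the input row whose binary
    encoding is i (first argument is the most significant bit).
    """
    if n_vars > 5:
        raise ValueError("keys are limited to 32 bits (five variables)")
    key = 0
    for row, bits in enumerate(product([0, 1], repeat=n_vars)):
        if func(*bits):
            key |= 1 << row
    return key

def mux_and_or(i0, i1, sel):
    """Multiplexor built from AND/OR gates and an inverter."""
    return (i0 and not sel) or (i1 and sel)

def mux_nand(i0, i1, sel):
    """The same multiplexor built only from NAND gates and an inverter."""
    nand = lambda x, y: not (x and y)
    return nand(nand(i0, not sel), nand(i1, sel))

# One table entry covers every structure that computes this function.
REPLACEMENTS = {truth_table_key(mux_and_or, 3): "MUX2"}  # hypothetical macro name

def lookup(func, n_vars):
    """Return the replacement macro for a circuit function, if any."""
    return REPLACEMENTS.get(truth_table_key(func, n_vars))
```

Both multiplexor structures produce the same key, so a single table entry serves where the rule-based approach would need one rule per structure.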
Strategy 5 duplicates logic along a critical path, thereby doing the reverse of 
common term factorization. Like strategy 4, it can be implemented using either a 
rule-based or hash table approach. The gain from strategy 5 is typically small and 
hence the strategy would be applicable at the same time as strategies 1 - 3. As the 
cost in area and power tends to be greater than strategies 1 - 3, strategy 5 would be 
the last examined. 
Strategy 6 is similar to strategy 4 except a better macro selection is made with
an increase in area and/or power. It can make use of a hash table or rule-based
approach as well. Typically a moderate gain can be achieved through strategy 6
with a small to moderate increase in power and area. It is often
considered for moderate slack improvements or for large slack improvements after
large gain strategies have been examined.
[Figure 9. Use of a Hash Table to Reduce the Number of Rules. Two implementations of a 2-to-1 multiplexor share the hash table entry below.]
I0  I1  Sel | F
0   0   0   | 0
0   0   1   | 0
0   1   0   | 0
0   1   1   | 1
1   0   0   | 1
1   0   1   | 0
1   1   0   | 1
1   1   1   | 1
Strategy 7 is the foundation of both SOCRATES and the Logic Consultant. It 
expands the design into two-level SOP form then minimizes by removing redundant 
terms. This strategy is often combined with strategy 3 to factor the circuit along 
non-critical paths and take advantage of common terms. This strategy is the most 
time consuming but can produce large gains, often with no increase or only a small 
increase in area and power. Thus this strategy is the preferred method for improving
large slacks.
Strategy 8 is one discussed in [JoMc87]. It duplicates logic that strategy 5
cannot by adding a multiplexor. Figure 8(h) has a function FOUT = F(A,B,C) where
C -> FOUT represents the critical path. The logic network may be duplicated with 
the C input connected to GND in one, and VDD in the other. The real C input is 
then hooked up to the select input of a multiplexor. The gain from Strategy 8 is 
large but so is the cost if little optimization can be performed after increasing the 
amount of logic. New circuit elements are added which increases both area and 
power. Hence, strategy 8 will be examined for a large slack but will be considered 
after less costly strategies have been attempted. 
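The restructuring in strategy 8 is an instance of Shannon expansion about the critical input, which can be checked with a short sketch. The function names and the example function F are hypothetical; the sketch models only logical behavior, not delay.

```python
from itertools import product

def duplicate_with_mux(f):
    """Return a function equivalent to f(a, b, c) in which the late input c
    drives only the select line of a 2-to-1 multiplexor."""
    def restructured(a, b, c):
        f_c0 = f(a, b, 0)              # copy of the logic with C tied to GND
        f_c1 = f(a, b, 1)              # copy of the logic with C tied to VDD
        return f_c1 if c else f_c0     # multiplexor selected by the real C
    return restructured

def example_f(a, b, c):
    """Hypothetical FOUT = F(A, B, C)."""
    return (a ^ c) & (b | c)

FAST_F = duplicate_with_mux(example_f)

def equivalent(f, g, n_vars=3):
    """Exhaustively compare two functions of n_vars binary inputs."""
    return all(f(*bits) == g(*bits) for bits in product([0, 1], repeat=n_vars))
```

Because each copy sees a constant in place of C, it can usually be simplified further. The cost noted above arises when little of that simplification is possible.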
3.4.3. Choosing a Strategy
The control strategy can be changed depending on how far the critical path is 
from the timing constraints. When the time difference is small, a local optimization
can be attempted using some combination of strategies 1 - 4. Rules from three 
different categories can be examined: those that do not increase area and power, 
tradeoff rules that improve speed but increase either area or power, and tradeoff 
rules that increase both area and power. Those rules in category 1 will be considered
before examining rules in categories 2 or 3. Similarly category 2 rules are
preferred over those in category 3. 
Within each rule category, the smallest rule with the maximum gain will be
chosen. Thus rules involving only two inputs will be examined first. If one of them 
meets the timing specifications, that rule will be applied. Otherwise rules with 
three inputs must be examined. This process continues until all rules have been 
examined at which time a new strategy must be applied. 
When the time difference is great or all other strategies fail, the circuit can be 
minimized into a two level circuit using strategy 7. It can then be expanded 
through weak division into multiple levels as in strategy 3. 
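One possible reading of this selection order is sketched below. The Rule record, its field names, and the example gains are assumptions; in particular, the sketch assumes that all rule sizes in category 1 are exhausted before category 2 is considered, which the text does not state explicitly.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    category: int    # 1: no area/power increase, 2: increases one, 3: increases both
    n_inputs: int    # size of the rule
    gain_ns: float   # estimated delay reduction

def choose_rule(rules, needed_gain_ns):
    """Scan categories 1 to 3 and, within each, rules of increasing size.

    Returns the maximum-gain rule of the first (category, size) group that
    meets the timing requirement, or None if no rule does.
    """
    for category in (1, 2, 3):
        in_category = [r for r in rules if r.category == category]
        for size in sorted({r.n_inputs for r in in_category}):
            group = [r for r in in_category if r.n_inputs == size]
            best = max(group, key=lambda r: r.gain_ns)
            if best.gain_ns >= needed_gain_ns:
                return best
    return None  # caller switches strategy, e.g. two-level minimization

RULES = [  # hypothetical rule base
    Rule("swap_pins", 1, 2, 0.1),
    Rule("better_macro", 1, 3, 0.4),
    Rule("high_power_macro", 2, 2, 0.6),
]
```

A None result corresponds to the case in which all rules have been examined and a new strategy, such as strategy 7, must be applied.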
CHAPTER 4. 
THE MILO SYSTEM
4.1. Introduction 
This chapter discusses some of the previous approaches in converting microar-
chitecture designs to a technology-specific design and presents some of the unique 
features of the microarchitecture optimizer in the MILO system. In addition, the 
synthesis framework in which MILO is used is described. 
4.2. Previous Work 
Various approaches have been taken to convert the microarchitecture design 
into a design that can be passed to a layout tool. Some tools extract the behavior 
as a set of boolean equations and flip-flops, then rely heavily on logic synthesis tools 
to reduce the logic and make an efficient design [Br86] [StMu86] [TsWe88]
[WeRo88]. They employ logic generators that produce a design of generic logic
gates from each microarchitecture component's description, then use tools such as 
[BrRu87] to reduce the number of gates in a component, restructure critical paths, 
and map the design into a particular standard cell or gate-array library. The design 
can then be passed to a standard cell or gate-array layout tool. 
SILC [GuPa90] includes a component rearchitecting step that selects a different
style of architecture for components along a critical path. For example, a ripple 
carry adder can be converted to a carry-lookahead adder or something in between 
to improve the speed. Thus the style of component can be changed after logic 
optimization fails to meet the necessary constraints. 
Other behavioral synthesis tools designed for datapath generation can base 
their architecture on a standard cell layout or a bit-sliced layout [TrDi89]. Optimization
is carried out for that particular layout style. Still another approach is to
construct the design using off-the-shelf components [BiBr88] including microprocessors,
DMA controllers, dynamic RAMs, etc.
Numerous tools for behavioral and logic synthesis have been previously 
reported. This chapter describes a system that fills the gap between the behavioral 
synthesis tools and logic synthesis tools by using a microarchitecture optimizer. 
Behavioral synthesis tools often use estimators in design refinement. These estima-
tors may be technology independent and are usually not accurate enough to make 
decisions for fine tuning the microarchitecture design. On the other hand, logic 
synthesis tools can accurately gauge area and time but operate at too low a level
to adequately make microarchitecture modifications. 
In chapter 6, techniques are introduced for improving the microarchitecture 
structure and for employing constraint driven synthesis based on the user's 
requirements for time and area. These techniques include the capability for mixing 
layout styles such as custom layout for random-logic components and bit-slicing for 
regularly structured components. In this manner the entire design, control logic
and datapath, can be optimized at the same time. Further, a new methodology is 
presented for microarchitecture-level optimization that greatly reduces the amount 
of technology-specific knowledge necessary to perform the optimizations. Microar-
chitecture components are generated by a database based on a set of parameters 
from the microarchitecture optimization tool. Thus the microarchitecture optimizer 
does not need to deal with multiple logic optimization tools, layout module genera-
tors, transistor sizing tools, etc. 
Often the structures produced by behavioral synthesis tools contain 
inefficiencies such as constants that can be propagated through a design, and com-
mon subexpressions that appear multiple times in the design, each time with repli-
cated hardware. These can partly result from the fashion in which the user wrote 
the behavioral description. Also the design needs to be directed towards a certain 
set of constraints for time and area. Tradeoffs must be made along different paths. 
On critical paths optimizations that reduce time are required, possibly at the 
expense of increased area. Non-critical path optimizations attempt to reduce area 
as long as doing so does not create a new critical path. In performing these
tradeoffs, the microarchitecture optimizer can select a different architectural style 
for the component, merge components and reoptimize their logic, insert buffers to 
improve drive capability, replace a set of components with a single component that 
performs the same function but more closely meets the constraints, restructure com-
ponents to reduce delay (such as factoring multiplexors), duplicate logic to reduce 
delay, or change the layout style of the component. These types of improvements
are nearly impossible to pursue once the design has been expanded into lower-level
logic.
4.3.
[...] synthesizes [...] from VHDL behavioral descriptions.
The system architecture is shown in Figure 10. It consists of six major
pieces: a component database, logic optimization tools, a behavioral synthesis tool, a
technology mapper, a microarchitecture optimizer, and a floorplanning/layout system.
4.3.1. Component Database
The central system tool is a component database [ChGa90]. It supplies com-
ponents and statistics on components to the synthesis tools. Synthesis tools can 
pass a set of parameters and specifications to the database and then receive a list of 
components that meet the requirements. Parameters include the component type 
(e.g., ALU, Counter, MUX), number of inputs, clock type (rising-edge, falling-edge),
etc. Specifications include the load that each output pin must drive, the maximum 
delay to each output pin, and an area requirement. 
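The distinction between parameters and specifications can be sketched with a small in-memory stand-in for the database. Everything here is hypothetical: the Component record, the query function, and the sample entries are illustrative and do not reflect the actual database interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Component:
    name: str
    ctype: str            # parameter: component type, e.g. "ALU", "MUX"
    n_inputs: int         # parameter: number of inputs
    max_delay_ns: float   # worst delay to any output pin
    area: float

def query(components, ctype, n_inputs, max_delay_ns=None, max_area=None):
    """Return components that match the parameters and meet the optional
    specifications, ordered from smallest to largest area."""
    hits = [
        c for c in components
        if c.ctype == ctype and c.n_inputs == n_inputs
        and (max_delay_ns is None or c.max_delay_ns <= max_delay_ns)
        and (max_area is None or c.area <= max_area)
    ]
    return sorted(hits, key=lambda c: c.area)

DB = [  # hypothetical entries
    Component("mux4_small", "MUX", 4, 2.0, 10.0),
    Component("mux4_fast", "MUX", 4, 1.2, 16.0),
    Component("alu8", "ALU", 2, 6.0, 120.0),
]
```

A query with parameters only returns every matching component, while adding a delay or area specification narrows the list to those that meet it.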
[Figure 10. MILO Environment. Blocks: Component Database, Layout Module Generators, Logic Optimization Tools, VHDL Behavioral Description, Behavioral Synthesis, Generic Microarchitecture Netlist, Technology Mapper, Microarchitecture Optimizer (MILO), Floorplanning/Layout. Edge labels: Component Parameters; Generic Components (GENUS Library); Technology-Specific GENUS Component; Component Parameters + Component Specifications; Technology-Specific Component; Component Layout Style; Layout of Component.]
The component database contains a library of logic generators that produce a 
boolean equation representation that describes the low-level behavior of the com-
ponent. One or more generators can be selected based on the parameters supplied 
by the synthesis tool. The boolean equations include constructs for describing 
sequential logic so that logic generators for components such as registers and 
counters can be constructed. The boolean description is passed to a logic optimizer 
[VaGa88] with a set of time constraints. The logic optimizer produces a
technology-specific design using components from a designated library or can gen-
erate complex gates and select transistor sizes for use in a custom layout. The logic 
optimizer produces a report file listing delays and area. This information can be 
passed to synthesis tools when they request such information about a component. 
The database also contains knowledge about components that can be produced 
by layout module generators. Estimators provide data on delay times and area 
based on the bit-width. 
4.3.2. Behavioral Synthesis 
A behavioral synthesis tool [LiGa89] accepts a VHDL behavioral description
and produces a VHDL structural netlist consisting of generic components from 
GENUS [Dutt88], a library of generic microarchitecture components. One special 
property of GENUS components is the use of one control line per function. Thus a 
four-bit multiplexor has four data-in lines and four select lines -- one to control each 
data line. In an ALU, there are separate control lines for ADD, SUBTRACT, AND, 
OR, etc. This component property removes the problem of control encoding from 
behavioral synthesis as component encodings may depend on a particular technology
library. If necessary, control encoding can be performed later during technology
mapping.
The behavioral synthesis tool begins by converting the input description into a 
dataflow graph. A graph critic then operates on the dataflow graph, removing
redundancies in the behavioral description. The behavioral operators are then 
bound to GENUS components. The final architecture produced by the behavioral 
synthesis tool consists of random logic blocks of control logic and a datapath con-
taining components such as ALUs, shifters, and registers. 
The components in the generic netlist are converted to technology-specific com-
ponents by a technology mapper. The technology mapper queries the database by 
providing the set of component parameters. The database returns one or more com-
ponents that meet the specified parameters. From this set of components the tech-
nology mapper selects the component that contains the smallest set of functions 
required. For example, if a component with the ADD and SUBTRACT functions is 
requested, the database may return two components: an ADD /SUBTRACT unit 
and an ALU. The technology mapper would select the ADD/SUBTRACT unit.
Since the technology mapper does not pass a set of timing or area constraints to the
database, the database will return the most area efficient design. Currently the 
technology mapper maps generic components only into components that are implemented
from gates and optimized by the logic optimizer. Later implementations
will include mappings of other types of components, such as those from layout
module generators. In any event, these types of components are currently inserted 
later, during the microarchitecture optimization phase if appropriate. 
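The technology mapper's selection rule can be sketched in a few lines. The function name and the candidate dictionary are assumptions; the example mirrors the ADD/SUBTRACT case described above.

```python
def select_component(candidates, required):
    """Pick the candidate that implements every required function while
    containing the fewest functions overall.

    candidates: dict component name -> set of functions it implements
    required:   set of functions requested by the generic component
    """
    suitable = {name: funcs for name, funcs in candidates.items() if required <= funcs}
    if not suitable:
        return None
    return min(suitable, key=lambda name: len(suitable[name]))

CANDIDATES = {  # hypothetical database response
    "ALU": {"ADD", "SUBTRACT", "AND", "OR"},
    "ADD/SUBTRACT": {"ADD", "SUBTRACT"},
}
CHOICE = select_component(CANDIDATES, {"ADD", "SUBTRACT"})
```

Choosing the smallest function set avoids paying area for operations the design never uses.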
4.3.3. Microarchitecture Optimization
At this point the design consists of two levels. One is the microarchitecture 
netlist, the other is a technology-specific gate-level netlist for each microarchitecture 
component. The microarchitecture optimizer first employs rules that make transformations
that should improve both time and area. For example, a register and an
incrementer can be converted into a counter. Next the critical paths are identified. The
optimizer requests faster components from the database, selects different layout 
styles (random logic or bit-sliced), and decides which components to merge and 
apply logic optimization. Once critical paths have been processed, the microarchi-
tecture optimizer operates on non-critical components, making similar decisions as 
in the critical path improvement phase but this time with an eye toward area 
improvements. The microarchitecture optimizer then produces a VHDL netlist that 
is passed to the floorplanner /layout assembler for layout. 
The microarchitecture optimizer uses a new methodology for selecting microarchitecture
components to be used in the design. The microarchitecture optimizer
does not perform component rearchitecting and does not have knowledge of tools for 
logic optimization, transistor sizing, and other component reoptimization 
techniques. Instead, these tasks are left to the component database. The microar-
chitecture optimizer passes a set of time/ area constraints to the database and the 
database examines possible ways to achieve the constraints. The database can 
choose from different architectural styles and can choose from multiple optimization 
tools to redesign the component. This frees the microarchitecture optimizer from 
dealing with technology concerns and having to know what set of component optim-
ization tools exist at any one time. All of this is centralized in the database. 
Integration of the database with the microarchitecture optimizer and the logic 
optimization tools is achieved with two servers [ChGa89] as shown in Figure 11: a
component server, and a knowledge server. The component server is the part that 
interfaces with the microarchitecture optimizer. Queries are made from the 
microarchitecture optimizer to the component server through the Component Query 
Language (CQL) and a list of components or a set of component attributes is
returned. In this manner, the microarchitecture optimizer can simply request the
functions required of a component, a layout implementation style, and a set of
delay parameters. From this information the component database checks its com-
ponent list which includes fixed components (components that have already been 
generated) and parameterized components (those that can be generated when pro-
vided a set of parameters). The component database knows from its component list 
whether a component generator needs to be called to generate a design for the com-
ponent or whether the component design already exists (as in the case of a fixed 
component). Once a component is generated, the database can call an appropriate
logic optimization tool or layout tool.
[Figure 11. Tool Interface with the Component Database. Blocks: Knowledge Server, Component Server, Component Selector, Attribute Retrieval, Fixed Components, Parameterized Components, Component Instantiations, Component Generator Tools, Optimization/Layout Tools. Interface labels: Component Definitions, Component Generator Tools, Component Requests, Component Attribute Requests.]
The knowledge server is used to insert new fixed components, insert new component
generators, and insert logic optimization and layout tools. Thus when a new
logic optimization or layout tool is available, the knowledge server will be accessed
to store information about how to call the new tool. Also designers can build their
own components and insert them into the database through the
knowledge server.
4.3.4. Floorplanning/Layout 
Finally the technology-specific microarchitecture netlist is passed to a layout 
synthesis system that performs floorplanning on the microarchitecture design and 
creates a layout for each component [WuGa90]. The layout tool decides how to
[...] microarchitecture optimizer's selection of bit-sliced components, [...] them to random
logic if doing so will result in [...]. Module generators are called to produce
the bit-sliced layouts and a custom layout generator is called to produce a layout
for each random logic block. The floorplanner selects how to partition the random
logic based on shape sizes that can be used to fill in the bit-sliced logic mismatches.
The floorplanner attempts to place similar-sized bit-sliced components together
and place random logic into slots where mismatches in the length of the bit-sliced
logic occur. Other random logic is placed along the border of bit-sliced components.
CHAPTER 5. 
LOGIC SYNTHESIS FOR CUSTOM LAYOUT 
5.1. Introduction 
Conventional logic synthesis systems produce designs for gate array and stan-
dard cell libraries. Such semi-custom libraries are popular for their simplicity in 
layout and the ability to use automatic layout tools. A major weakness of the 
semi-custom approach is the larger area required for routing, producing a lower lay-
out density than custom methods. In addition, all transistors are of the same size, 
capable of driving some expected average load. This average load is usually greater 
than what is required, making cells larger than necessary. When a larger load than 
the preset value must be driven, speed is sacrificed. 
In order to achieve better layouts, more attention has been focused on tools 
that offer custom layout. Custom layout permits greater control over layout param-
eters, providing smaller area and faster designs than its gate-array or standard cell 
counterparts. For example, Figure 12 shows three layouts for the same logical func-
tion. The first layout was done using standard cells; the second and third, using 
custom layout, were optimized for area and time, respectively, by varying transistor 
sizing and complex gate formation. The second layout has 22% less area than the 
standard cell version and the third layout is 30% faster than the standard cell version.
[Figure 12. Standard Cell Layout vs. Custom Layout: (a) Standard Cell Layout, (b) Custom Layout for Area, (c) Custom Layout for Speed]
In custom layout, cells are generated dynamically instead of having fixed 
libraries and transistors can be sized according to the load that must be driven. 
Thus while standard cell and gate-array layouts are very important for quickly 
bringing chips to market, custom layouts still provide faster designs with smaller 
area. When performance is crucial or a large volume of chips is needed, custom lay-
out is necessary. Custom layouts can be generated from a gate-level netlist either 
by hand or by custom layout generators such as LES [LiGa88], CLAY [KoLu88],
and CLEO [DoLe89]. 
This chapter introduces techniques to generate gate-level netlists that take full 
advantage of the custom layout capabilities. Such techniques include limiting 
serial/parallel transistor chains, transistor sizes, and capacitive loads in forming 
complex gates. These considerations have not been incorporated in previous logic 
synthesis systems. Hence the intended results sometimes fail to be achieved when a 
custom layout is used. By extending existing logic synthesis methods that were 
aimed at standard cell layout, we can produce gate netlists that assure good results 
for custom layout as well. For example, consider Figure 13 which shows two 
different implementations of the same logical function. The design in Figure 13(a)
has been optimized specifically for area. Figure 13(b) displays the implementation 
that has been optimized for time. Both designs were produced using standard cell 
synthesis techniques. When the layout was implemented in standard cells, the 
optimization achieved the desired effect. However, when these implementations 
were run through custom layout generators, the design of Figure 13(b) had the best 
speed AND the best area, as shown in Figure 13(c).
5.2. Parameters for Custom Layout
5.2.1. Complex Gates
Layout driven synthesis is not restricted to a fixed library of components. In 
many cases multi-level Boolean functions can be implemented as a single complex 
gate. Complex gates have fewer connections and fewer transistors than multiple 
(a) Optimized for Area
(b) Optimized for Time
Custom Layout
Optimized for    # of Transistors    Layout Area (um²)    Delay (ns)
Speed            56                  56048                9.61
Area             54                  59800                9.78
(c) Layout Comparison
Figure 13. Custom Layout Results Using Synthesis for Standard Cells
gate implementations and hence may reduce area and improve performance. An 
example of a complex gate is shown in Figure 14(a) and contrasted with the
multiple gate implementation in Figure 14(b) (with inverters assumed to be pushed
back to previous levels). Both implementations are shown in CMOS technology with
cells having a P and N transistor section.
In layout driven synthesis, the optimizer has great flexibility in deciding which 
gates to combine into a single gate. The type of complex gates formed is limited 
mainly by the buildup of capacitance, the load the gate must drive, and the transis-
tor speed. As more transistors are placed in series, the resistance and parasitic 
capacitance of the gate increase, resulting in longer delay times. In CMOS, where 
gates have an N and P transistor section, carrier mobility in P-type transistors is
about half that in N-type transistors. NOR gates, which have P-type transistors in
series, can be 2 to 3 times slower than gates with N-type transistors in series (such 
as NAND gates). This makes it beneficial to limit the number of transistors that a 
complex gate has in series and parallel. In CMOS, fewer P-type transistors should 
be allowed in series than N-type transistors. Hence, a compromise must be made 
between the solution consisting of only single gates and the solution involving com-
plex gates with a large number of transistors. As a general rule, complex gates with 
few transistors are desired along paths where timing is critical while complex gates 
with more transistors are desired in sections where area is most important.
Figure 14. Complex Gate Formation
5.2.2. Transistor Sizing 
Another concern in custom layout is transistor sizing. As discussed in
[FiDu85], [He81], [Ci87], and [ObKa88], choosing proper transistor sizes is key to
circuit performance. Figure 15 demonstrates how changes in transistor size affect
speed. Increasing the size of transistors in Gate B improves its speed. However, the 
larger transistors create a larger capacitive load for Gate A. This slows down Gate 
[Plot: Total Path Delay versus Transistor Size of Gate B]
Figure 15. Effect of Transistor Sizing on Path Delay
A and hence the entire path unless Gate A's transistors are increased based upon 
the new load. The size vs delay relationship is a convex curve as in Figure 15(b). 
Thus sizing a gate's transistors excessively large can increase the total path delay if
the transistor sizes of the gates that drive it are not increased as well. 
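The convex size versus delay relationship can be illustrated with a first-order RC model. This is only a sketch under stated assumptions: Gate A's delay grows linearly with the width of Gate B (its load), Gate B's delay falls inversely with its own width, and all constants are arbitrary illustrative values rather than data from this work.

```python
import math


def path_delay(w_b, r_a=10.0, r_b=10.0, c_gate=1.0, c_load=16.0):
    """Delay of the two-stage path A -> B -> load, in arbitrary units.

    w_b is the transistor width of Gate B relative to a unit transistor.
    """
    delay_a = r_a * (c_gate * w_b)   # A drives B's input capacitance, which grows with w_b
    delay_b = (r_b / w_b) * c_load   # B's drive resistance shrinks as w_b grows
    return delay_a + delay_b


def best_width(r_a=10.0, r_b=10.0, c_gate=1.0, c_load=16.0):
    """Width of Gate B that minimizes path_delay (where its derivative is zero)."""
    return math.sqrt((r_b * c_load) / (r_a * c_gate))


w_opt = best_width()
```

Under this model, widths on either side of `w_opt` give a larger total delay, which is the behavior described for oversized transistors when the driving gate is left unchanged.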
Transistor sizing gives greater control over area/speed tradeoffs. For example,
in CMOS where NOR gates tend to be slower than NAND gates, synthesis systems
try to avoid using NORs or use NORs with fewer inputs. Layout driven synthesis
systems can increase the size of the P transistors, thereby improving speed (i.e.,
making the gate as fast as a NAND) at the expense of increased transistor area.
Since transistor sizing is not variable in standard cells there is no need to con-
sider load when using complex gates. However, for custom layout, gates must be 
combined in such a way to avoid a number of problems. For example, as the load a 
complex gate drives increases, transistor sizes must be made larger to prevent a 
decrease in speed. Larger transistors require strips with greater height creating the 
problem shown by the two layouts in Figure 16. The first layout consists of seven
NAND gates. In the second layout the final three NAND gates have been combined
into a complex gate. Both designs drive an output of 0.1 picofarads (roughly 2-4
fanouts). Even though the transistor area for the complex gate is less than that for
the individual gates, total area has been increased. One can see from Figure 16(b) 
that a tall cell in a strip is not compatible with shorter cells, generating wasted 
space along the top and bottom of the channel. In addition, larger transistor sizes 
(a) Area = 312 * 90 = 28080 um², Trise = 14.78 ns, Tfall = 13.89 ns
(b) Area = 256 * 122 = 31232 um², Trise = 14.44 ns, Tfall = 15.11 ns
Transistor sizes shown inside gates as PFET size/NFET size
Only the last 4 NAND gates' sizes are shown
Figure 16. Effect of Large Transistor Sizes in Complex Gates
increase the load for all gates in the preceding stage, requiring them to be increased 
in size if the present speed is to be maintained. This creates a chain reaction 
affecting all preceding stages that can significantly increase the area. The design of 
Figure 16(b) is slower than that of Figure 16(a). The delays could be made equal
by using even larger transistor sizes, resulting in a much larger area for the same
delay. Hence the formation of complex gates should be dependent upon the load. 
Single gates or complex gates with few transistors should be used to drive heavy 
loads. In general capacitive loads are larger closer to output pins and smaller closer 
to input pins. This indicates that complex gates with a larger number of transistors 
can be created near input pins, while fewer transistors should be used in complex 
gates near output pins. 
5.2.3. Placement and Routing
Control over component placement is important as cells along paths having 
critical delays should be placed close to one another. This prevents long wires from 
connecting them and introducing further delays. Further, large size gates should be 
placed on the edges of the floorplan since routing is sparse near boundaries. Same 
size gates can be placed in the same row to reduce wasted space from uneven cells. 
This increases the routing but decreases overall area. Such a scheme could have 
been used to correct the problem of Figure 16(b).
A related concern is in routing. Routing should be performed on critical com-
ponents first to ensure that they get the shortest paths. In addition, long wires 
should be placed in the metal layer for better speed. A synthesis program can aid 
layout by assigning priorities for the layout program, indicating which components 
need to be placed close together or which cells are similar in size. 
5.2.4. I/O Positioning 
Placement of I/O pins is also crucial for quality layouts. Pins can be placed
near the modules having critical timing or placed in a position that reduces overall 
routing and saves area. They must be kept well distributed to prevent congestion 
in the center of the module and wasted space around the boundary of the module. 
Thus information on good I/O positioning can help layout tools create faster or 
more dense layouts. 
5.3. Strategies for Layout Driven Synthesis 
There are several strategies for controlling complex gate formation and transis-
tor sizing. 
One simple strategy is to form complex gates first and then size the transistors. 
During the complex gate formation stage, transistors are assumed to be unit sized. 
For example, in a 3 micron CMOS technology all NFETs and PFETs would initially 
be given a size of 3 microns. The weakness of this strategy is that transistor sizes do
not influence complex gate formation, and thus complex gates may be created
with many large transistors. This unnecessarily increases the layout area.
Using a second strategy, transistors are sized first and then the complex gates 
created. With this approach more realistic transistor sizes are used in deciding how 
to form complex gates. Those gates with small transistor sizes are used to build 
complex gates; those with large sizes are mostly left untouched. The new complex 
gate's transistor size is based upon the largest transistors of the gates that were 
merged. When this new complex gate is inserted, the gates that drive it may be 
driving larger transistors than they were before. Since these gates must drive a 
larger capacitive load, they are now undersized if the present speed is to be main-
tained. 
The third and most complicated strategy involves forming complex gates and 
sizing transistors at the same time. Before a decision is made to create a complex 
gate, all gates that the new complex gate will drive must be processed. That is, no 
changes in transistor sizes or complex gates can be made along any paths that can 
be reached from the output of the new complex gate. Doing so could change the 
load that the complex gate is required to drive, resulting in the problems
encountered in the second strategy. After a new complex gate is created, all gates that
drive it are resized. This approach will presumably provide the best results. 
5.4. Algorithm for Layout Driven Synthesis 
We present an algorithm for producing high-performance CMOS custom lay-
outs. Our algorithm consists of four phases as shown in Figure 17. The input is a 
set of boolean equations. These equations are first minimized and then factored by 
[Flow: Boolean Equations -> Minimize and Factor -> Time Optimization ->
Technology Mapping -> Transistor Sizing -> Custom Layout Generator]
Figure 17. Algorithm for Layout Driven Synthesis 
MISII to make use of common terms. Parameters for the maximum number of
NFETs and the maximum number of PFETs allowed in series are provided for the 
next phases to place limits on the size of complex gates created. In the second 
phase, the algorithm reduces the number of levels along the longest paths by per-
forming balanced factoring. This technique attempts to partially collapse the criti-
cal path then refactor so as not to exceed the maximum number of transistors 
allowed per gate. The algorithm's third phase then performs technology mapping 
by combining gates into complex gates. The fourth phase sizes the transistors,
producing a design that can be passed to a custom layout generator.
5.4.1. Time Optimization
The general idea of timing optimization is to perform a limited collapse of nodes 
along the critical path. This requires some duplication of logic, creating a design 
with greater breadth and shorter depth. Some logic along critical paths can be 
removed and added along non-critical paths. More than just a partial collapse is 
required, however, to produce high-performance designs. We employ a balanced 
factoring method that addresses the problem of how many transistors to place in 
each gate. Gates are factored to get the number of inputs per gate that achieves 
the best speed rather than attempting to reduce the number of transistors. 
Three operations are used to reduce the number of levels: distribute, extract,
and merge. These operations are shown as applied to a graph of operator nodes in 
Figure 18. Distribution transfers logic across other nodes, allowing some logic to be 
shifted to non-critical paths. Extraction removes non-critical inputs from a node, 
reducing the delay through the node. Merging combines two nodes having the same 
operator. Only two critical nodes should be merged in order to reduce the delay. 
(a) Distribute Operation (b) Extract Operation (c) Merge Operation
Figure 18. Critical Path Reduction Operations
Use of the three operations is illustrated in Figure 19. Figure 19(a) is a design of 
five levels. Distributing node 2 over node 3 produces the design of Figure 19(b) 
which has one fewer level. Using the extract operation then on node 4 balances out 
the paths through that node (Figure 19(c)). Node 2 can be merged into node 0
(Figure 19(d)) and an extract operation applied to node 0 (Figure 19(e)) to balance
the path lengths. In this manner, critical paths can be shortened by shifting logic to 
non-critical paths. 
The input equations which have been minimized and factored are converted 
into a tree structure similar to that in Figure 19. Nodes may have only as many 
inputs as the number of transistors allowed in series. If performing the optimization 
at a node creates a new longest path or requires the addition of too many nodes, the 
optimization will not be performed. We perform the timing optimization going from 
inputs to outputs. Duplication of logic through distribute operations is preferred 
closer to input pins as there are fewer gates to be duplicated. Also transistor sizes
are smaller closer to the inputs and duplicated logic can be placed in complex gates 
with more transistors. These factors help to contain the extra area that results. 
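The effect of the merge operation and of the per-node input limit can be sketched on a small operator tree. This is an illustrative sketch only: the `Node` class, the `depth` helper, and the `max_inputs` parameter (standing in for the limit on transistors in series) are assumptions made for the example, not the data structures of the actual tool.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)          # compare nodes by identity, not by contents
class Node:
    op: str                   # "and", "or", or "in" for a primary input
    name: str = ""
    inputs: list = field(default_factory=list)


def depth(node):
    """Number of operator levels from this node down to the primary inputs."""
    if node.op == "in":
        return 0
    return 1 + max(depth(i) for i in node.inputs)


def merge(parent, child, max_inputs):
    """Fold child into parent when both use the same operator.

    The merge is refused if the parent would end up with more inputs
    than max_inputs.
    """
    if parent.op != child.op or child not in parent.inputs:
        return False
    if len(parent.inputs) - 1 + len(child.inputs) > max_inputs:
        return False
    parent.inputs.remove(child)
    parent.inputs.extend(child.inputs)
    return True


# a AND (b AND (c AND d)) starts with three operator levels.
a, b, c, d = (Node("in", n) for n in "abcd")
n2 = Node("and", "n2", [c, d])
n1 = Node("and", "n1", [b, n2])
n0 = Node("and", "n0", [a, n1])
```

Calling `merge(n0, n1, max_inputs=3)` removes one level, while a further `merge(n0, n2, max_inputs=3)` is refused because it would give node `n0` four inputs.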
Figure 19. Example of Shortening Critical Path
Algorithm 1: Balanced Factoring
{restructure critical paths}
Let
    V be the set of vertices in the design;
    CP be the set of vertices along the critical path;
procedure balanced_factor(V)
begin
    CP_0 = find_critical_path(V);
    can_improve = TRUE;
    while (can_improve)
        balance(V, CP_0)
        CP_n = find_critical_path(V);
        if (CP_0 = CP_n) then
            can_improve = FALSE
        else
            CP_0 = CP_n;
        end
    end
end
{Redistribute Logic Along the Critical Path}
Let
    D[i,V] be the cumulative delay at node i of design V;
    D_accept be a delay reduction constraint;
    AN[i] be the increase in number of nodes due to optimization of node i;
    A_accept be an area increase constraint;
procedure balance(V, CP)
begin
    for all i ∈ CP
        V_new = distribute(V, i, i+1);
        V_new = extract(V_new, i+1);
        V_new = merge(V_new, i+1, i-1);
        V_new = extract(V_new, i-1);
        if (D[i,V] - D[i-1,V_new] > D_accept) and (AN[i] < A_accept) then
            V = V_new;
        end
    end
end
{Distribute node i over node i+1}
Let
    S_i be the set of input nodes to node i;
procedure distribute(V, i, c)
begin
    remove c from S_i;
    for all j ∈ S_c
        if operator(i) = operator(j) then
            for all k ∈ S_i
                l = duplicate_tree(k);
                add l to S_i;
            end
            extract(V, j);
        else
            m = duplicate_tree(i);
            remove j from S_c;
            add m to S_c;
            add j to S_m;
            extract(V, m);
        end
    end
    add c to S_(i-1);
end
{Remove Non-Critical Inputs From Critical Nodes}
Let
    CD_i be the cumulative delay at the critical input of node i;
    S_i be the set of inputs of node i;
    T_c be the transistor limit constraint;
    P be the delay percentage;
procedure extract(V, i)
begin
    input_cnt = T_c;
    sort S_i by delay (worst delay to least delay)
    if (length(S_i) > 1) then
        k = node_duplicate(i);
    end
    if (length(S_i) > T_c) then
        for all j ∈ S_i
            if (input_cnt = T_c and length(S_i) > 1) then
                n = node_duplicate(i);
                add n to S_k;
                k = n;
                input_cnt = 1
            else
                input_cnt = input_cnt + 1;
            end
            add j to S_k;
            remove j from S_i;
        end
    end
    num_inputs = 0;
    S'_i = S_i;
    k = node_duplicate(i);
    for all j ∈ S_i
        if D_j < (P * CD_i) then
            add j to S_k;
            remove j from S_i;
            num_inputs = num_inputs + 1;
        end
    end
    if (num_inputs > 1) then
        add k to S_i;
    else
        S_i = S'_i;
    end
end
{Combine Critical Nodes}
procedure merge(V, N_i, N_(i-1))
begin
    if (operator(N_i) = operator(N_(i-1))) then
        for all j ∈ N_i
            remove j from N_i;
            add j to N_(i-1);
        end
    end
end
5.4.2. Technology Mapping
The third phase involves technology mapping. To perform the mapping, esti-
mates of the gate delay times are made in order to identify the critical path. Each 
gate is converted to a NAND/NOR gate and assigned a delay based upon the
number of NFETs in series, the number of PFETs in series, and the load that the 
gate must drive. The critical path is found using a method similar to that discussed 
in [YeGh88].
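Once each gate has an estimated delay, the critical path is the slowest path through the gate-level graph. The sketch below is a plain longest-path traversal of an acyclic graph, given only as an illustration; it is not the method of [YeGh88], and the dictionary-based graph encoding and the gate names are assumptions made for the example.

```python
def find_critical_path(fanin, delay):
    """Return (arrival_time, path) for the slowest path in a gate-level DAG.

    fanin maps each gate to the list of gates feeding it (primary inputs
    map to an empty list); delay maps each gate to its estimated delay.
    """
    arrival, pred = {}, {}

    def visit(gate):
        if gate in arrival:
            return arrival[gate]
        best, best_src = 0.0, None
        for src in fanin[gate]:
            t = visit(src)
            if best_src is None or t > best:
                best, best_src = t, src
        arrival[gate] = best + delay[gate]
        pred[gate] = best_src
        return arrival[gate]

    end = max(fanin, key=visit)      # gate with the latest arrival time
    path, gate = [], end
    while gate is not None:          # walk back along the slowest fan-in
        path.append(gate)
        gate = pred[gate]
    return arrival[end], path[::-1]


# Hypothetical five-node example with delays in nanoseconds.
fanin = {"a": [], "b": [], "g1": ["a", "b"], "g2": ["g1", "b"], "out": ["g2", "a"]}
delay = {"a": 0.0, "b": 0.0, "g1": 2.0, "g2": 3.0, "out": 1.5}
worst_time, worst_path = find_critical_path(fanin, delay)
```

In a fuller model the per-gate delay would itself depend on the number of series NFETs and PFETs and on the load, as described above; here it is simply supplied as a number.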
The next step is complex gate formation. The algorithm forms complex gates 
first along the critical path. The combining algorithm goes from input pins to out-
put pins, which tends to produce large complex gates near the inputs and small 
complex gates at the outputs (as there are fewer gates left to combine). Complex 
gate formation is restricted by the user entered parameters for maximum transistors 
in series, the number of fanouts that the gate must drive, and the amount of slack. 
We have found that single gates or complex gates in which the critical path passes 
through only one gate level of the complex gate are best along the critical path. 
Figure 20 shows an example demonstrating this with times in a 3
micron CMOS technology. The complex gate formed in Figure 20(b) has only a
small effect on the delay through the critical path. However, the complex gate
achieves a reduction of active transistor area (routing area is not included). Figure
20 also shows that if the critical path ran from A to F, the complex gate should not
be formed. For purposes of comparison in this example, it is assumed that all 
inverters that must be added to some inputs of the complex gate (for the designs of 
Figure 20(a) and Figure 20(b) to be equivalent) are pushed back to the previous
stages. 
Path    Rise Time (ns)    Fall Time (ns)    Avg. Time (ns)
A->F    8.3               7.7               8.0
C->F    5.5               4.0               4.75
(a) Active Transistor Area: 96 um²
Path    Rise Time (ns)    Fall Time (ns)    Avg. Time (ns)
A->F    15.6              5.0               10.3
C->F    3.4               5.0               4.2
(b) Active Transistor Area: 72 um²
All transistors have size 12 um²
Figure 20. Area/Delay Considerations for 2-Level Complex Gates
Gates along the non-critical paths are combined into complex gates in the
fourth phase. The worst case path to each output is processed first to prevent the
initial combining from being along the short paths. Figure 21 shows the delay and
area for a complex gate with more transistors than Figure 20. It demonstrates that
complex gates with more transistors are slower for some paths, but yield a savings
in transistor area. Note that the complex gate of Figure 21(b) could be made just
as fast as its individual gate implementation. This would require larger transistor 
Path    Rise Time (ns)    Fall Time (ns)    Avg. Time (ns)
A->F    12.0              10.5              11.25
C->F    8.3               7.7               8.0
D->F    ?                 ?                 ?
(a) Active Transistor Area: 1?? um²
Path    Rise Time (ns)    Fall Time (ns)    Avg. Time (ns)
A->F    ?                 ?                 ?
C->F    11.11             3.56              7.33
D->F    ?                 ?                 ?
(b) Active Transistor Area: 96 um²
All transistors have size 12 um²
Figure 21. Area/Delay Considerations for 3-Level Complex Gates
sizes, however, and hence the complex gate consumes a larger layout area in order 
to match the delay. Therefore, paths with much smaller delays than the critical
path can have complex gates with more transistors. 
Algorithm 2: Complex Gate Generation
{combining gates into complex gates}
Let t be the sink node and s be the source node in the graph;
    C_i be the input capacitance of vertex i associated with all the fanout pins;
    S_i be the slack of vertex i (required delay - actual delay);
    C_high and C_low be the fanout constraints;
    S_small and S_large be the slack constraints;
    Q be a set of vertices.
PROCEDURE combining(V, Q)
BEGIN
    FOR (i = t to s in Q)
    BEGIN
        IF (mark[i] = false)
        BEGIN
            IF (C_i > C_high OR S_i < S_small)
                combine gates into two level three input complex gate;
            ELSE IF (C_i > C_low OR S_i < S_large)
                combine gates into two level four input complex gate;
            ELSE
                combine as many gates as possible into complex gate;
            mark[i] = true; {where i is in V}
        END;
    END;
END;
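The selection rule at the heart of Algorithm 2 can be written as a small function. This is a sketch of the decision only, not of the gate merging itself; the function name, the returned labels, and the threshold values in the example are assumptions chosen for illustration.

```python
def choose_complex_gate(c_i, s_i, c_high, c_low, s_small, s_large):
    """Pick how aggressively to combine gates at vertex i.

    c_i is the capacitive load (fanout) seen by the vertex and s_i its slack
    (required delay minus actual delay); the remaining arguments are the
    fanout and slack constraints.
    """
    if c_i > c_high or s_i < s_small:
        return "two-level, three-input"
    if c_i > c_low or s_i < s_large:
        return "two-level, four-input"
    return "as many gates as possible"


# Hypothetical constraint values: load in picofarads, slack in nanoseconds.
limits = dict(c_high=0.2, c_low=0.1, s_small=1.0, s_large=4.0)
heavy_load = choose_complex_gate(c_i=0.3, s_i=6.0, **limits)
medium_slack = choose_complex_gate(c_i=0.05, s_i=2.0, **limits)
relaxed = choose_complex_gate(c_i=0.05, s_i=6.0, **limits)
```

The ordering of the tests reflects the rule that heavily loaded or nearly critical vertices receive the smallest complex gates.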
5.4.3. Transistor Sizing 
The fourth phase is transistor sizing. Transistor sizing is performed first on the 
critical path to ensure optimal sizes for speed. Transistors in gates along non-
critical paths can then be sized. The N and P transistor sizes are chosen to achieve
equal rise and fall times for the gate. The sizing algorithm proceeds from outputs to 
inputs basing size upon the capacitive load that must be driven. To achieve nearly 
optimal sizes for speed we examine the gate to be sized and its preceding stage. 
The graph of a gate's transistor size plotted against total path delay is a convex 
curve [He87]. We examine the effect of sizing on both gates, thus considering two 
curves. The point where these two curves intersect is the transistor size chosen. 
Further details of the transistor sizing algorithm can be found in [WuVG90].
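The goal of equal rise and fall times can be illustrated with a first-order resistance model. This sketch is not the sizing algorithm of [WuVG90]: it ignores the load and the preceding stage, and it assumes that transistor resistance is inversely proportional to width, that series transistors add their resistances, and that a PFET has twice the resistance of an NFET of equal width (the mobility ratio quoted earlier). The function name and default values are illustrative.

```python
def size_for_equal_rise_fall(n_series, p_series, w_unit=3.0, mobility_ratio=2.0):
    """Return (w_p, w_n): widths that equalize pull-up and pull-down resistance.

    n_series and p_series are the numbers of NFETs and PFETs in series;
    w_unit is the unit transistor width (3 microns in the running example).
    """
    w_n = w_unit * n_series                    # widen NFETs to offset stacking
    w_p = w_unit * mobility_ratio * p_series   # widen PFETs for stacking and mobility
    return w_p, w_n


inverter = size_for_equal_rise_fall(1, 1)
nand2 = size_for_equal_rise_fall(2, 1)
nor2 = size_for_equal_rise_fall(1, 2)
```

Under these assumptions a two-input NOR needs PFETs four times as wide as its NFETs, which is one way to see why NOR gates are costly to make as fast as NAND gates.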
Algorithm 3: Combine-Then-Size Strategy
{This algorithm is to combine gates into complex gates, and then to size the transistors to obtain
optimum speed}
Let V, E be the vertex and edge sets of the graph G;
    where V is the set of all gates in G;
    and E is the set of all connections between gates in G;
Let V_critical be the vertex set of the critical path;
BEGIN
    restructure_longest_path(V, V_critical);
    find_critical_path(V, V_critical);
    {combine gates along critical path}
    combining(V, V_critical);
    {combine gates along non-critical paths}
    combining(V, V);
    {size transistors along critical path}
    transistor_sizing(V_critical, V);
    {size transistors along non-critical paths}
    transistor_sizing(V, V);
END
5.5. Results 
Our synthesis tool is currently running on SUN 3 workstations under the UNIX 
operating system. Synthesized designs with complex gates and sized transistors are 
passed to LES for layout generation and then to GDT¹ [BuMa85] for simulation and
comparison. We have run a number of examples and compared our results with
those of MISII (OCTTOOLS release 2.0). They are shown in Table 1. To achieve
a fair comparison, the output of MISII (which does not size transistors) was run
through our transistor sizing routine before passing it on to LES. The LES layout
was passed on to GDT to perform the simulation. The results for our layout driven
synthesis algorithm (LDS) were obtained by first minimizing and factoring the
design using MISII, then applying the timing optimization, complex gate formation,
and transistor sizing before passing the circuit to LES and GDT. Tables 2 and 3
display a number of MCNC benchmark examples with comparisons to
MISII (OCTTOOLS release 3.1). Table 2 compares custom layout results; Table 3
was generated using the MCNC standard cell library. Because of the size of these
designs, we did not obtain actual layout results but used estimates for time (which
we have found to be within ±10% of the actual value) and the active transistor area
(not including the routing area). The following commands were used to perform the
logic synthesis in MISII.
¹ GDT is a registered trademark of Silicon Compiler Systems.
         Complexity   Area (um²)                           Time (ns)
Design   (# Gates)    MISII     LDS       % improvement    MISII   LDS     % improvement
P147     18           72,??1    62,129    14.1             11.84   11.26   4.9
P191     17           62,304    62,708    -0.8             16.55   11.33   31.5
F1       20           76,903    51,000    33.7             17.11   11.50   32.8
F2       29           128,0??   81,510    36.3             21.11   12.61   40.3
z4mlc    6?           330,038   270,150   18.1             26.28   12.56   52.2
Table 1. Custom Layout Results for MISII and the LDS Algorithm
           Active Transistor Area (um²)       Time (ns)
Design     MISII    LDS      % increase    MISII   LDS     % improvement
9symml     16,848   16,884   0.2           45.03   41.61   7.7
z4ml       5,340    6,636    24.2          20.16   16.69   17.2
b9         7,764    14,820   91.8          25.87   20.68   20.1
f51m-hdl   8,676    12,168   40.2          33.35   26.30   21.1
f51m       10,944   14,184   29.6          29.65   25.18   15.1
att12      6,564    11,040   68.2          45.95   23.76   48.3
Table 2. Custom Layout Results Using MCNC Benchmarks
           Area                            Time
Design     MISII   LDS    % increase    MISII   LDS    % improvement
9symml     382     465    21.7          19.6    15.4   21.4
z4ml       119     116    -2.5          10.1    6.6    34.7
b9         237     338    42.6          8.9     8.4    5.6
f51m-hdl   216     284    31.5          10.8    10.7   0.9
f51m       269     295    9.7           11.1    11.3   -1.7
att12      218     256    17.4          14.4    11.6   19.4
Table 3. Standard Cell Layout Results Using MCNC Benchmarks
source script 
rlib les_cus.lib 
map -m.75 
phase -g 
speed_up -w 0 
map -m.75 
The script used is the standard script provided with OCTTOOLS release 2.0. We 
found that it produced superior results for timing than the standard script in OCT-
TOOLS release 3.1. The technology file les_cus.lib contains delay and area estima-
tions for gates in the SCMOS 3 micron technology. Our algorithm also uses this 
table to perform delay estimations. Generally synthesis with layout constraints pro-
duced designs up to 48 percent faster with an average of 30 percent more area. The 
area for our designs could be reduced by performing area optimizations along the 
non-critical paths. Currently no area optimizations are performed. Tables 2 and 3 
demonstrate that our algorithm improves performance for both standard cell and 
custom layout designs. 
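The percentage columns in the tables can be reproduced from the raw values. The sketch below assumes that both percentages are taken relative to the MISII figure, which agrees with the att12 row of Table 2; the function names are illustrative.

```python
def pct_time_improvement(misii, lds):
    """Reduction in delay achieved by LDS, as a percentage of the MISII delay."""
    return 100.0 * (misii - lds) / misii


def pct_area_increase(misii, lds):
    """Growth in area for LDS, as a percentage of the MISII area."""
    return 100.0 * (lds - misii) / misii


# att12 row of Table 2: delay 45.95 ns vs 23.76 ns, area 6,564 vs 11,040 um².
att12_time = round(pct_time_improvement(45.95, 23.76), 1)
att12_area = round(pct_area_increase(6564, 11040), 1)
```

This row is the source of the "up to 48 percent faster" figure quoted above.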
Figure 22 displays a composite graph showing our ability to perform area/time 
tradeoffs. Our results are normalized against standard cells whose reference point is 
shown at time = 1, area = 1. Several points are noted on the graph. Point A 
[Plot: Area (vertical axis) versus Delay (horizontal axis), with the Standard Cell
reference point and the Minimal Transistor Sizes endpoint marked]
Figure 22. Composite Graph of Experimental Results
shows that synthesis considering layout produces designs with smaller area and fas-
ter speed. Point B illustrates the capability to save even more area while having a 
larger delay than the standard cell synthesis approach. Similarly, Point C shows 
that speed can be improved further at the expense of area. The dashed section of 
the curve ends when the minimal transistor sizes are used (PFET size = 2, NFET 
size = 1). This part of the curve shows expected results as we have not actually 
tested this. Another observation is that, using custom layout, the speed can be kept
constant while varying the output load (fanout). Of course, there is an associated
increase in the area. As the fanout increases in standard cells, however, the
speed is decreased. 
90 
table to perform delay estimations. Generally synthesis with layout constraints pro-
duced designs up to 48 percent faster with an average of 30 percent more area. The 
area for our designs could be reduced by performing area optimizations along the 
non-critical paths. Currently no area optimizations are performed. Tables 2 and 3 
demonstrate that our algorithm improves performance for both standard cell and 
custom layout designs. 
Figure 22 displays a composite graph showing our ability to perform area/time 
tradeoffs. Our results are normalized against standard cells whose reference point is 
shown at time == 1, area == 1. Several points are noted on the graph. Point A 
Area 
Standard Cell 
/ 
B ... -------• 
0.5 1 
Minimal 
Transistor 
/ Sizes 
Delay 
Figure 22. Composite Graph of Experimental Results 
91 
shows that synthesis considering layout produces designs with smaller area and fas-
ter speed. Point B illustrates the capability to save even more area while having a 
larger delay than the standard cell synthesis approach. Similarly, Point C shows 
that speed can be improved further at the expense of area. The dashed section of 
the curve ends when the minimal transistor sizes are used (PFET size = 2, NFET 
size = 1). This part of the curve shows expected results as we have not actually 
tested this. Another observation is that using custom layout, the speed can be kept 
speed constant while varying the output load (fanout). Of course, there is an asso-
ciated increase in the area. As the fanout increases in standard cells, however, the 
speed is decreased. 
CHAPTER 6. 
MICROARCHITECTURE OPTIMIZATION
6.1. Introduction 
This chapter examines methods for performing microarchitecture optimization, 
then presents algorithms for incorporating these methods. 
6.2. Types of Microarchitecture Optimization
The goal of microarchitecture optimization is to optimize the design for 
area/time without changing the state assignment. This section describes the types 
of optimizations that can be performed. 
6.2.1. Minimization
This type of optimization should be one of the first to be applied. It reduces 
the number of components or the amount of logic in a component. Figure 23(a) 
and Figure 23(b) show examples of minimization rules. Figure 23(a) shows the
removal of the redundant signal A as an input to the multiplexor. Figure 23(b)
shows the replacement of an adder that has two constant inputs by their sum.
[Figure: (a) a MUX4 on which signal A appears twice is reduced to a MUX3 with
inputs A, B, and C; (b) an adder with constant inputs 4 and 5 is replaced by the
constant 9.]
Figure 23. Minimization Rules
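The two minimization rules can be sketched in a few lines of Python. This is only an illustration, not the rule engine used in this work: the function names, the list-based multiplexor representation, the input ordering, and the convention that the select lines of a merged input are ORed together are all assumptions.

```python
def merge_duplicate_mux_inputs(inputs, selects):
    """Collapse repeated data inputs of a multiplexor (hypothetical helper).

    Returns (signal, select_names) pairs. A signal that appeared on several
    inputs keeps all of its select lines, which are assumed to be ORed
    together to drive the single remaining input.
    """
    merged = {}
    for signal, select in zip(inputs, selects):
        merged.setdefault(signal, []).append(select)
    return [(signal, tuple(sels)) for signal, sels in merged.items()]


def fold_constant_adder(a, b):
    """Replace an adder whose two inputs are constants by their sum."""
    return a + b


# Illustrative input ordering; the exact pin order of Figure 23(a) is assumed.
reduced = merge_duplicate_mux_inputs(["A", "B", "C", "A"],
                                     ["S0", "S1", "S2", "S3"])
folded = fold_constant_adder(4, 5)
```

The four-input multiplexor collapses to three inputs, and the constant adder of Figure 23(b) collapses to a single constant.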
6.2.2. Factorization
Factorization is used to extract early arriving signals in order to speed up late
arriving ones. It may also be necessary to factor components in order to meet the
requirements of a layout module generator. For example, module generators may
only be able to construct 4 to 1 or 2 to 1 multiplexors. Figure 24 illustrates the
factorization of a multiplexor.
Procedure 4.1 describes the factoring algorithm. The algorithm factors a single
component having $|R|$ inputs, $R$ being the set of all required input to output delays.
The procedure Factor is recursive and takes five parameters: 1) $c$, which indicates
which input of the parent component the factored out inputs should be connected
to, 2) the set $R$, 3) $n$, the maximum number of inputs to be factored out of the
parent component, 4) $s$, the size of the last component that failed to meet the
constraints, and 5) $C_0$, the component to be factored.
[Figure: a MUX4 with inputs A, B, C, and D and select lines S0 to S3 is factored
into a MUX3 fed by a MUX2.]
Figure 24. Factorization of Microarchitecture Components
Let: R = {r | required delay from input to output};
     each component Ci has delay di and si inputs;
     C0 be the component to be factored;
     n = number of component inputs that still need to be assigned;
     s = last tried component size that failed to meet the constraints;
     Ds = smallest delay through any multiplexor that can be generated by the database
Function Factor(c, R, n, s, C0)
Begin
start:
    n1 = n;
    R1 = R;
    C1 = find_new_component(R, n, s);
    if (C1 != {})
        if (c > 0)
            assign C1 to the c-th input of C0
        for (i = 1; i <= s1; i++)
            if (min(r - d1) > Ds && ((n - s1 + i) > 1))
                n = n - Factor(i, R = {rj = rj - d1}, n - s1 + i, s1, C1);
            else
                rj = smallest r in R;
                assign rj to i-th input of C1;
                R = R - {rj};
                n = n - 1;
            if (n == 0) return(n1);
        if (|R| == n)
            /* Not able to assign all inputs, try again */
            n = n1;
            s = s1;
            R = R1;
            goto start;
    return(n1 - n);
End
Function find_new_component(R, n, s)
Begin
    largest_allowable_delay = min(r);
    max_number_of_inputs = min(s - 1, n);
    if (there exist database components Ci such that
        si <= max_number_of_inputs &&
        di <= largest_allowable_delay)
        select component C1 such that s1 >= all si
    else
        C1 = {};
    return(C1);
End
Procedure 4.1
The factoring algorithm begins by sorting the set of required delays, R, from
smallest delays to largest delays. Then the database is queried to find the same
type of component but with fewer inputs. For example, consider Figure 25. In
Figure 25, the database is shown to have returned three components having six or
fewer inputs. The 2-input multiplexor has a delay of 2ns, the 4-input multiplexor
has a delay of 5ns, and the 6-input multiplexor has a delay of 7ns. Figure 25 shows
the factoring process for a six to one multiplexor. The set of required delays, R, is
shown to be (5, 5, 6, 6, 7, 9) for inputs A through F, respectively. Since the six
input multiplexor did not meet the required delays, the next smallest one is
selected. In this case it is the four input multiplexor.
The next stage is to assign the inputs to this new component. All inputs whose
required delays will not be satisfied if they are factored out (i.e., whose required
delay is less than the delay through the new component plus the smallest possible
delay through any component of the same type) are connected directly to the new
component. The remaining signals represent those that can be factored out. The
algorithm queries the database again to find the component with the most inputs
that will still meet the timing constraints when the signals are extracted. This
component will then be processed recursively in a similar manner. When a solution
is found that meets the time constraints, the algorithm ends.
For the example of Figure 25, input A cannot be factored out of the 4 to 1
multiplexor or its timing constraint of 5 will not be met. That is, the delay of the four
input multiplexor (with delay of 5) plus the delay of the smallest multiplexor
(2-input MUX with delay of 2) is greater than the required delay of 5 for input A. For
this reason, input A is connected directly to the 4-input multiplexor (Figure
25(b)). The set R then becomes (5, 6, 6, 7, 9) with only five more inputs to be
assigned. For similar reasons, inputs B, C, and D are assigned directly to the 4-input
multiplexor as shown in Figure 25(c). At this point, not all inputs have been
assigned and there are no unused multiplexor input ports. Therefore, using the 4 to
1 multiplexor has failed. The set R is reset to the original (5, 5, 6, 6, 7, 9) and
another attempt is made using a smaller multiplexor. If a 2 to 1 multiplexor is
used, inputs A and B can be factored out using a second 2 to 1 multiplexor (Figure
25(d)). The time constraints are still met and the new set R is (6, 6, 7, 9). The
largest multiplexor that can be used to factor out input C is a 2 to 1 multiplexor
(Figure 25(e)). In addition, input C can be factored out again using another 2 to 1
multiplexor and the time constraints are still met (Figure 25(f)). In a similar
fashion, inputs D, E, and F can be assigned as in Figure 25(g).
[Figure: the component database returns multiplexors with (size, delay) = (2, 2),
(4, 5), (6, 7); the set of required delays for inputs A through F is
R = (5, 5, 6, 6, 7, 9). Panel (a) shows the initial implementation and panels (b)
through (g) show successive factoring steps.]
Figure 25. Example of Factorization
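The test that decides whether a candidate multiplexor can serve as the new component can be sketched as follows. This is a simplified feasibility check written for illustration, not the full recursive Procedure 4.1; the function names and the dictionary representation are assumptions, and the comparison follows the prose (an input may be factored out when its required delay is at least the sum of the two delays).

```python
def split_inputs(required, component_delay, smallest_delay):
    """Partition inputs into those wired directly and those that may be factored out.

    required maps an input name to its required input-to-output delay.
    """
    threshold = component_delay + smallest_delay
    direct = [name for name, r in required.items() if r < threshold]
    factorable = [name for name, r in required.items() if r >= threshold]
    return direct, factorable


def try_component(required, size, component_delay, smallest_delay):
    """Return (feasible, direct, factorable) for one candidate component."""
    direct, factorable = split_inputs(required, component_delay, smallest_delay)
    # Directly connected inputs still pass through the candidate itself.
    meets_delay = all(required[name] >= component_delay for name in direct)
    # Factored-out inputs need at least one free port to attach a sub-component.
    free_ports = size - len(direct)
    enough_ports = free_ports >= 0 and (not factorable or free_ports >= 1)
    return meets_delay and enough_ports, direct, factorable


# Data of Figure 25: required delays for inputs A through F.
R = {"A": 5, "B": 5, "C": 6, "D": 6, "E": 7, "F": 9}
six_input = try_component(R, size=6, component_delay=7, smallest_delay=2)
four_input = try_component(R, size=4, component_delay=5, smallest_delay=2)
two_input = try_component(R, size=2, component_delay=2, smallest_delay=2)
```

Under these assumptions the 6-input and 4-input multiplexors are rejected and the 2-input multiplexor is accepted, which agrees with the walk-through of Figure 25.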
6.2.3. Swap Equivalent Signals on the Same Component
If two signals on a component are interchangeable and one has less delay than
another, the early arriving signal can be swapped with the late arriving signal.
Figure 26 demonstrates how this can be accomplished. Swapping of component pins
can be described as follows. Let $I(c) = \{i_j \mid j = 1..n\}$ be a set of equivalent inputs to
a component $c$, where $i_j = \{a_j, r_j, s_j\}$. $a_j$ is the arrival time, $r_j$ is the required time,
and $s_j = r_j - a_j$ is the slack.
[Figure: two equivalent inputs of a component with output F are exchanged; the
delay from A to F is 2.0ns, the delay from D to F is 3.0ns, and signal D is on the
critical path.]
Figure 26. Signal Swapping.
Let $T = \{i_j \mid s_j < 0,\ j = 1..n\}$ be a set of critical inputs, and
$N = \{i_j \mid s_j \ge 0,\ j = 1..n\}$ be a set of non-critical inputs. Sets T and N can
then be sorted according to each pin's slack. Swapping of pins then takes place as
shown in Procedure 4.2. The algorithm tests whether a pin from T can be swapped
with a pin from N. If doing so does not create a new critical path, the pins are
swapped.
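The swapping test can be sketched in Python as follows. The sketch assumes that T is sorted with the most negative slack first and N with the largest slack first; the text only states that both sets are sorted by slack, so this ordering, the function name, and the (pin, slack) tuples are illustrative choices.

```python
def swap_pins(critical, non_critical):
    """Return the (critical_pin, non_critical_pin) pairs that would be swapped.

    critical and non_critical are lists of (pin_name, slack) tuples.
    """
    t = sorted(critical, key=lambda pin: pin[1])                     # worst slack first
    n = sorted(non_critical, key=lambda pin: pin[1], reverse=True)   # most spare slack first
    swaps = []
    k = 0
    for pin, slack in t:
        if k >= len(n):
            break
        other, other_slack = n[k]
        # Swap only if the non-critical pin has enough slack to absorb the violation.
        if abs(slack) <= abs(other_slack):
            swaps.append((pin, other))
            k += 1
    return swaps


accepted = swap_pins([("D", -1.0)], [("A", 2.0), ("B", 0.5)])
rejected = swap_pins([("D", -3.0)], [("A", 2.0)])
```

A critical pin is swapped only when the chosen non-critical pin has at least as much positive slack as the critical pin has negative slack, so the swap does not create a new critical path.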
Let ABS() be the absolute value function
Procedure Swap_Pins (T, N)
Begin
    k = 0
    For j = 0 to |T|
    Begin
        ij = the jth pin of T;
        st = the slack of pin ij;
        ik = the kth pin of N;
        sn = the slack of pin ik;
        If ABS(st) <= ABS(sn) then
        Begin
            swap(ij, ik);
            k = k + 1;
        End
        End If
    End
End
Procedure 4.2
6.2.4. Merge Similar Units
Two components can be merged when one of them performs a subfunction of
the other. An example is shown in Figure 27, where a register and a shifter are
combined into a register that performs a shift. Merging rules examine connectivity
between two components and their functionality. Functionality of components can
be found by querying the database for a list of functions that the microarchitecture
component performs. The merge can be performed for two components, $c_0$ and
$c_1$, when $\mathrm{function}(c_0) \subset \mathrm{function}(c_1)$. For example, in Figure 27, the function
shift is a function that can also be performed as part of the register component.
Thus a register that does not perform a shift and a shifter can be combined into a
single shift register.
[Figure: an N-bit register, an N-bit shifter, and a multiplexor are replaced by an
N-bit shift register and a smaller multiplexor.]
Figure 27. Merge Similar Units
Merging of similar components is accomplished in two subphases: 1) same type
component merging, and 2) different type component merging. Same type
component merging is accomplished by an algorithm that proceeds from the design's
input pins to the design's output pins, examining whether two components that are
of the same type are connected together (e.g., two multiplexors, two adders, etc.).
The algorithm checks a list of valid component types for merging. If a match is
found, the merging procedure continues; otherwise the next set of components is
examined. Then, there are three cases that occur when merging components:
(1) A component is only connected to a component of similar function to itself as
in Figure 28(a).
(2) A component is connected to multiple components, some of which are of a
similar function, some of which are of a different function. An example of this is in
Figure 28(b).
(3) A component is connected to multiple components, all of which are of the same
function type.
[Figure: (a) a 2 to 1 MUX C1 feeding a 2 to 1 MUX C2; (b) a 2 to 1 MUX C1
feeding a 2 to 1 MUX C2 and an adder C3; (c) a 2 to 1 MUX C1 feeding a 2 to 1
MUX C2 and a 3 to 1 MUX C3.]
Figure 28. Three Possible Merging Cases
For case 1 occurrences, the two components are merged. Thus the design of
Figure 28(a) becomes the design of Figure 29(a). In a case 2 occurrence, the two
components $c_1$ and $c_2$ are merged to create a new component, but $c_1$ must remain
connected to those components which are of different types. Thus the design of
Figure 28(b) becomes that of Figure 29(b). In case 3 occurrences, component $c_1$ is
merged with all components that its output is connected to. For example, the
design of Figure 28(c) becomes the design of Figure 29(c). Though the design of
Figure 29(c) is more expensive than that of Figure 28(c), it is used as an
intermediate step in optimization. This is discussed in further detail later.
Merging of different type components is performed using rules. There is one
rule for each type of merge operation. For example, a rule to perform the
optimization of Figure 27 is shown in Figure 30. If the connectivity of the components is
found to be similar to that of Figure 27, then the component database is queried to
produce the new set of components which are substituted into the design.
[Figure: (a) the two multiplexors of Figure 28(a) merged into a 3 to 1 MUX; (b) a
3 to 1 MUX alongside the retained 2 to 1 MUX and the adder; (c) a 3 to 1 MUX and
a 4 to 1 MUX.]
Figure 29. Results of Merging
If there is a component C1 with functionality = register
AND there is a component C2 with functionality = shifter
AND output Q of C1 is connected to input I of component C2
AND there is a component C3 with functionality = multiplexor
AND output Q of C1 is connected to input I of C3
AND output O of C3 is connected to input D of C1
Then
    C4 = Query Component Database for a shift register
    C5 = Query Component Database for a multiplexor with two
         fewer inputs than C3
    Replace C1, C2, and C3 with C4 and C5
Figure 30. Rule for Merging
6.2.5. Merge Unsimilar Units
Two components can be merged into a single component that performs a
different function than any of the original units. For example, a register and an
incrementer can be combined into a counter, as in Figure 31. Rules for this type of
merging are similar to those for merging similar functional units. In this case,
however, for two components, $c_0$ and $c_1$,
$\mathrm{function}(c_0) \cup \mathrm{function}(c_1) \subseteq \mathrm{function}(C_1)$, where $C_1$ is a
component that can be generated by the component database. For example, in Figure
31 the register and incrementer are both subfunctions of a counter component that
can be generated by the database. Mergeability can be determined by querying the
database with a list of functions desired in a component to determine if such a
component can be created.
[Figure: a register with clock and reset inputs and an incrementer are replaced by
a counter.]
Figure 31. Merging Unsimilar Units
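The mergeability query can be sketched with ordinary set operations. The dictionary standing in for the component database, the component names, and the function names ("load", "shift", "increment") are hypothetical; the real database interface is not described at this level of detail.

```python
def find_merged_component(database, functions_a, functions_b):
    """Return the first database component covering both function sets, or None.

    database maps a component name to the set of functions it can perform.
    """
    wanted = set(functions_a) | set(functions_b)
    for name, provided in database.items():
        if wanted <= set(provided):
            return name
    return None


# Hypothetical database contents, for illustration only.
component_db = {
    "register": {"load"},
    "shift_register": {"load", "shift"},
    "counter": {"load", "increment"},
}
counter = find_merged_component(component_db, {"load"}, {"increment"})
shift_register = find_merged_component(component_db, {"load"}, {"shift"})
no_match = find_merged_component(component_db, {"load"}, {"multiply"})
```

The same query covers both kinds of merging: a register and a shifter resolve to a shift register, a register and an incrementer resolve to a counter, and a combination the database cannot generate returns no component.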
6.2.6. Style Change 
The optimizer can query the component database to request a component that
performs the same function(s) but is faster or has a smaller area. The database
returns a list of components from which the optimizer can select one based on the
time and area requirements. Part of the database query can include a layout style
request. For components having a bit-sliceable architecture, such as ALUs, the
optimizer will request a bit-sliced layout style. By placing the component in the
bit-sliced datapath, routing area can be reduced. As mentioned earlier, bit-slices
usually tend to be faster and smaller than their equivalent random-logic
implementation. Transistor sizes in the designs produced by layout module generators are
fixed, however. In some cases larger transistor sizes may be required for drive
capability and speed. Buffers can usually be inserted to add greater drive capability.
For greater speed, however, larger transistor sizes for gates in a design typically
decrease the delay. Thus producing a random-logic design with larger transistor
sizes than those used in the bit-sliced cell may result in a faster design. As
standard cells have fixed transistor sizes, a transistor sizing program, such as
[WuVG90], can be combined with a custom-layout generator to produce the layout
for the component. Estimators that calculate delays for bit-slice logic, based on a
single slice, and for random-logic, based on gate type and transistor sizes, assist the
microarchitecture optimizer in determining which design will be faster.
The database searches through its list of different architectural styles for the
component to select one that it estimates will come closest to meeting the specified
constraints [ChGa90]. For each style, the database maintains a range of delays and
area that can be obtained. Then, depending on the layout style, the database can
call tools such as logic optimizers, transistor sizing tools, etc., to generate the low
level design in terms of gates or a layout. In this manner, the microarchitecture
optimizer is freed from the low level details and is not concerned with which low
level optimization tools should be called.
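One possible reading of "closest to meeting the specified constraints" is sketched below. The selection policy (smallest area among the styles that meet the delay constraint, otherwise the style with the smallest delay violation), the Style record, and all numeric values are assumptions made for illustration; the database may weigh delay and area differently.

```python
from typing import NamedTuple


class Style(NamedTuple):
    name: str
    delay: float  # estimated delay
    area: float   # estimated area


def select_style(styles, max_delay, max_area=None):
    """Pick a style under a delay constraint and an optional area constraint."""
    feasible = [s for s in styles
                if s.delay <= max_delay and (max_area is None or s.area <= max_area)]
    if feasible:
        # Everything here meets the constraints, so prefer the cheapest.
        return min(feasible, key=lambda s: (s.area, s.delay))
    # Nothing meets the constraints: take the smallest delay violation.
    return min(styles, key=lambda s: (max(0.0, s.delay - max_delay), s.area))


# Made-up estimates for three hypothetical styles of one component.
styles = [
    Style("bit_sliced", 6.0, 10.0),
    Style("random_logic_sized", 4.0, 14.0),
    Style("small_slow", 9.0, 8.0),
]
tight = select_style(styles, max_delay=5.0)
relaxed = select_style(styles, max_delay=7.0)
impossible = select_style(styles, max_delay=3.0)
```

With a tight delay constraint the sized random-logic style is chosen despite its larger area, and with a relaxed constraint the smaller bit-sliced style is preferred.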
6.2.7. Duplicate Logic
Duplication of components is a technique designed to improve the speed of a
path at the cost of additional area. It is the reverse of factorization. Figure 32 
shows the duplication of the two-input multiplexor in order to reduce the delay 
along a critical path. 
[Figure: a design containing a MUX2 and a MUX3, shown before and after the MUX2
is duplicated.]
Figure 32. Component Duplication
6.2.8. Merge Multiple Components and Optimize
This technique combines components performing different functions into a
single unit and then applies logic optimization. Optimization of this type can be
particularly effective when some of the inputs to the components are constants. The
optimization of the constants will propagate through the logic. Thus in cases where
the microarchitecture optimizer believes constant propagation in the logic will
occur, it will merge even bit-sliceable components, optimize them, and treat them as
random logic. Constant propagation is obvious when a number of the component's
inputs are constants. Components connected to the output of such a component
should also be combined into the random logic since the constants can usually be
propagated through several levels of microarchitecture components.
6.2.9. Extraction of Common Subexpressions
Designs can often have the same logic duplicated in different parts of the
design. Local transformations will not detect this. Therefore, global analysis is
required to find and extract such common subexpressions. An example of common
subexpression extraction is shown in Figure 33.
[Figure: two functions F1 and F2 that both use inputs A and B are rebuilt so that
the logic on A and B is shared.]
Figure 33. Common Subexpression Extraction
Common subexpression elimination is performed for each component type. For
example, it will be performed separately for multiplexors and adders. The algorithm
consists of three steps: 1) for each component which is of the selected component
type, a set N is generated that contains all the inputs to that component, 2) a set L
of possible subexpressions is generated, 3) a common subexpression is selected and
extracted. This process is repeated until no more subexpressions are present.
As an example, consider Figure 34. In this example, the component type is
adder. For each adder (components $c_1$, $c_2$, $c_3$, and $c_4$), the set $N_i$ is generated. Every
net is assigned a unique id number (for example, nets n12 and n23 in Figure 34).
Two components that have an input connected to the same net have that net id
number in common. Each set N is sorted by the net id numbers. From Figure 34,
the N sets generated are: $N_1 = \{n9, n12, n23\}$, $N_2 = \{n12, n23\}$,
$N_3 = \{n9, n12, n16, n23\}$, and $N_4 = \{n9, n15, n16\}$.
[Figure: four adders C1 through C4 with outputs S1 through S4, fed by nets n9,
n12, n15, n16, and n23.]
Figure 34. Common Subexpression Elimination
The second step is to identify possible common subexpressions. A common
subexpression is present if the following expression is true: $|N_i \cap N_j| > 1$. A set of
common subexpressions, $S_{ij}$, is generated for each $N_i \cap N_j$. Each set $S_{ij}$ must have at
least two elements or be the null set, as only two or more inputs can be extracted.
From the four N sets generated above, the sets $S_{ij}$ are as follows:
$S_{12} = \{n12, n23\}$, $S_{13} = \{n9, n12, n23\}$, $S_{14} = \emptyset$, $S_{23} = \{n12, n23\}$,
$S_{24} = \emptyset$, and $S_{34} = \{n9, n16\}$. A set L is created from the S sets. It contains no
duplicated entries but instead keeps a count of the number of occurrences for each
subexpression. Thus the set L is {{n12,n23}:2, {n9,n12,n23}:1, {n9,n16}:1}.
From L a subexpression for extraction is chosen using the following criteria: a)
largest number of occurrences, and b) smallest subexpression. In the case of Figure 34,
the set in L with the largest number of occurrences is {n12,n23}. Figure 35(a) shows
the new design after the common subexpression is extracted. The set {n12,n23} is
removed from L and any set containing {n12,n23} as a subset (for example the set
{n9, n12, n23}) is replaced with the net id for the extracted subexpression (in this
case, n30 as shown in Figure 35). The new set L is {{n9,n30}:1, {n9,n16}:1}. Since
both sets have the same number of occurrences and the same size, the first set
{n9,n30} is selected. Figure 35(b) shows the design after the extraction of this set.
The new set L for the design of Figure 35(b) is empty. Therefore there are no more
subexpressions to extract.
[Figure: (a) the design after {n12, n23} is extracted into a new adder driving net
n30; (b) the design after {n9, n30} is also extracted.]
Figure 35. Common Subexpression Elimination Example
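The three steps can be sketched in Python using the input sets of Figure 34. The function names are illustrative, and two details are assumptions: ties are broken by taking the first candidate found, as in the example, and extracted nets are numbered upward from n30 (only n30 is named in the text).

```python
from collections import Counter
from itertools import combinations


def candidate_subexpressions(input_sets):
    """Build L: count each shared input set that has two or more nets."""
    counts = Counter()
    for a, b in combinations(sorted(input_sets), 2):
        common = frozenset(input_sets[a] & input_sets[b])
        if len(common) > 1:
            counts[common] += 1
    return counts


def select_subexpression(counts):
    """Most occurrences first, then smallest; ties keep the first one found."""
    if not counts:
        return None
    return min(counts, key=lambda sub: (-counts[sub], len(sub)))


def extract_all(input_sets, first_new_id=30):
    """Repeat selection and extraction until L is empty."""
    sets = {name: set(nets) for name, nets in input_sets.items()}
    extracted = []
    next_id = first_new_id
    while True:
        sub = select_subexpression(candidate_subexpressions(sets))
        if sub is None:
            break
        new_net = f"n{next_id}"
        next_id += 1
        extracted.append((sub, new_net))
        for name, nets in sets.items():
            if sub <= nets:
                # Replace the extracted inputs by the net of the shared component.
                sets[name] = (nets - sub) | {new_net}
    return extracted


adders = {
    "c1": {"n9", "n12", "n23"},
    "c2": {"n12", "n23"},
    "c3": {"n9", "n12", "n16", "n23"},
    "c4": {"n9", "n15", "n16"},
}
initial_L = candidate_subexpressions(adders)
steps = extract_all(adders)
```

Run on this data, the sketch first extracts {n12, n23} as n30 and then {n9, n30}, after which no pair of adders shares two or more nets.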
6.2.10. Addition of Buffers 
Some components that drive large loads may require the addition of buffers at 
their outputs. This can reduce the delay by providing greater drive power. 
Methods of doing this have been discussed in [GuPa90] [SiSV90]. One solution is to 
partition the load by constructing a fanout tree from buffers. This tree should be
constructed in a manner that does not violate the time constraints yet minimizes
the amount of area increase.
[GuPa90] also mentions that in standard cell designs, components with higher 
drive capacity can be selected. Alternatively, some duplication of logic can be used 
to reduce fanout. In our case, the component database contains tools for transistor 
sizing and can generate a layout using a custom layout generator. This allows the 
design to be more finely tuned than when using standard cells. With the custom
layout capability, component transistor sizes are not fixed at discrete intervals.
Rather, transistor sizes can be selected on a continuous basis in order to meet delay
requirements.
6.3. Strategies for Microarchitecture Optimization
Having examined types of optimization techniques, we now describe an algo-
rithm for applying them. A block diagram of the optimization process is shown in 
Figure 36. It is divided into three parts: a blackboard, a control section, and a set 
of optimization procedures. The blackboard contains the design netlist, statistics 
for delay and area, a set of user constraints, a set of critical paths, and a set of 
non-critical paths. Critical and non-critical path sets are determined by the timing 
analyzer in the control section. The controller also selects which optimization pro-
cedure to use. Each optimization procedure corresponds to one phase of the optimi-
zation process. The microarchitecture optimization is carried out in four phases: 
general design improvement, random logic grouping, timing optimization and area 
optimization. 
Phase 1 of the algorithm, general design improvement, attempts to reduce the
number of components and to prepare the design for timing optimization should
that be necessary. For example, to be able to refactor multiplexors along the critical
path, all multiplexors that can be merged should be merged into a single
multiplexor. Then the timing optimizer can decide how to refactor the single
multiplexor. Thus optimizations in this phase set up techniques that will be performed
later or employ techniques that improve both the time and area of a design.
[Figure: block diagram relating the user constraints, the optimization controller,
and the optimization procedures, including the area optimizer.]
Figure 36. Overview of Microarchitecture Optimization
Phase 2 groups random logic components for logic optimization.
Microarchitecture optimizations are not performed on random logic gates. Instead, they are
passed to the database which has tools for restructuring the logic to meet a set of
constraints passed by the microarchitecture optimizer. Thus Phase 2 prepares the
design for Phases 3 and 4 by reducing the number of components that the
microarchitecture optimizer must deal with. In doing so, it groups components that will be
implemented using random logic gates rather than a bit-sliced layout.
Phase 3 applies time reduction techniques. The microarchitecture design is
originally tuned for area by the technology mapper. Therefore, in this phase, the
microarchitecture optimizer operates on critical paths, making necessary time for
area tradeoffs.
Phase 4 works on non-critical paths, attempting to reduce the design's area.
During area optimization, some microarchitecture components may be merged with
others for logic optimization. These types of optimizations must be performed after
timing optimization because once components are merged, the microarchitecture
optimizer cannot recognize the original component functionality. This information
is necessary for some of the timing optimization techniques.
Procedure Optimize_Microarchitecture (microarchitecture_design)
Begin
    General_Design_Improvements(microarchitecture_design)
    Random_Logic_Grouping(microarchitecture_design)
    Identify critical path set
    While (Critical path set is not empty)
    Begin
        critical_path = select_critical_path(critical_path_set)
        Timing_Optimization(critical_path)
        Remove critical_path from critical_path_set
    End
    Identify non critical path set
    While (Non critical path set is not empty)
    Begin
        non_critical_path = select_non_critical_path(non_critical_path_set)
        Area_Optimization(non_critical_path)
        Remove non_critical_path from non_critical_path_set
    End
End
6.3.1. General Design Improvements
To improve the overall design, Phase 1 proceeds as follows:
Procedure General_Design_Improvements (microarchitecture_design)
Begin
    Merge Similar Units (from inputs to outputs of the design)
    Merge Unsimilar Units (from inputs to outputs of the design)
    Apply Minimization Rules (from inputs to outputs of the design)
    Perform Common Subexpression Elimination
End
Phase 1 begins with components of similar types being merged. As mentioned
earlier, this is necessary for refactoring and common subexpression recognition.
Further, it reduces the number of components and hence makes minimization rules
easier to apply. For example, the optimization of Figure 23(a) would be
more difficult to discover if the multiplexor containing the common signal A were
factored into two multiplexors, each containing the signal A.
Next, unsimilar components are merged using a set of rules. These rules also
reduce the number of components in the design. Once all merging is complete, a set
of minimization rules can be applied to clean up redundant and unnecessary logic in
the microarchitecture design. Up to this point, all optimizations reduce the number
of microarchitecture components in the design. The next step, common
subexpression extraction, increases the number of microarchitecture components but reduces
the actual amount of logic required to implement them. It allows hardware that is
redundant in a number of components to be shared. Common subexpression
elimination is not performed on random logic as this can be performed by logic
optimization tools.
6.3.2. Random Logic Grouping 
Phase 2 collects certain types of components to be optimized together as ran-
dom logic. This is performed as follows: 
Procedure Random_Logic_Grouping (microarchitecture_design)
Begin
    Group Logic Gates into random logic components
    Group Non-Bit-Sliceable Components into random logic components
    Group Components with Constant Inputs into random logic components
End
Phase 2 groups components that will then be optimized as a single random
logic component. During this phase, three types of components can be grouped: 1)
gates, 2) components for which bit-slicing is difficult, and 3) bit-sliceable
components that have constants as inputs. Type 1 components, gates such as NAND,
AND, and XOR, each have a lower-level technology specific implementation. For
example, at the microarchitecture level, one could have a 12-input AND gate. Of
course, a gate with this many inputs is usually not physically implementable as a
single gate. Thus the technology-specific design is constructed from smaller gates
that are available in the specified technology. Phase 2 groups all random logic gates
at the microarchitecture level that are connected together and forms a single
component of type "random logic", as shown in Figure 37.
Type 2 components, for which bit-slicing is difficult, such as a decoder, will
also be grouped with the random logic gates that they are connected with. Finally,
bit-sliceable components with constant inputs will be added to the random logic set.
Components connected to the outputs of the type 3 components will also be
grouped into the random logic since during logic optimization, the constants will
often propagate through.
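Grouping connected random-logic components amounts to finding connected components of a graph restricted to the random-logic nodes. The sketch below is one way to do this; the netlist representation (a list of component names plus a list of connected pairs), the function name, and the example component names are all assumptions.

```python
def group_random_logic(components, connections, random_logic):
    """Return sets of random-logic components that are connected to each other.

    components: ordered list of component names.
    connections: iterable of (a, b) pairs of connected components.
    random_logic: names to be treated as random logic (gates, components that
    are hard to bit-slice, and components with constant inputs).
    """
    adjacency = {name: set() for name in components if name in random_logic}
    for a, b in connections:
        # Only links between two random-logic components join a group.
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    groups, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


# Hypothetical netlist: a register separates two clusters of random logic.
names = ["and1", "or1", "reg1", "xor1", "dec1"]
wires = [("and1", "or1"), ("or1", "reg1"), ("reg1", "xor1"), ("xor1", "dec1")]
groups = group_random_logic(names, wires, {"and1", "or1", "xor1", "dec1"})
```

Each returned set would become one component of type "random logic"; the register is left untouched because it is not in the random-logic set.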
For each microarchitecture component, the database has a file containing a set 
of boolean equations that describe the behavior of the component. As mentioned 
earlier, the equations can represent sequential logic as well as combinational logic. 
[Figure: logic gates connected between registers are grouped into single
components of type "random logic".]
Figure 37. Random Logic Grouping
The microarchitecture optimizer can request that the database create a new
component by merging two components' equation files. Logic optimization on this new
component can then be performed by tools in the database.
6.3.3. Timing Optimization
Timing analysis is performed as the first stage of Phase 3. Delays and setup
times for each component can be found by querying the database. The timing
analyzer calculates four types of worst delay: 1) input pins to registers, 2) register to
register, 3) registers to output pins, and 4) input pins to output pins. The worst
delays at the design's output pins are compared with the required delays that are
entered by the user. Output pins with negative slacks do not meet the delay
constraints. Slack is computed as:
slack = required delay - actual delay
Required delays are also calculated at each register data input. The required
delay is calculated based on the required maximum clock width, which is entered by
the user:
required delay = max clock width - setup time
Actual delays are calculated based on the worst delay to the register's input,
the setup time for that input, and the worst delay to the clock input of the register:
actual delay = worst delay to register input + setup time -
               worst delay to register clock input
The slack is then computed from the actual and required delay values. A slack
value is found for every component's output pin in a similar manner by subtracting
the actual delay from the required delay.
After timing analysis, the goal of timing optimization is to make sure that no 
component's outputs have a negative slack value. Any component having such an 
output is said to be on a critical path. Ideally these negative slack values are raised 
to zero, with any value over zero _representing over optimization (assuming 
area/time tradeoffs must be made). 
Timing optimization is performed for each critical path. The worst critical 
path (i.e., the one having the largest negative slack) is processed first. Timing
optimization along the critical path proceeds as follows: 
Procedure Timing_Optimization (Critical Path) 
Begin 
Swap Equivalent Signals 
Factor 
New Component Style Selection 
Merge Multiple Components for Optimization 
End 
Phase 3 operates on microarchitecture components along the critical paths. It 
uses factoring, signal swapping, new component selection, and merging of
components in order to reduce delay. The timing optimization phase ends as soon as
there are no critical paths.
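The control flow of Phase 3 (worst critical path first, steps tried in order, stop when no critical path remains) can be sketched as follows. This is a toy model under stated assumptions, not MILO's implementation: paths are reduced to a name and a slack value, and each optimization step is a hypothetical function that simply returns an improved slack.

```python
def optimize_timing(slacks, steps):
    """Process critical paths worst-first, trying each step in order.

    slacks: dict mapping a path name to its slack in ns (negative = critical).
    steps:  ordered list of functions taking a slack and returning the new slack.
    Returns the updated slacks and the set of paths that no step could fix.
    """
    slacks = dict(slacks)
    unresolved = set()
    while True:
        critical = [p for p, s in slacks.items() if s < 0 and p not in unresolved]
        if not critical:
            return slacks, unresolved
        worst = min(critical, key=lambda p: slacks[p])  # largest negative slack
        for step in steps:
            slacks[worst] = step(slacks[worst])
            if slacks[worst] >= 0:
                break  # path is no longer critical
        else:
            unresolved.add(worst)  # every step was tried without success


# Toy stand-ins for the four steps; the delay gains are invented for illustration.
steps = [
    lambda s: s + 0.5,  # swap equivalent signals
    lambda s: s + 2.0,  # factor
    lambda s: s + 4.0,  # new component style selection
    lambda s: s + 6.0,  # merge multiple components
]
result, unresolved = optimize_timing({"p1": -5.0, "p2": -1.0, "p3": 3.0}, steps)
print(result, unresolved)
```

The ordering of `steps` mirrors the procedure: the cheapest transformation in area is tried first, and the more expensive ones are reached only if the path is still critical.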
Signal swapping is performed first since there is no area increase associated
with it. Usually, however, improvements in delay from this type of optimization are
small. Factoring is employed in the second step to produce shorter paths for critical 
signals. This technique usually increases the area only slightly. The set of required 
delays is calculated for each input to the output of the factorable component. 
These delays are passed to the factoring routine which then attempts to factor in a 
manner that will meet those delays. 
In the third step of Phase 3, the optimizer selects new component styles by
querying the database to find out what components are available with smaller 
delays. The component that comes closest to satisfying the required delays at each 
output is selected. Thus the optimizer tries to set each of the slacks at the output 
pins to zero. 
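One possible reading of "closest to satisfying the required delays" is sketched below for a single output. The candidate tuples, their names, and the tie-breaking rule (prefer a candidate that meets the constraint, then the slack nearest zero) are assumptions of this sketch; the database query itself is not modeled.

```python
def select_component(candidates, required_delay):
    """Pick the candidate whose output slack is closest to zero.

    candidates: list of (name, delay_ns, area) tuples, as a database query
                might return them (hypothetical format).
    Candidates that meet the required delay are preferred over those that do not.
    """
    def key(candidate):
        slack = required_delay - candidate[1]
        return (slack < 0, abs(slack))
    return min(candidates, key=key)


# Invented component styles for illustration only.
candidates = [("ripple", 12.0, 100), ("lookahead", 8.0, 160), ("fast", 5.0, 240)]
print(select_component(candidates, required_delay=9.0))
```

Choosing the slack nearest zero rather than the fastest component avoids paying area for speed that the constraint does not require.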
If the previous three steps fail to fix all critical paths, the microarchitecture
optimizer attempts to combine bit-sliceable components into a random
logic component and query the database to apply logic optimization. In addition, 
the database can use a transistor sizing program to size the transistors in a fashion 
that will meet the time constraints. By using larger transistor sizes than those used 
in the bit-sliced approach (where transistor sizes are fixed), it may be possible to 
produce a faster component. If indeed the database returns a faster component, the 
microarchitecture optimizer will switch the layout style to a custom layout. Of
course, the random logic approach combined with the large transistor sizes results
in larger area.
6.3.4. Area Optimization 
Finally, Phase 4 performs area optimizations along non-critical paths. It 
mainly employs new component selection and component merging. Components
that have outputs with positive slacks are examined for possible area/time tradeoffs. 
Area optimization operates as follows: 
Procedure Area_Optimization (Non-Critical Path) 
Begin 
New Component Style Selection 
Merge Multiple Comps for Optimization 
End 
New component selection includes choosing a bit-sliced layout style for com-
ponents where doing so results in an area reduction. Some components, such as 
multiplexors, may need to be factored in order to use a layout module generator. 
For example, only 4-to-1 and 2-to-1 multiplexors may be available. An 8-to-1
multiplexor would then need to be factored. This can be achieved by the algorithm
presented earlier. 
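As a concrete illustration of the multiplexor case, an 8-to-1 multiplexor can be built from two 4-to-1 multiplexors and one 2-to-1 multiplexor. The sketch below is a generic functional decomposition written for this example and is not the factoring algorithm referred to above; the function names are hypothetical.

```python
def mux2(sel, a, b):
    """2-to-1 multiplexor: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def mux4(s1, s0, d):
    """4-to-1 multiplexor over the four data inputs in d."""
    return d[(s1 << 1) | s0]


def mux8_factored(s2, s1, s0, d):
    """8-to-1 multiplexor factored into two 4-to-1 stages and one 2-to-1 stage."""
    low = mux4(s1, s0, d[0:4])    # selects among inputs 0..3
    high = mux4(s1, s0, d[4:8])   # selects among inputs 4..7
    return mux2(s2, low, high)    # the top select bit picks the half


data = list(range(10, 18))
print([mux8_factored((i >> 2) & 1, (i >> 1) & 1, i & 1, data) for i in range(8)])
```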
In some cases, a layout module generator exists but contains more functions 
than are required. For example, consider an ALU. The bit-slice of the module
generator may perform addition, subtraction, eight logical functions (e.g., NAND,
AND), and a set of comparison functions (e.g., equal, greater than, zero). If only the
addition operation, the comparison functions, and a logical AND are required, the
layout module generator performs more functions than are necessary. Thus a
random logic implementation will probably produce the smallest area design. If there is
already a random logic component connected to the ALU, the ALU can be merged
into the random logic component and the logic reoptimized.
An alternative approach to generating the random logic design is to separate
the groups of functions that need to be performed. For example, Figure 38 shows
that three groups of functionality can be generated for our example of the ALU: an
arithmetic unit (adder), a comparator, and a logical AND. A multiplexor is used to
choose the addition function or the logical AND. In this case all components can be
implemented using the layout module generators and placed in the bit-sliced
datapath during layout.
Figure 38. Option for performing ALU functions
6.4. Experimental Results
This section presents experiments performed using MILO. A number of design 
examples were written in VHDL, then run through VSS to generate the initial 
microarchitecture design. The designs were then run through MILO with the fol-
lowing four strategies: 
(1) Optimize the design for area and produce an underlying gate-level design for
each microarchitecture component. 
(2) Optimize the design for time and produce an underlying gate-level design for 
each microarchitecture component. 
(3) Optimize the design for area and use the module generators for all bit-sliceable 
microarchitecture components, gate-level designs for all other components. 
(4) Optimize the design for time and use the module generators for all bit-sliceable
microarchitecture components, gate-level designs for all other components.
To get an idea of how good these optimizations were compared to a traditional 
straight logic optimization, the output design from VSS consisting of
microarchitecture components was completely expanded into a flat gate-level design. This design
was run through MISII and then through a transistor sizing program. The logic 
optimization was also performed for both area and time. Thus each example was 
run six different ways. 
Five different benchmarks were run through MILO: Rockwell Counter, 
Armstrong Counter, and three different versions of DRACO: Draco2, Draco3, and 
Draco Schematic. A short description of each of these designs and their results are 
shown in the following sections. In the final section a comparison of all of the
optimizations performed by MILO and MISII is made and conclusions are drawn. 
6.4.1. Benchmark Experiments
6.4.1.1. Rockwell Counter 
The Rockwell Counter benchmark was supplied by Rockwell International and 
is a design used in telephone switching networks. It has four inputs as shown in the 
block diagram of Figure 39: 1) CLK, the system clock, 2) RST, which performs a 
synchronous reset of the counter, 3) DTI, a 12-bit data input, and 4) LDE, a con-
trol line which loads the counter with input DTI. It has only one output, DTO, 
which represents the value of the count. 
Figure 39. Block Diagram of the Rockwell Counter
The counter is a divide by 3328 counter that operates as follows: 
(1) The counter has a start count of 0 and a terminal count of 3327. 
(2) The counter increases by 208 on each clock edge. If the count is greater than
3327, the counter will start at the previous start count plus 26 (e.g., the first
time: 0 + 26). If the previous start count plus 26 is greater than 207, then the
count will start at the previous start count plus 1.
(3) There are 26 sequences (i.e., 26 start counts) before the counter reaches 3327
and wraps back to 0.
(4) The counter has an active high load enable which synchronously loads the
counter. The state machine must adjust to the new state so as to keep the
same counting sequence.
Table 4 shows the optimization results for the Rockwell Counter when optimizing
for time, while Table 5 shows the results when optimizing for area. In this example,
the design with the fewest transistors is achieved using the module generators
during microarchitecture optimization. The areas of the designs employing only
gates are roughly equal. When comparing time results, MILO's optimization with
gates produced the smallest delay, followed by MILO's optimization using the
module generators. The optimization by MISII produced the largest delay.
Table 6 displays the tradeoff of time for area when comparing the time optimized
designs with the area optimized designs. The change in time and area is shown
as a percentage. For example, MILO's optimization using only gates achieves a 37%
improvement in time at a cost of a 55% increase in area when comparing the time
optimized design with the area optimized design. This table illustrates that fairly
substantial reductions in time can be achieved at a cost of additional area. Finally,
Figure 40 compares the three optimization approaches (MILO with gates, MILO
with module generators and gates, and MISII) graphically. The curve represents
the potential to achieve area/time tradeoffs between the best area optimized design
and the best time optimized design, although this ability has not actually been
tested.
Optimization   Optimization        time     area
Tool           Style               (ns)     (# of transistors)
MILO           Gates               206.0    1800
MILO           Module Generators   233.5    1484
MISII          Gates               222.5    1344
Table 4. Time Optimization Results for the Rockwell Counter
Optimization   Optimization        time     area
Tool           Style               (ns)     (# of transistors)
MILO           Gates               327.0    1158
MILO           Module Generators   337.5    1056
MISII          Gates               413.0    1170
Table 5. Area Optimization Results for the Rockwell Counter
Optimization   Optimization        time difference   area difference
Tool           Style               (%)               (%)
MILO           Gates               -37.0             +55.4
MILO           Module Generators   -30.8             +40.5
MISII          Gates               -46.4             +14.8
Table 6. Time/Area Tradeoffs for Rockwell Counter
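The percentages in Table 6 appear to be the change of the time optimized design relative to the area optimized design. Under that assumption, the MILO rows can be reproduced from the values in Tables 4 and 5 with the short sketch below; the function and variable names are chosen for this example only.

```python
def percent_change(time_optimized, area_optimized):
    """Change of the time optimized value relative to the area optimized value, in %."""
    return round((time_optimized - area_optimized) / area_optimized * 100.0, 1)


# (delay in ns, number of transistors) taken from Table 4 and Table 5.
time_opt = {"MILO gates": (206.0, 1800), "MILO module generators": (233.5, 1484)}
area_opt = {"MILO gates": (327.0, 1158), "MILO module generators": (337.5, 1056)}

tradeoffs = {
    name: (
        percent_change(time_opt[name][0], area_opt[name][0]),  # time difference
        percent_change(time_opt[name][1], area_opt[name][1]),  # area difference
    )
    for name in time_opt
}
print(tradeoffs)
```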
Figure 40. Three Optimization Approaches for the Rockwell Counter [plot of time (ns) versus area (transistors)]
6.4.1.2. Armstrong Counter
The Armstrong Counter is a benchmark adapted from [Arms89]. As shown in
the block diagram of Figure 41, it has four inputs: 1) CLK, the system clock, 2)
CON, a two-bit input that selects which function the counter will perform, 3)
DATA, a four-bit input that determines the end count for the counter, and 4)
STRB, an asynchronous line that loads the two values DATA and CON into
registers. It has a single four-bit output, CNT_OUT.
Figure 41. Block Diagram of Armstrong Counter
The behavior of the Armstrong Counter is as follows: 
(1) On the rising edge of STRB, the values of DATA and CON will be loaded. 
(2) The counter can perform four functions as specified by the value of CON: clear 
the counter, load a limit register, count up to a limit, or count down to a limit. 
Table 7 shows the optimization results for the Armstrong Counter when optimizing
for time, while Table 8 shows the results when optimizing for area. In this
example, MILO's optimization with module generators produced the smallest delay,
followed by MILO's optimization using the gates. The optimization by MISII
produced the largest delay. When comparing area results, the design with the fewest
transistors is again achieved using the module generators during microarchitecture
optimization. The area of the MISII design is smaller than that produced by MILO
using gates.
Optimization   Optimization        time     area
Tool           Style               (ns)     (# of transistors)
MILO           Gates               28.0     486
MILO           Module Generators   20.0     393
MISII          Gates               43.5     484
Table 7. Time Optimization Results for the Armstrong Counter
Optimization   Optimization        time     area
Tool           Style               (ns)     (# of transistors)
MILO           Gates               38.0     486
MILO           Module Generators   20.0     395
MISII          Gates               74.5     460
Table 8. Area Optimization Results for the Armstrong Counter
Table 9 displays the tradeoff of time for area when comparing the time optimized
designs with the area optimized designs. The change in time and area is shown
as a percentage. Optimization by MILO using only gates shows a 26% reduction in
delay with no increase in transistor count. This indicates that the improvement in
time was mainly due to changes in transistor sizing. Finally, Figure 42 compares
the three optimization approaches (MILO with gates, MILO with module generators
and gates, and MISII) graphically. Again, the curve represents the potential to
achieve area/time tradeoffs between the best area optimized design and the best
time optimized design. For the Armstrong Counter, MISII has the largest distance
between the area and time optimized designs.
Optimization   Optimization        time difference   area difference
Tool           Style               (%)               (%)
MILO           Gates               -26.3             +0.0
MILO           Module Generators   -0.0              -0.5
MISII          Gates               -41.6             +5.2
Table 9. Area/Time Tradeoffs for the Armstrong Counter
Figure 42. Three Optimization Approaches for Armstrong Counter [plot of time (ns) versus area (transistors)]
6.4.1.3. DRACO 
DRACO is another benchmark obtained from Rockwell International and is the 
most complex of all our benchmarks. A block diagram of DRACO is shown in Figure 
43, consisting of nine inputs and one output. DRACO is primarily intended to
interface 16 I/O ports to a microprocessor's 8-bit multiplexed address/data bus and
control signals. DRACO was developed by Rockwell as an ASIC chip.
Figure 43. Block Diagram of DRACO
Three VHDL descriptions of DRACO were written. Each description
represented DRACO at a different level of abstraction. "Draco Schematic" was
derived from the logic schematic provided by Rockwell International. "Draco2" and
"Draco3" were more abstract versions and each used a different style of modeling in
VHDL. Thus the designs produced by VSS from each of these descriptions are
quite different.
Table 10 through Table 15 demonstrate optimization results for time and area
on the DRACO examples. For examples "Draco2" and "Draco3", MILO's
optimizations proved to be the best in terms of delay. Optimization by MILO using module
generators resulted in the best designs in terms of area. Table 16 through Table 18
show the tradeoff of time for area when comparing the time optimized and area
optimized designs. Figure 44 through Figure 46 compare the three optimization
approaches.
Optimization   Optimization        time     area
Tool           Style               (ns)     (# of transistors)
MILO           Gates               194.5    5868
MILO           Module Generators   109.0    3644
MISII          Gates               283.5    5800
Table 10. Time Optimization Results for Draco2
Optimization   Optimization        time     area
Tool           Style               (ns)     (# of transistors)
MILO           Gates               226.5    5152
MILO           Module Generators   117.5    3390
MISII          Gates               342.0    4668
Table 11. Area Optimization Results for Draco2
Optimization   Optimization        time     area
Tool           Style               (ns)     (# of transistors)
MILO           Gates               101.5    5544
MILO           Module Generators   135.0    3968
MISII          Gates               138.5    4202
Table 12. Time Optimization Results for Draco3
Optimization   Optimization        time     area
Tool           Style               (ns)     (# of transistors)
MILO           Gates               115.5    5298
MILO           Module Generators   176.5    3492
MISII          Gates               174.5    4206
Table 13. Area Optimization Results for Draco3
Optimization   Optimization        time     area
Tool           Style               (ns)     (# of transistors)
MILO           Gates               205.0    5658
MILO           Module Generators   135.5    3026
MISII          Gates               149.0    4216
Table 14. Time Optimization Results for Draco Schematic
Optimization   Optimization        time     area
Tool           Style               (ns)     (# of transistors)
MILO           Gates               206.0    4486
MILO           Module Generators   136.5    3018
MISII          Gates               258.0    3762
Table 15. Area Optimization Results for Draco Schematic
Optimization   Optimization        time difference   area difference
Tool           Style               (%)               (%)
MILO           Gates               -14.2             +13.9
MILO           Module Generators   -7.2              +7.5
MISII          Gates               -17.1             +24.1
Table 16. Time/Area Tradeoffs for Draco2
Optimization   Optimization        time difference   area difference
Tool           Style               (%)               (%)
MILO           Gates               -12.1             +4.6
MILO           Module Generators   -23.5             +13.6
MISII          Gates               -20.6             -0.1
Table 17. Time/Area Tradeoffs for Draco3
Optimization   Optimization        time difference   area difference
Tool           Style               (%)               (%)
MILO           Gates               -0.5              +26.1
MILO           Module Generators   -0.7              +0.3
MISII          Gates               -42.2             +12.1
Table 18. Time/Area Tradeoffs for Draco Schematic
In addition to comparisons of transistor counts, two layouts were generated by 
SLAM for Draco2 designs as an additional comparison. Figure 47 shows the layout 
for Draco2 that was produced from the design optimized by MILO using the module 
generators. The layout consists of two sections: the left hand portion is a custom 
layout consisting of random logic. The right hand portion of the layout is the bit-
sliced datapath produced by the module generators. Figure 48 shows the layout for 
Draco2 that was produced from the design optimized by MISII. It consists entirely
of a custom layout for random logic. As would be expected, the design with the
module generators is smaller than the random logic design. The total layout area of
the random logic design is 14,668,600 square micrometers compared with only
8,592,672 square micrometers in the MILO module generator design. This
represents an area difference of 70%.
Figure 44. Three Optimization Approaches for Draco2 [plot of time (ns) versus area (transistors)]
Figure 45. Comparison of Three Optimization Approaches for Draco3 [plot of time (ns) versus area (transistors)]
Figure 46. Three Optimization Approaches for Draco Schematic [plot of time (ns) versus area (transistors)]
Figure 47. Layout of Module Generator Design for Draco2
Figure 48. Layout of MISII Design for Draco2
6.4.2. Analysis 
Table 19 compares optimization by MILO using only gates and straight logic
optimization by MISII. MILO when using only gates produces faster designs in four
of the five cases, ranging from 7% faster to 35% faster. In one of the five cases
MILO is slower by 37%. This demonstrates that MILO can produce faster designs
on average.
Benchmark         MILO   MISII
                  (%)    (%)
Draco2            69     100
Draco3            73     100
Draco Schematic   138    100
Armstrong Cntr.   64     100
Rockwell Cntr.    93     100
Table 19. Comparison of MILO and MISII Timing Optimization
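The MILO column of Table 19 is consistent with expressing each MILO gate-level delay as a percentage of the corresponding MISII delay from the time optimization tables (Tables 4, 7, 10, 12, and 14). The sketch below recomputes the column under that assumption; the dictionary names are chosen for this example.

```python
# Time optimized delays in ns, taken from Tables 4, 7, 10, 12, and 14.
milo_gates_ns = {
    "Draco2": 194.5,
    "Draco3": 101.5,
    "Draco Schematic": 205.0,
    "Armstrong Cntr.": 28.0,
    "Rockwell Cntr.": 206.0,
}
misii_ns = {
    "Draco2": 283.5,
    "Draco3": 138.5,
    "Draco Schematic": 149.0,
    "Armstrong Cntr.": 43.5,
    "Rockwell Cntr.": 222.5,
}

# MILO delay normalized so that the MISII delay is 100%.
normalized = {
    name: round(milo_gates_ns[name] / misii_ns[name] * 100)
    for name in milo_gates_ns
}
print(normalized)
```

Values below 100 indicate that the MILO design is faster than the MISII design for that benchmark.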
Table 20 compares optimization by MILO using only gates and straight logic
optimization by MISII of the examples for area. In four of the five cases, MISII
produces a design with a smaller area. This is to be expected as the MILO logic
optimizer is primarily geared for time optimization. However, MILO's optimization with
module generators compensates by providing area efficient bit-sliced layouts. The
best designs in terms of area were usually achieved when using module generators as
shown in the tables that follow.
Benchmark         MILO   MISII
                  (%)    (%)
Draco2            111    100
Draco3            132    100
Draco Schematic   119    100
Armstrong Cntr.   106    100
Rockwell Cntr.    99     100
Table 20. Comparison of MILO and MISII Area Optimization
Table 21 compares optimization using module generators with optimization
using only gates and optimizing for time. Table 22 shows the same comparison for
area. The tables show that in most of the cases, optimization with the module
generators produced a design with the smallest area and fastest speed. Table 23 and
Table 24 make the same comparison with module generators but use the MISII
results as the base for comparison.
Benchmark         MILO (module gen.)   MILO (gates)
                  (%)                  (%)
Draco2            56                   100
Draco3            133                  100
Draco Schematic   85                   100
Armstrong Cntr.   71                   100
Rockwell Cntr.    113                  100
Table 21. MILO gate vs. MILO module generator designs (Time)
Benchmark         MILO (module gen.)   MILO (gates)
                  (%)                  (%)
Draco2            66                   100
Draco3            66                   100
Draco Schematic   67                   100
Armstrong Cntr.   81                   100
Rockwell Cntr.    91                   100
Table 22. MILO gate vs. MILO module generator designs (Area)
Benchmark         MILO (module gen.)   MISII (gates)
                  (%)                  (%)
Draco2            38                   100
Draco3            97                   100
Draco Schematic   91                   100
Armstrong Cntr.   46                   100
Rockwell Cntr.    105                  100
Table 23. MISII vs. MILO module generator designs (Time)
Benchmark         MILO (module gen.)   MISII (gates)
                  (%)                  (%)
Draco2            73                   100
Draco3            83                   100
Draco Schematic   80                   100
Armstrong Cntr.   86                   100
Rockwell Cntr.    90                   100
Table 24. MISII vs. MILO module generator designs (Area)
These experiments demonstrate the effectiveness of MILO in generating 
efficient designs for either time or area. By optimizing the microarchitecture design 
instead of simply expanding the design and performing logic optimization, superior 
designs can be produced. Further, the results demonstrate flexibility in generating 
designs with different layout styles -- those using only gates and those incorporating 
a bit-slice capacity. 
CHAPTER 7. 
CONCLUSION 
7.1. Summary 
This thesis presented novel approaches to both the problems of logic synthesis 
and microarchitecture optimization. 
First, a method of synthesis for custom layout was presented. Higher-quality 
layouts can be produced by taking layout parameters into account during the syn-
thesis process. Traditional logic synthesis techniques work well for layouts using 
standard cells but achieve less than optimal results for custom layouts. They fail to 
consider layout parameters like transistor sizing and complex gate formation. We 
implemented an algorithm for high-performance CMOS designs that incorporates 
these parameters and compared our results with those of a traditional logic
synthesis system, MISII. Our results demonstrate speed improvements for both custom
and standard cell layout. Further, layout driven synthesis has greater control over
area/time tradeoffs.
Second, this thesis described how to incorporate logic synthesis tools into the
process of microarchitecture optimization. The logic synthesis tools are used to pro-
vide feedback on individual microarchitecture components. The microarchitecture 
optimizer can then use this information to make modifications at the register-
transfer level -- changes having much greater effect than those that can be employed
at the logic level alone. 
Techniques were introduced for improving the microarchitecture structure and 
for employing constraint driven synthesis based on the user's requirements for time 
and area. These techniques include the capability for mixing layout styles such as 
custom layout for random-logic components and bit-slicing for regularly structured 
components. In this manner the entire design, control logic and datapath, can be 
optimized at the same time. Further, a new methodology was presented for 
microarchitecture-level optimization that greatly reduces the amount of 
technology-specific knowledge necessary to perform the optimizations. Microarchi-
tecture components are generated by a database based on a set of parameters from 
the microarchitecture optimization tool. Thus the microarchitecture optimizer does 
not need to deal with multiple logic optimization tools, layout module generators, 
transistor sizing tools, etc. 
7.2. Future Research 
A number of improvements can be made to MILO both at the logic and 
microarchitecture levels. First, at the logic level, more research is needed to refine 
the interaction of factorization, complex gate formation, and transistor sizing to
produce superior results. Improvements should include: (a) re-examining timing after
the initial transistor sizing phase and complex gate formation, then redoing the
transistor sizing based on the new timing information, and (b) performing area
optimization along non-critical paths. 
Second, at the microarchitecture level, improvements to the initial technology-
mapping scheme can be made. Currently the technology-mapping program uses 
all-gate implementations for each of the microarchitecture components. If module 
generator implementations were used for bit-sliceable components, less time would 
be taken to generate and optimize gate-level designs that are later replaced. 
Currently microarchitecture optimization only modifies the design without 
changing the state assignment. However, further timing improvements could be 
obtained by employing "clock-splitting" -- the insertion of registers at strategic 
points to reduce the register to register delays and allow a faster clock to be used. 
In addition to adding registers, the control logic equations would need to be 
modified. 
Finally, feedback could be passed to the microarchitecture optimizer from lay-
out tools. Additional delays are added to the design during the layout phase largely 
due to routing. If after layout, the timing constraints were not met, the microarchi-
tecture optimizer could use delay estimates from layout tools to identify long paths 
and then make structural changes at the microarchitecture level to reduce these 
delays. 
APPENDIX A. 
LOPT Program Description 
A.1. Tutorial 
This section explains how to run "lopt", the logic optimization program. The
input is an IIF description (described in Appendix B); the output is a gate-level
VHDL file. This output VHDL file can then be passed to LES for generation of a
custom layout.
The lopt program can be run interactively or in batch mode as demonstrated in
the examples that follow. The examples illustrate how to use the "lopt" shellscript
to generate a 5-bit up/down counter from an IIF description. The first example
shows the use of "lopt" in the interactive mode. User input to program prompts is
shown in < >. The user is prompted for the report file name in which the
estimated area and delays of the optimized design are placed, the capacitive load
that an output pin must drive, the maximum clock width, maximum setup time,
the maximum allowed delay for an output pin, and a filename in which the
optimized VHDL netlist is placed. If 0 is entered for the delay values, "lopt" will attempt
to generate the best delay. In the interactive mode, only one load and maximum
delay value can be entered. All pins are then assumed to have these values. Using
the batch mode, different values can be specified for separate pins.
To run the program interactively, type: 
lopt -i count5.iif
The user will be prompted as follows:
VHDL Output File? <count5.vhdl>
Enter Report File Name: <report>
Enter the load for output pin: Q_4 <15>
Enter the clock width: <0>
Enter the maximum setup time: <0>
Enter the required delay for output pin: Q_4 <0>
The generated VHDL file is placed in count5.vhdl and can then be passed on to 
LES for layout. 
To run "lopt" in batch mode, four parameters must be included on the
command line: 1) IIF file, 2) VHDL output file name, 3) lopt parameter file, and 4) lopt
report file name. Parameters 2 and 4 are files created by "lopt" so filenames are
required on the command line. Parameters 1 and 3 are required input files to
"lopt".
The following example demonstrates the command line for batch mode. 
To run the program in batch mode, type: 
lopt -i lcnt4.iif -p lcnt4.p -r lcnt4.r -v lcnt4.vhdl
The user will not be prompted for any inputs. Examples of the IIF and
parameter files follow. Next, the file formats for the parameter and report files are
described.
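When "lopt" is driven from a script, the batch command line shown above can be assembled programmatically. The helper below is a hypothetical convenience written for this tutorial and is not part of the lopt distribution; it uses only the flags documented in this appendix (-i, -p, -r, -v, and the optional -o, -t, -m listed in Section A.5) and only builds the argument list without running the program.

```python
def build_lopt_batch_command(iif_file, param_file, report_file, vhdl_file,
                             opt_type=None, map_type=None, mis_script=None):
    """Assemble the argument list for a batch-mode lopt run (hypothetical helper)."""
    command = ["lopt", "-i", iif_file, "-p", param_file,
               "-r", report_file, "-v", vhdl_file]
    optional = (
        ("-o", opt_type, ("a", "t")),    # optimization type: area or time
        ("-t", map_type, ("a", "t")),    # technology mapping type: area or time
        ("-m", mis_script, ("b", "m")),  # MISII script: best or memory conserving
    )
    for flag, value, allowed in optional:
        if value is None:
            continue  # leave the documented default in effect
        if value not in allowed:
            raise ValueError(f"{flag} must be one of {allowed}, got {value!r}")
        command += [flag, value]
    return command


print(" ".join(build_lopt_batch_command("lcnt4.iif", "lcnt4.p", "lcnt4.r", "lcnt4.vhdl")))
```

The returned list can be passed to a process launcher of the reader's choice; validating the one-letter option values up front avoids starting a run with an unsupported setting.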
A.2. Input Files 
File count5.iif: 
NAME= Counter_5_Load_Enable_Updown;
INORDER= D_0 D_1 D_2 D_3
D_4 CLK LOAD ENA
DWUP;
OUTORDER= Q_0 Q_1 Q_2 Q_3
Q_4 MINMAX RCLK;
C_0=1;
CLKO=(CLK@Cl ENA));
C_1=(((C_0*Q_0)*!DWUP)+((C_0*!Q_0)*DWUP));
Q_0=( (Q_0!=C_0)@((-r CLKO)-a( (0/(!LOAD*!D_0) ),(1/(!LOAD*D_0)))) );
C_2=(((C_1*Q_1)*!DWUP)+((C_1*!Q_1)*DWUP));
Q_1=( (Q_1!=C_1)@((-r CLKO)-a( (0/(!LOAD*!D_1) ),(1/(!LOAD*D_1)))) );
C_3=(((C_2*Q_2)*!DWUP)+((C_2*!Q_2)*DWUP));
Q_2=( (Q_2!=C_2)@((-r CLKO)-a( (0/(!LOAD*!D_2) ),(1/(!LOAD*D_2)))) );
C_4=(((C_3*Q_3)*!DWUP)+((C_3*!Q_3)*DWUP));
Q_3=( (Q_3!=C_3)@((-r CLKO)-a( (0/(!LOAD*!D_3) ),(1/(!LOAD*D_3)))) );
OVFUNF=(((C_4*Q_4)*!DWUP)+((C_4*!Q_4)*DWUP));
Q_4=( (Q_4!=C_4)@((-r CLKO)-a( (0/(!LOAD*!D_4) ),(1/(!LOAD*D_4)))) );
MINMAX=(CLK*OVFUNF);
RCLK=((CLK*OVFUNF)+!OVFUNF);
File counter5.p: 
rdelay Q_0 40
oload Q_0 10
rdelay Q_1 40
oload Q_1 10
rdelay Q_2 40
oload Q_2 10
rdelay Q_3 38
oload Q_3 26
rdelay Q_4 19
oload Q_4 19
rdelay MINMAX 40
oload MINMAX 10
rdelay RCLK 40
oload RCLK 10
rsetup 30
cwidth 20
A.3. LOPT Parameter File Format 
The parameter file provides constraints to the optimizer on: worst case delay to 
an output pin, the load on an output pin, the worst setup time for any input pin, 
and the maximum clock width. Delays are entered in terms of nanoseconds. Load 
on output pins is entered in terms of number of unit sized transistors that need to 
be driven. For example, a load of 10 means the output gate must be able to drive 
10 unit sized transistors. 
The format of the parameter file is as follows: 
rdelay output_name worst_delay 
oload output_name load 
rsetup worst_setup_time 
cwidth maximum_clock_width 
Example parameter file (note that order of the parameters does not matter): 
oload Q[0] 15
oload Q[3] 15
cwidth 4.0
rsetup 9.0
oload Q[1] 10
rdelay Q[3] 10.89
rdelay Q[1] 4.0
rdelay Q[2] 5.0
rdelay Q[0] 15.89
oload Q[2] 10
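The parameter file format described in this section is simple enough to read with a few lines of Python. The parser below is a sketch written for illustration and is not the reader used by "lopt"; the function name and the layout of the returned dictionary are assumptions. It relies only on the four keywords defined above and accepts them in any order.

```python
def parse_parameter_file(text):
    """Parse rdelay/oload/rsetup/cwidth lines into a dictionary (illustrative)."""
    params = {"rdelay": {}, "oload": {}, "rsetup": None, "cwidth": None}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        keyword = fields[0]
        if keyword in ("rdelay", "oload"):
            # keyword output_name value
            params[keyword][fields[1]] = float(fields[2])
        elif keyword in ("rsetup", "cwidth"):
            # keyword value
            params[keyword] = float(fields[1])
        else:
            raise ValueError(f"unknown parameter keyword: {keyword}")
    return params


example = """oload Q[0] 15
cwidth 4.0
rsetup 9.0
rdelay Q[0] 15.89
"""
print(parse_parameter_file(example))
```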
A.4. LOPT Report File Output Format 
The report file contains information on path delays and area of the design. 
There are five information types that are reported: the combinational delay from an
input to an output (CD), the setup delay for an input pin (SD), the minimum clock 
width (CW), the transistor area (TA), and the number of transistors (NT). 
The format of the report file is as follows: 
CW minimum_clock_width
CD input_name output_name delay
SD input_name setup_time
TA transistor_area 
NT number_of_transistors 
Example report file: 
CW 36.566666
CD CLK Q[3] 9.800000
CD CLK Q[2] 6.750000
CD CLK Q[1] 8.100000
CD CLK Q[0] 10.684999
SD UPDOWN 28.466667 
TA 722 
NT 212 
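A report file in this format can be post-processed with a small reader such as the one sketched below. It is an illustration only: the function name and the returned structure are assumptions of this example, and the only tags handled are the five listed above.

```python
def parse_report_file(text):
    """Parse CW/CD/SD/TA/NT lines of a lopt report into a dictionary (illustrative)."""
    report = {"CW": None, "CD": {}, "SD": {}, "TA": None, "NT": None}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        tag = fields[0]
        if tag == "CD":
            # CD input_name output_name delay
            report["CD"][(fields[1], fields[2])] = float(fields[3])
        elif tag == "SD":
            # SD input_name setup_time
            report["SD"][fields[1]] = float(fields[2])
        elif tag in ("CW", "TA"):
            report[tag] = float(fields[1])
        elif tag == "NT":
            report[tag] = int(fields[1])
        else:
            raise ValueError(f"unknown report tag: {tag}")
    return report


example = """CW 36.566666
CD CLK Q[3] 9.800000
SD UPDOWN 28.466667
TA 722
NT 212
"""
report = parse_report_file(example)
print(max(report["CD"].values()))
```

Keying the combinational delays by (input, output) pair makes it easy to look up the worst path to a given output pin.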
A.5. LOPT Usage
"lopt" is a shellscript for performing logic optimization and technology mapping
for custom layout from an IIF input description. The shellscript can be run without 
providing any parameters -- the user will be prompted interactively for required 
information. Parameters can also be used on the command line to reduce the 
amount of prompting or to run the shellscript completely in batch mode. 
Shellscript Usage: 
lopt [-i IIF _input_fileJ 
[-o optimization_type] [-t technology _rnapping_type] 
[-m mis_script_style] 
[-p parameterfile] [-r reportfile] [-v VHDL_output_file] 
Options: 
-1: IIF input file 
-o: Optimzation type. Takes value 11a 11 for area, "t" for time. 
Default value is "a". 
-t: Technology mapping type. Takes value "a" for area, "t 11 for time. 
Default value is "a". 
-m: MIS II script type. Takes value "b 11 for best optimization, 
"m" for memory conservation. Using the "b 11 option, MISII 
generates all minterms and hence can run out of memory. This 
will create a memory fault. Using the 11m" option will avoid 
this problem but will not do as good of an optimization. 
Default value is "b". 
-p: Input parameter file for "lesopt" technology mapping program. 
The parameter file format is described in the file "parm.spec". 
-r: Output report file for "lesopt" technology mapping program. 
The report file format is described in the file "report.spec". 
-v: VHDL output file 
Examples of Use: 
lopt 
lopt -i ex/ex1.iif 
lopt -i ex/lcnt4.iif -v lcnt4.vhdl 
lopt -i ex/lcnt4.iif -v lcnt4.vhdl -o t -t a -m m lcnt4.vhdl 
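For batch runs it can help to assemble the command line programmatically. The Python sketch below only builds an argument list from the options documented above; it does not run lopt, and the function name build_lopt_command and its keyword names are assumptions made for illustration.

```python
def build_lopt_command(iif_file=None, opt_type=None, map_type=None,
                       mis_style=None, param_file=None, report_file=None,
                       vhdl_file=None):
    """Build an argv list for lopt (hypothetical helper, not part of lopt)."""
    # Validate the single-letter options against the values listed in the usage text.
    for flag, value, allowed in (("-o", opt_type, "at"),
                                 ("-t", map_type, "at"),
                                 ("-m", mis_style, "bm")):
        if value is not None and (len(value) != 1 or value not in allowed):
            raise ValueError(f"{flag} must be one of {list(allowed)}, got {value!r}")

    argv = ["lopt"]
    for flag, value in (("-i", iif_file), ("-o", opt_type), ("-t", map_type),
                        ("-m", mis_style), ("-p", param_file),
                        ("-r", report_file), ("-v", vhdl_file)):
        if value is not None:
            argv += [flag, value]  # omitted options are left for lopt to prompt for
    return argv


lopt_argv = build_lopt_command(iif_file="ex/lcnt4.iif", vhdl_file="lcnt4.vhdl",
                               opt_type="t", map_type="a", mis_style="m")
print(" ".join(lopt_argv))
```

Options that are left out are simply not emitted, which matches the documented behavior that lopt prompts interactively for anything not given on the command line.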
APPENDIX B. 
IIF 
B.1. Introduction 
In order to describe microarchitecture level components (such as counters, shift 
registers, etc.), we need a format capable of describing sequential, asynchronous 
behavior, and I/O conversion. In this document, we define the Irvine Intermediate 
Form (IIF) which extends a boolean expression language with clocking and asyn-
chronous behavior in order to describe generic components composed of logic gates, 
flip-flops with asynchronous set and reset, and interface components. MILO can 
accept an IIF description, and generate a netlist of logic gates and complex cells. 
IIF is an extension of the Berkeley EQN (equation) format for describing 
boolean expressions. Besides providing the basic boolean operations of AND, OR, 
NOT, XOR, and XNOR, IIF contains operators for specifying a D flip-flop with 
asynchronous set and reset and operators for tristate, delay, and wire-or. Any com-
ponent in the GENUS library is representable in IIF. An IIF file has the following 
format: 
design name 
input/output declarations 
list of equations 
An IIF description is divided into 2 parts. The first part describes the input 
and output signal names. The second part is the design description. All the exter-
nal signals must be declared before they are used in the second part. Temporary 
signals are not declared. Each IIF statement is delimited by ;. 
B.2. IIF declarations 
An IIF declaration consists of the following elements. 
NAME = name of design; 
INORDER = input signal list; 
OUTORDER = output signal list; 
The declaration part of IIF specifies the name of the design and external inputs and 
outputs of the design. The keyword NAME is used to assign the design name. 
Input signals are declared in the INORDER signal list and output signals in the 
OUTORDER signal list. Temporary signals (i.e., those that are not external inputs 
or outputs to the design) are not declared. 
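Since every IIF statement ends with a semicolon and the declarations use fixed keywords, the declaration part can be separated from the equations with simple string handling. The Python sketch below is an illustration only, not the MILO reader; the function name parse_iif_declarations and the assumption that keywords are matched case-sensitively are hypothetical.

```python
def parse_iif_declarations(text):
    """Split IIF text into declarations and equations (illustrative sketch only)."""
    decl = {"NAME": None, "INORDER": [], "OUTORDER": []}
    equations = []
    for statement in text.split(";"):        # each IIF statement is delimited by ;
        statement = statement.strip()
        if not statement:
            continue
        left, sep, right = statement.partition("=")  # split at the first '='
        key = left.strip()
        if sep and key == "NAME":
            decl["NAME"] = right.strip()
        elif sep and key in ("INORDER", "OUTORDER"):
            decl[key] = right.split()         # signal names are blank-separated
        else:
            equations.append(statement)       # anything else is a design equation
    return decl, equations


SAMPLE_IIF = """
NAME = REGISTER4;
INORDER = Load I0 I1 I2 I3 CP Clear;
OUTORDER = A0 A1 A2 A3;
not_load = !Load;
"""

iif_decl, iif_equations = parse_iif_declarations(SAMPLE_IIF)
print(iif_decl["NAME"], iif_decl["INORDER"], iif_decl["OUTORDER"])
```

Any signal that appears in an equation but not in INORDER or OUTORDER is, by the rule above, a temporary signal.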
B.3. IIF expression 
An IIF expression is a boolean equation with some new operators. The following 
are the operators of IIF. 
Binary operators: 
+      OR 
*      AND 
!=     XOR 
==     XNOR 
=      assignment 
-t     tristate 
-w     wire or 
@      AT (for specifying the D flip-flop clock) 
-a     asynchronous assignment for D flip-flop 
/      asynchronous AT 
Unary operators: 
!      NOT 
-f     falling edge trigger clocking (for flip-flop) 
-r     rising edge trigger clocking (for flip-flop) 
-h     level clocking active at high (for latch) 
-l     level clocking active at low (for latch) 
The +, *, !, !=, and == operators are standard Boolean operations. Tristate 
operations are represented as: IO -t Control, where IO is the input and Control 
determines whether the output is IO or a high impedance state. IIF extends boolean 
expressions with operators for expressing D flip-flops with asynchronous set and 
reset inputs (@, -a). The -f, -r, -h, and -l operators are used to specify the 
clocking of the D flip-flop. 
Sequential logic is represented in the form: 
Var = (boolean input equation) 
@ (clock_operator clock_expression) 
-a (asynchronous_set_reset_expression) 
For example, a rising-edge-triggered D flip-flop with asynchronous set and reset 
is described as follows: 
Q = (D @ -r clk) -a (0/!reset, 1/!set) 
The expression Q = D @ -r clk means that Q is set to D at the rising edge 
(denoted by -r) of the signal clk. The expression -a (0/!reset, 1/!set) indicates that 
Q is set to 0 when the reset signal is 0 (denoted by 0/!reset) and is set to 1 when 
the set signal is 0 (denoted by 1/!set). The asynchronous expression is a list of 
descriptions of the form value/condition: when the condition expression 
is true, the output is set to value. 
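The behavior of such a sequential expression can be illustrated with a small next-state function. The Python sketch below is a behavioral model written for this explanation, not code from MILO; the name dff_step and the rule that the first true entry in the asynchronous list wins when several conditions hold at once are assumptions, since the text does not define that case.

```python
def dff_step(q, d, clk_prev, clk_now, async_list, edge="rising"):
    """Return the next value of Q for one evaluation step (illustrative model).

    async_list holds (value, condition) pairs, mirroring value/condition.
    Assumption: asynchronous entries override the clock, and the first
    entry whose condition is true takes priority.
    """
    for value, condition in async_list:
        if condition:
            return value
    if edge == "rising":
        triggered = (clk_prev == 0 and clk_now == 1)
    else:  # "falling"
        triggered = (clk_prev == 1 and clk_now == 0)
    return d if triggered else q  # hold the old value when no active edge occurs


# Active-low reset and set, written as 0/!reset and 1/!set in IIF.
reset, set_ = 1, 1
q_clocked = dff_step(0, 1, 0, 1, [(0, not reset), (1, not set_)])

reset = 0
q_after_reset = dff_step(1, 1, 0, 1, [(0, not reset), (1, not set_)])
print(q_clocked, q_after_reset)
```

The edge argument selects which clock transition loads D, so the same function covers both edge-triggered clocking operators.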
B.4. Examples of IIF Usage 
Several examples of IIF descriptions are now presented. 
The first example is a register with parallel load. A schematic of the register is 
shown in Figure 49. The IIF description is as follows: 
NAME = REGISTER4; 
INORDER = Load I0 I1 I2 I3 CP Clear; 
OUTORDER = A0 A1 A2 A3; 
not_load = !Load; 
Clr = -b Clear; 
A0 = ((I0*Load) + (A0*not_load)) @ (-f CP) -a (0/!Clr); 
A1 = ((I1*Load) + (A1*not_load)) @ (-f CP) -a (0/!Clr); 
A2 = ((I2*Load) + (A2*not_load)) @ (-f CP) -a (0/!Clr); 
A3 = ((I3*Load) + (A3*not_load)) @ (-f CP) -a (0/!Clr); 
A second example shows how the 4-bit adder/subtracter of Figure 50 can be 
represented in IIF. 
Figure 49. Register with Parallel Load 
Figure 50. Four-bit Adder/Subtracter 
IIF file for a 4-bit adder/subtracter: 
NAME=adder_subtracter4; 
INORDER=A[0] A[1] A[2] A[3] B[0] B[1] B[2] B[3] ADDSUB; 
OUTORDER=O[0] O[1] O[2] O[3] COUT; 
B1[0]=ADDSUB!=B[0]; 
B1[1]=ADDSUB!=B[1]; 
B1[2]=ADDSUB!=B[2]; 
B1[3]=ADDSUB!=B[3]; 
C[0]=ADDSUB; 
O[0]=A[0]!=B1[0]!=C[0]; 
C[1]=A[0]*B1[0]+C[0]*A[0]+C[0]*B1[0]; 
O[1]=A[1]!=B1[1]!=C[1]; 
C[2]=A[1]*B1[1]+C[1]*A[1]+C[1]*B1[1]; 
O[2]=A[2]!=B1[2]!=C[2]; 
C[3]=A[2]*B1[2]+C[2]*A[2]+C[2]*B1[2]; 
O[3]=A[3]!=B1[3]!=C[3]; 
C[4]=A[3]*B1[3]+C[3]*A[3]+C[3]*B1[3]; 
COUT=C[4]; 
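The adder/subtracter follows the usual ripple-carry pattern: each B input is XORed with ADDSUB, and ADDSUB also serves as the carry-in, so the circuit computes A + B when ADDSUB is 0 and A - B in two's complement when ADDSUB is 1. The Python sketch below evaluates that pattern bit by bit and compares it against integer arithmetic; the function name is hypothetical and the code is an illustration, not output of MILO.

```python
def adder_subtracter4(a, b, addsub):
    """Evaluate the 4-bit ripple-carry adder/subtracter equations (illustrative)."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    B1 = [addsub ^ B[i] for i in range(4)]   # B1[i] = ADDSUB != B[i]
    C = [addsub]                             # C[0] = ADDSUB
    O = []
    for i in range(4):
        O.append(A[i] ^ B1[i] ^ C[i])        # sum bit: three-input XOR
        # carry: majority of A[i], B1[i], C[i]
        C.append((A[i] & B1[i]) | (C[i] & A[i]) | (C[i] & B1[i]))
    result = sum(bit << i for i, bit in enumerate(O))
    return result, C[4]                      # (O[3..0] as an integer, COUT)


# Exhaustive check against integer arithmetic on 4-bit operands.
all_match = True
for a in range(16):
    for b in range(16):
        add_result, add_cout = adder_subtracter4(a, b, 0)
        sub_result, sub_cout = adder_subtracter4(a, b, 1)
        all_match &= (add_result == (a + b) & 0xF and add_cout == (a + b) >> 4)
        all_match &= (sub_result == (a - b) & 0xF and sub_cout == int(a >= b))
print(all_match)
```

In subtract mode COUT is 1 exactly when no borrow occurs, that is, when A is greater than or equal to B.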
Finally, a third example shows the use of tristate and wire-or operators in IIF. 
Note that because of the priority of tristate operators, it is important to use 
parentheses to ensure that you get the behavior you desire. For example, "A + B 
-t C" is interpreted as "(A + B) -t C". The schematic for a tristate connection is 
shown in Figure 51. 
IIF file for the tristate connection: 
NAME = tristate3; 
INORDER = A B C E; 
OUTORDER = F; 
t1 = A -t E; 
t2 = B -t E; 
t3 = C -t E; 
F = t1 -w t2 -w t3; 
Figure 51. Tristate Connection 
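The interaction of the tristate and wire-or operators can be modelled with a three-valued signal. In the Python sketch below, None stands for the high impedance state; the helper names, the active-high control polarity, and the choice to OR together all driven values are assumptions made for illustration, since the text does not spell out these details.

```python
Z = None  # stand-in for the high impedance state


def tristate(value, control):
    """value -t control: pass value when control is 1, else high impedance.

    Assumption: the control input is active high.
    """
    return value if control else Z


def wire_or(*drivers):
    """d1 -w d2 -w ...: OR of all driven values; Z if nothing drives the wire."""
    driven = [d for d in drivers if d is not Z]
    if not driven:
        return Z
    return int(any(driven))


def tristate3(a, b, c, e):
    """Model of F = (A -t E) -w (B -t E) -w (C -t E)."""
    return wire_or(tristate(a, e), tristate(b, e), tristate(c, e))


f_enabled = tristate3(0, 1, 0, 1)   # drivers enabled, one input is 1
f_disabled = tristate3(0, 1, 0, 0)  # all drivers disabled
print(f_enabled, f_disabled)
```

Because all three drivers share the enable E in this example, F is either the OR of A, B, and C or high impedance.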
B.5. Operator Precedence 
The operators are ranked below from highest precedence to lowest. 
Operators on the same line have the same precedence. 
(1) ! 
(2) !=, == 
(3) * 
(4) =, -t, @, /, +, -w, -a 
APPENDIX C. 
MILO Program Description 
C.1. Introduction 
This section explains how to run "milo", starting from a VHDL description. 
The output is a structural VHDL file consisting of microarchitecture components. 
C.2. Milo Usage 
"milo" is a program for performing microarchitecture optimization from a 
VHDL input description. The program can be run without providing any parame-
ters -- the user will be prompted interactively for required information. Parameters 
can also be used on the command line to reduce the amount of prompting or to run 
the program completely in batch mode. 
Milo Usage: 
milo [-i VHDL_input_file] [-p parameter_file] 
[-r report_file] [-o output_VHDL_file] [-g] 
Options: 
-i: VHDL input file 
-p: Input parameter file for specifying required delays and capacitive 
loads on output pins. 
-r: Output report file describing transistor count and timing of 
the optimized design. 
-v: VHDL output file 
-g: Optimize for gates only, do not use module generators for 
bit-sliced components. 
C.3. Milo Parameter File Format 
The parameter file provides constraints to the optimizer on the worst-case delay to 
an output pin, the load on an output pin, and the maximum clock width. 
The format of the parameter file is as follows: 
rdelay output_name worst_delay 
oload output_name load 
cwidth maximum_clock_width 
Example parameter file (note that order of the parameters does not matter): 
oload Q[0] 15 
oload Q[3] 15 
cwidth 4.0 
oload Q[1] 10 
rdelay Q[3] 10.89 
rdelay Q[1] 4.0 
rdelay Q[2] 5.0 
rdelay Q[0] 15.89 
oload Q[2] 10 
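Because the order of the parameters does not matter, a reader for this file only needs to dispatch on the first word of each line. The Python sketch below is an illustration, not the MILO parser; the function name parse_milo_parameters and the returned dictionary layout are assumptions.

```python
def parse_milo_parameters(text):
    """Parse a milo-style parameter file into a dict (illustrative sketch)."""
    params = {"rdelay": {}, "oload": {}, "cwidth": None}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = fields[0]
        if key in ("rdelay", "oload"):
            # rdelay output_name worst_delay / oload output_name load
            params[key][fields[1]] = float(fields[2])
        elif key == "cwidth":
            params["cwidth"] = float(fields[1])
        else:
            raise ValueError(f"unknown parameter keyword: {key}")
    return params


EXAMPLE_PARAMETERS = """\
oload Q[0] 15
oload Q[3] 15
cwidth 4.0
oload Q[1] 10
rdelay Q[3] 10.89
rdelay Q[1] 4.0
rdelay Q[2] 5.0
rdelay Q[0] 15.89
oload Q[2] 10
"""

milo_params = parse_milo_parameters(EXAMPLE_PARAMETERS)
print(milo_params["cwidth"], sorted(milo_params["rdelay"]))
```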
C.4. MILO Report File Output Format 
The report file contains information on path delays and area of the design. 
There are four information types that are reported: the worst-case combinational 
delay to an output (WD) -- measured in nanoseconds, the minimum clock width 
(CW) -- measured in nanoseconds, the transistor area (TA) -- measured in square 
micrometers (active transistor area only, routing area is not included), and the 
number of transistors (NT). 
The format of the report file is as follows: 
CW minimum_clock_width 
WD output_name worst_case_delay 
TA transistor_area 
NT number_of_transistors 
Example report file: 
CW 36.566666 
WD Q[3] 9.800000 
WD Q[2] 6.750000 
WD Q[1] 8.100000 
WD Q[0] 10.684999 
TA 722 
NT 212 
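A report of this kind can be compared against the delay constraints given to the optimizer. The Python sketch below parses the four record types and lists the outputs whose worst-case delay exceeds a required delay; it is an illustration rather than part of MILO, the function names are hypothetical, and the constraint values are taken from the example parameter file purely to have numbers to compare (the two example files are not stated to belong to the same run).

```python
def parse_milo_report(text):
    """Parse a milo-style report into a dict (illustrative sketch)."""
    report = {"CW": None, "WD": {}, "TA": None, "NT": None}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "WD":
            report["WD"][fields[1]] = float(fields[2])  # WD output_name delay
        elif tag in ("CW", "TA"):
            report[tag] = float(fields[1])
        elif tag == "NT":
            report["NT"] = int(fields[1])
        else:
            raise ValueError(f"unknown report tag: {tag}")
    return report


def find_violations(report, rdelay, cwidth=None):
    """Return (outputs slower than required, whether the clock width is met)."""
    late = sorted(name for name, delay in report["WD"].items()
                  if name in rdelay and delay > rdelay[name])
    clock_ok = cwidth is None or report["CW"] is None or report["CW"] <= cwidth
    return late, clock_ok


EXAMPLE_MILO_REPORT = """\
CW 36.566666
WD Q[3] 9.800000
WD Q[2] 6.750000
WD Q[1] 8.100000
WD Q[0] 10.684999
TA 722
NT 212
"""

milo_report = parse_milo_report(EXAMPLE_MILO_REPORT)
required = {"Q[3]": 10.89, "Q[1]": 4.0, "Q[2]": 5.0, "Q[0]": 15.89}
late_outputs, clock_met = find_violations(milo_report, required, cwidth=4.0)
print(late_outputs, clock_met)
```

The clock check treats the reported minimum clock width as acceptable when it does not exceed the maximum clock width constraint, which is an interpretation of the two quantities rather than something the report format itself states.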
[Arms89] 
[BiBr88) 
[BoHa87] 
[BrFa85J 
[Br84) 
[Br86) 
[BrRu87] 
[BuMa85) 
[CaRo85) 
[ChGa90) 
[CoBa85) 
[Chu65] 
[Ci87] 
(DaJo80] 
1
--
, I 
BIBLI(x;RAPHY 
Armstrong, J., "Chip Level Modeling with VHDL", Prentice Hall, 1989. 
Birmingham, vV.P, Brennan, A.: Gupta, A.P., and Siewiorek, D.P. 1 
"MICON: A Single-Board Computer Synthesis Tool", IEEE Circuits 
and Devices j\;fagazine, January, 1988. 
Bostick, D., Hachtel, G., Jacoby, R., Lightner, M., Moceyunas, P.~ 
Morrison, C., Davenscroft, D., "The Boulder Optimal Logic Design Sys-
tem", ICCAD, 1987. 
Brownston, L., Farrell, R., Kant, E., and Martin, M., "Programming 
Expert Systems in OPS5: An Introduction to Rule-Based Program-
ming", Addison vVesley Publishing Company, 1985, pp. 228-239. 
Brayton, R., et al., "ESPRESSO IIC: Logic Minimization Algorithms for 
VLSI Synthesis", Kluwer Academic Publishers, Netherlands, 1984. 
Brayton, R., et al., "Multiple-Level Logic Optimization System", 
ICCAD, 1986. 
Brayton, R., Rudell, R., Sangiovanni-Vincentell, A. and Wang, A., 
"MIS: A Mutiple-Level Logic Optimization System", IEEE Transac-
tions on Computer-Aided Design, Vol. CAD-6, No. 6, Nov. 1987. 
Buric, M.R., and Matheson, T.G., "Silicon Compilation Environments", 
Proceedings of the Custom Integrated Circuits Conference, May, 1985. 
Camposano, R., and RosenstieL W., "A Design Environment for the 
Synthesis of Integrated Circuits" 1 llth EUROMICRO Symposium on 
Microprocessing and Microprogramming, Sept. 1985. 
Chen, G.D., and Gajski, D.D., "An Intelligent Component Database 
For Behavioral Synthesis", 27th DAG, 1990. 
Cohen, W., Bartlett, K., and dt' Geus, A., "Impact of Metarules in a 
Rule Based Expert System for Gate Level Optimization", Proc. IEEE 
Int'l. Symp. on Circuits and Systems, May 1985. 
Chu, Y., "An ALGOL-like Computer Design Language", Communica-
tions ACM, Oct. 1965. 
Cirit, Mehmet A., "Transistor Sizing in CMOS Circuits;', 24th DAG, 
1987. 
Darringer, J., Joyner, W., "A New Look at Logic Synthesis", 17th 
Design Automation Conference, 1980. 
178 
[DaJo81] Darringer, J., JoyneL \V .. Berman. C., and Trevillyan, L .. "Logic Syn-
thesis Through Local Transformations". IBA! .J. Res. Develop., 25, no. 
4, July 1981. 
[DoLe89) Domic, A., Levitin, S., Phillips. >I., Thai, C., Shiple, T., Bhavsar, D., 
and Bissell, C .. "CLEO: a C:\IOS Layout Generator", ICCA..D, 1989. 
[Dutt88) Dutt, N.D., "GENl-S: A Generic Component Library for High Level 
Synthesi_s", Technical Report 88-22, University of California, Irvine, 
Sept. 1988. 
[EnNa85) Enomoto, K., Nakamura, S., Ogihara, T., and Murai, S., "LORES-2: A 
Logic Reorganization System", IEEE Design & Test, October 1985. 
[FiDu85) Fishburn, J. P ., and Dunlop, A. E., "TILOS: A Posynomial Program-
ming Approach to Transistor Sizing", ICC AD, 1985. 
[Fo85) Forgy, C., "OPS83 User's Manual and Report", 1985. 
[GaGr84) Garrison, K., Gregory, D., Cohen, vV., and de Geus, A., "Automatic 
Area and Performance Optimization of Combinational Logic", Proc. 
IEEE Int'l. Conference on Computer-Aided Design , 1984. 
[GeCo85] de Geus, A. and Cohen, \V., "A Rule-Based System for Optimizing 
Combinational Logic", IEEE Design & Test~ August 1985. 
[GrBa86] Gregory, D., Bartlett, K., de Geus, A., and Hachtel, G., "SOCRATES: 
A System for Automatically Synthesizing and Optimizing Combina-
tional Logic", 23rd Design Automation Conference, 1986. 
[GuPa90] Gupta, G., Pastorello, D., and House, G., "Timing Optimizations in a 
High-Level Synthesis System", ICCD, 1990. 
[He87] Hedlund, Kye S., "Aesop: A Tool for Automated Transistor Sizing", 
24th DAG, 1987. 
[JoMc87] Johannsen, D., McElvain, K., and Tsubota, S., "Intelligent Compila-
tion", VLSI Systems Design, April 1987. 
[JoTr86] Joyner, W., Trevillyan, Y., Brand, D., Nix, T., and Gundersen, S., 
"Technology Adaptation in Logic Synthesis", 23rd Design Automation 
Conference, 1986. 
[Kim87] Kim, J., "Artificial Intelligence helps cut ASIC Design Time", Elec-
tronic Design, June 11, 1987. 
[Ke87] Keutzer, K., "DAGON:. Technology Binding and Local Optimization by 
DAG Matching", 24th Design Automation Conference, 1987. 
[KoLu88) Kollaritsch, P., Lusky, S., Prasad, S., and Potter, N., "CLAY: A 
Malleable-cell Transistor Matrix Approach for CMOS Layout 
li9 
Synthesis". ICCA.D. 1988. 
[LiGa88] Lin, Y-L. Ste1;e, and Gajski~ D., "LES: A Layout Expert System". IEEE 
Trans. on Computer-Aided Design, August 1988. 
[LiGa89] Lis, J.S., and Gajski, D.D., "VHDL Synthesis Using Structured :Vlodel-
ing", 26th DAG~ 1989. 
[:VIcDe82] ~1cDermott, J., "Rl: A Rule-Based Configurer of Computer Systems". 
Artificial Intelligence, 19(1) (1982). 
[McPa88] McFarland, ::Vf., Parker, A., and Camposano, R., "Tutorial on High-
Level Synthesis", 25th Design Automation Conference, 1988. 
[ObKa88] Obermeier, Fred W.,- and Katz, Randy H., "An Electrical Optimizer 
that Considers Physical Layout", 25th DAG, 1988. 
[OrGa86] Orailoglu, A., and Gajski, D., "Flow Graph Representation", 23rd DAG, 
June, 1986. 
[PaGa87] Pangrle, B.M., and Gajski, D.D., "Design Tools for Intelligent Silicon 
Compilation", IEEE Transactions on Computer-Aided Design, Vol. 6, 
:No. 6, November 1987. 
[PaKn87] Paulin, P.G., and Knight, J.P., "Force-Directed Scheduling for the 
Behavioral Synthesis of ASIC's", IEEE Transactions on Computer-
Aided Design, Vol. 8, No. 6, June 1987. 
[PaPM86] Parker, A.C., Pizarro, J., and Milnar, M., "MAHA: A Program for 
Datapath Synthesis", 23rd DAG, 1986. 
[SiSV90) Singh, K.J., and Sangiovanni-Vincentelli, A., "A Heuristic Algor~thm 
for the Fanout Problem", 27th DAG, 1990. 
[StMu86) Stroud, C.E, Munoz, R.R., and Pierce, D.A., "CONES: A System for 
Automated Synthesis of VLSI and Programmable Logic From 
Behavioral Models", ICCAD, 1986. 
[Tj86] Tjiang, S., "Twig Reference Manual", January 1986. 
[TrDi89) Trick, M.T., and Director, S.W., "LASSIE: Structure to Layout for 
Behavioral Synthesis Tools", 26th DAG, 1989. 
[TsSi86] Tseng, C.J., and Siewiorek, D.P., "Automated Synthesis of Data Paths 
in Digital Systems", IEEE Transactions on Computer-Aided Design, 
Vol. 5, No. 3, July 1986. 
[TsWe88) Tseng, C.J., Wei, R.S., Rothweiler, S.G., Tong, M.M., and Bose, A.K., 
"Bridge: A Versatile Behavioral Synthesis System", 25th DAG, 1988. 
180 
[VaGa88] Vander Zanden, N.B.~ and Gajski, D.D., ":\HLO: A :\,1icroarchitecture 
and Logic Optimizer", 25th DA.C. 1988. 
[WeRo88] Wei, R.S, Rothweiler, R., and .Jou, J.Y., "BECO:V'lE: Behavior Level 
Circuit Synthesis Based On Structure Ylapping", 25th DA..C 1988. 
[vVuGa90] vVu, C.H., and Gajski, D.D., "Silicon Compilation from Register-
Transfer Schematics", International Symposium on Circuits and Sys-
tems, 1990. 
[vVuVG90) Wu, C.H., Vander Zanden, ?'J., and Gajski, D.D., "A New Algorithm for 
Transistor Sizing in CMOS Circuits", European Design Automation 
Conference, 1990. 
[YeGh88] Yen, H.C., Ghanta, S., and Du, H.C., "A Path Selection Algorithm for 
Timing Analysis", 25th DAG, 1988. 
VITA 
Nels Blake Vander Zanden was born on September 8, 1965, in Columbus, Ohio. 
He received a B.S. in computer science from Ohio State University in 1984 and a 
M.S. in computer science from the University of Illinois in 1986. During his term as 
a doctoral candidate, he received a university fellowship in 1984 and later worked as 
a research assistant at UIUC. Mr. Vander Zanden also worked during the summer 
of 1987 at Applied Micro Circuits Corporation in San Diego, CA, and taught computer 
hardware design classes at the University of California, Irvine in the winter 
quarters of 1990 and 1991. His non-academic interests include tennis, volleyball, 
and music. 
