Implementation of arithmetic primitives using truly deep submicron technology (TDST) by Eshraghian, Sholeh
Edith Cowan University 
Research Online 
Theses: Doctorates and Masters Theses 
1-1-2004 
Implementation of arithmetic primitives using truly deep 
submicron technology (TDST) 
Sholeh Eshraghian 
Edith Cowan University 
Recommended Citation 
Eshraghian, S. (2004). Implementation of arithmetic primitives using truly deep submicron technology 
(TDST). https://ro.ecu.edu.au/theses/771 
This Thesis is posted at Research Online. 
https://ro.ecu.edu.au/theses/771 
Implementation of Arithmetic Primitives Using Truly 
Deep Submicron Technology (TDST) 
A Thesis submitted for the degree of 
Master of Engineering Science 
at 
School of Engineering and Mathematics 
Faculty of Communications, Health and Science 
Edith Cowan University 
by 
Sholeh Eshraghian 
Principal Supervisor: Dr Stefan Lachowicz 
February 2004 
USE OF THESIS 
 
 
The Use of Thesis statement is not included in this version of the thesis. 
DECLARATION 
I certify that this thesis does not, to the best of my knowledge and belief 
(i) incorporate without acknowledgement any material previously submitted for a 
degree or diploma in any institution of higher education; 
(ii) contain any material previously published or written by another person except 
where due reference is made in the text; or 
(iii) contain any defamatory material. 
Signature 
Acknowledgements 
First and foremost I praise and thank God and Baha'u'llah for giving me the opportunity, 
ability and strength needed to complete this thesis. Then I would like to express my deepest 
and sincerest gratitude to my supervisor Dr Stefan Lachowicz, for his vision, effort, 
flexibility, guidance and encouragement and for allowing me to utilise various approaches 
towards my research. 
I am tremendously grateful to my beloved husband Professor Kamran Eshraghian, for 
giving me his unconditional love and support, for his tireless efforts, his endless enthusiasm 
and strong belief in me, his constant motivation and for always finding time amidst his 
extremely hectic schedule to alleviate any problems, answer questions and provide 
directions. 
My appreciations also go to my children Ashkaan, Omid, Elham, Natasha and Kamran Jnr. 
for being patient and sensitive, and putting up with at times a much occupied Mum. 
Ashkaan in moments of desperation helped enormously with his little brother and sister and 
provided me with "sustenance". I would also like to thank my Mum, Manijeh Heshmat, for 
her love, support and belief in me, and for always teaching me to reach my best. I thank my 
brother Arjang Pirmorady for all his support and for coming to my rescue whenever "the 
going got tough". I am inspired by the thoughts of the love and encouragement given to me 
by my late father, Dr Tiamoor Pirmorady. Also one of the major motivators for the 
completion of my thesis is my uncle, Mostafa Heshmat, whose loving thoughts are always 
with me. 
During the course of this research, I have had the privilege of working at Ulm University, 
in Germany, and I would like to express my deepest appreciation to Professor Hans-Jorg 
Pfleiderer for allowing me to work in his department and for creating a wonderful and 
motivating environment at Ulm University, facilitating some of the major developments 
within this research. I wish to express my warmest gratitude to Mrs Erika Pfleiderer for 
being a wonderful friend and support during my stay in Ulm. My appreciations also go to 
my colleagues, Mr Oliver Pfander, Dr Fang, Mrs Hoofer, and all the technical staff at Ulm 
University for their input and support. 
The help and contribution of Professor Steve Kung, Dr Amin Bermak, Dr Alex Rassau, Mr 
David Lucas, Mr Seung-Min Lee, and Mr Greg Yu, is greatly appreciated. 
Sholeh Eshraghian 
Abstract 
The invention of the transistor in 1947 at Bell Laboratories revolutionised the electronics 
industry and created a powerful platform for the emergence of new industries. The quest to 
increase the number of devices per chip over the last four decades has resulted in rapid 
transition from Small-Scale-Integration (SSI) and Large-Scale-Integration (LSI), through to 
the Very-Large-Scale-Integration (VLSI) technologies, incorporating approximately 10 to 
100 million devices per chip. The next phase in this evolution is the Ultra-Large-Scale-Integration 
(ULSI) aiming to realise new application domains currently not accessible to 
CMOS technology. 
Although technology is continuously evolving to produce smaller systems with minimised 
power dissipation, the IC industry is facing major challenges due to constraints on power 
density (W/cm²) and high dynamic (operating) and static (standby) power dissipation. 
Mobile multimedia communication and optical based technologies have rapidly become a 
significant area of research and development challenging a variety of technological fronts. 
The future emergence of 4G (4th Generation) wireless communications networks is further 
driving this development, requiring increasing levels of media rich content. The processing 
requirements for capture, conversion, compression, decompression, enhancement and 
display of higher quality multimedia, place heavy demands on current ULSI systems. This 
is also apparent for mobile applications and intelligent optical networks where silicon chip 
area and power dissipation become primary considerations. In addition to the requirements 
for very low power, compact size and real-time processing, the rapidly evolving nature of 
telecommunication networks means that flexible soft programmable systems capable of 
adaptation to support a number of different standards and/or roles become highly desirable. 
In order to fully realise the capabilities promised by the 4G and supporting intelligent 
networks, new enabling technologies are needed to facilitate the next generation of personal 
communications devices. 
Most of the current solutions to meet these challenges are based on various implementations 
of conventional architectures. For decades, silicon has been the main platform of 
computing; however, it is slow, bulky, runs too hot, and is too expensive. Thus, new 
approaches to architectures, driving multimedia and future telecommunications systems, are 
needed in order to extend the life cycle of silicon technology. 
The emergence of Truly Deep Submicron Technology (TDST) and related 3-D 
interconnection technologies have provided potential alternatives from conventional 
architectures to 3-D system solutions, through integration of TDST, Vertical Software 
Mapping and Intelligent Interconnect Technology (IIT). The concept of Soft-Chip 
Technology (SCT) entails integration of "Soft-Processing Circuits" with "Soft-Configurable 
Circuits". This concept can effectively manipulate hardware primitives through vertical 
integration of control and data. Thus the notion of 3-D Soft-Chip emerges as a new design 
algorithm for content-rich multimedia, telecommunication and intelligent networking 
system applications. 
3-D architectures (design algorithms suitable for 3-D soft-chip technology) are driven by three factors. The first is development of new device technology (TDST) that can 
support new architectures with complexities of 100M to 1000M devices. The second is 
development of advanced wafer bonding techniques such as Indium bump and the more 
futuristic optical interconnects for 3-D soft-chip mapping. The third is related to improving 
the performance of silicon CMOS systems as devices continue to scale down in 
dimensions. 
One of the fundamental building blocks of any computer system is the arithmetic 
component. Optimum performance of the system is determined by the efficiency of each 
individual component, as well as the network as a whole entity. Development of 
configurable arithmetic primitives is the fundamental focus in 3-D architecture design where functionality can be implemented through soft configurable hardware elements. 
Therefore the ability to improve the performance capability of a system is of crucial 
importance for a successful design. Important factors that predict the efficiency of such arithmetic components are: 
• The propagation delay of the circuit, caused by the gate, diffusion and wire 
capacitances within the circuit, minimised through transistor sizing, and 
• Power dissipation, which is generally based on node transition activity. [2] 
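The two factors above can be illustrated with first-order textbook models. The sketch below is not taken from the thesis: it assumes the standard switching-power expression (activity factor times load capacitance times supply voltage squared times clock frequency) and a lumped RC estimate of the 50% propagation delay. All function names, parameter names and numerical values are hypothetical and chosen only for illustration.

```python
import math


def dynamic_power(alpha, c_load, vdd, freq):
    """Switching power in watts (assumed model): alpha * C * Vdd^2 * f.

    alpha  -- node transition activity factor (0..1), hypothetical input
    c_load -- switched capacitance in farads
    vdd    -- supply voltage in volts
    freq   -- clock frequency in hertz
    """
    return alpha * c_load * vdd ** 2 * freq


def rc_delay(r_drive, c_gate, c_diff, c_wire):
    """First-order 50% propagation delay in seconds (assumed lumped RC model).

    The gate, diffusion and wire capacitances are summed and driven
    through an equivalent driver resistance r_drive (ohms).
    """
    return math.log(2) * r_drive * (c_gate + c_diff + c_wire)


# Illustrative values only (not measurements from the thesis).
power = dynamic_power(alpha=0.1, c_load=10e-15, vdd=1.2, freq=1e9)
delay = rc_delay(r_drive=10e3, c_gate=2e-15, c_diff=1e-15, c_wire=2e-15)
print(f"dynamic power: {power:.3e} W, delay: {delay:.3e} s")
```

In this simplified picture, widening a transistor lowers the driver resistance but raises the gate capacitance seen by the previous stage, which is why sizing is a trade-off rather than a monotonic improvement.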
Although optimum performance of 3-D soft-chip systems is primarily established by the choice of basic primitives such as adders and multipliers, the interconnecting network also has a significant degree of influence on the efficiency of the system. 3-D superposition of devices can decrease interconnect delays by up to 60% compared to a similar planar architecture. 
This research is based on development and implementation of configurable arithmetic primitives, suitable to the 3-D architecture, and has these foci: 
• To develop a variety of arithmetic components such as adders and multipliers with 
particular emphasis on minimum area and compatibility with the 3-D soft-chip design paradigm. 
• To explore implementation of configurable distributed primitives for arithmetic processing. This entails optimisation of basic primitives, and using them as part of 
array processing. 
In this research the detailed designs of configurable arithmetic primitives are implemented using TDST 0.13µm (130nm) technology, utilising CAD software such as Mentor Graphics 
and Cadence in Custom design mode, carrying through design, simulation and verification steps. 
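The idea of using an optimised basic primitive as part of a larger array can be sketched behaviourally. The code below is a minimal illustration under stated assumptions, not the circuit developed in the thesis: it models a hypothetical 4-bit unsigned multiplier primitive (`mul4`) as AND-gated partial products summed with shifts, and then composes wider unsigned multiplications from that primitive by splitting each operand into 4-bit slices. The names `mul4` and `mul_wide` are illustrative only.

```python
def mul4(a, b):
    """Hypothetical 4-bit x 4-bit unsigned multiplier primitive.

    Each bit of b gates (ANDs) a shifted copy of a; the partial
    products are then accumulated, as in a simple array multiplier.
    """
    assert 0 <= a < 16 and 0 <= b < 16
    product = 0
    for i in range(4):
        if (b >> i) & 1:          # AND gating of the partial product
            product += a << i     # shifted accumulation
    return product


def mul_wide(a, b, width):
    """Unsigned width-bit multiply built only from mul4 primitives.

    width must be a multiple of 4. Operands are split into 4-bit
    slices; every pair of slices is multiplied by the primitive and
    the result is added at the matching bit offset.
    """
    assert width % 4 == 0
    assert 0 <= a < (1 << width) and 0 <= b < (1 << width)
    slices = width // 4
    result = 0
    for i in range(slices):
        a_slice = (a >> (4 * i)) & 0xF
        for j in range(slices):
            b_slice = (b >> (4 * j)) & 0xF
            result += mul4(a_slice, b_slice) << (4 * (i + j))
    return result


# An 8-bit multiplication uses four 4-bit primitive products.
print(mul_wide(0xAB, 0xCD, 8) == 0xAB * 0xCD)
```

The same function covers 16-bit or 32-bit word lengths by changing `width`, which mirrors the configurable word-length goal at the behavioural level only; it says nothing about area, timing or layout.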
Publications 
K. Eshraghian, S. Lachowicz, X. Zhao, S. Eshraghian, and A. Osseiran, "Test for Research 
Engineering: The World's First Remotely Accessible Tele-testing Facility", 12th Asian 
Test Symposium, China, November 17-19, 2003. 
S. Eshraghian, S. Lachowicz, and K. Eshraghian, "Ultra High Bandwidth Image and Data 
Processing Using 3-D Vertically Integrated Architectures", 7th World Multiconference on 
Systemics, Cybernetics and Informatics (SCI2003), Orlando, Florida, USA, July 27-30, 
2003. 
S. Eshraghian, S. Lachowicz, K. Eshraghian, "3-D Vertically Integrated Configurable Soft-Chip with Terabit Computational Bandwidth for Image and Data Processing", Proceedings 
10th International Conference Mixed Design of Integrated Circuits and Systems 
MIXDES'2003, Lodz, Poland, June 26-28, 2003. 
K. Eshraghian, S. Lachowicz, S. Eshraghian, A. Osseiran, "Networked Teletesting Facility 
for Integrated Systems", Semiconductor Equipment and Materials International Conference 
SEMCON'2003, Singapore, August 12-14, 2003. 
A. Rassau, K. Eshraghian, S. Eshraghian, S. Lachowicz, A. Ehrhardt, Y. Nemirovski, R. 
Ginasor, "3-D Soft-Chip Configurable Array Processor for Multimedia and Communications", Proceedings, 5th International Conference on High-Speed Networks and 
Multimedia Communications HSNMC'02, Jeju Island, Korea (keynote paper) July 3-5, 
2002. 
K. Eshraghian, S. Lachowicz, and S. Eshraghian, "The Networked Tele-test Facility for 
Integrated Systems in Australia", Proceedings, 9th International Conference Mixed Design 
of Integrated Circuits and Systems MIXDES'2002, Poland, June 20-22, 2002, pp. 689-694. 
K. Eshraghian, S. Lachowicz, and S. Eshraghian, "Australian National Networked Tele-test 
Facility for Integrated Systems", Electronics and Structures for MEMS II, Bergmann N., 
Editor, Proceedings of SPIE, volume 4591, pp 22-27, 2001. 
Contents 
Declaration 
Acknowledgments 
Abstract 
Publications 
List of Tables 
List of Figures 
Chapter 1 Integrated Silicon Systems 
1.0 Background 
1.1 Optical Switching 
1.2 Opto-VLSI Optical Switch Implementation 
1.3 Computer Generated Phase Hologram (CGH) 
1.4 Summary 
Chapter 2 3-D Soft Chip Paradigm 
2.0 Summary 
2.1 Soft-Chip Technology 
2.2 3-D Implementation Strategy 
2.3 2-D to 3-D Configurable Array Architectural Transformation 
2.4 Indium Bump Vertical Interconnects 
2.5 3-D Soft-Chip Architecture 
2.6 Configurable and Scalable ALU Cell 
2.6.1 Generic 4-Bit ALU 
2.7 System Design Cycle 
2.8 Conclusions 
Chapter 3 Evaluation of Adder and Multiplier Designs 
3.0 Summary 
3.1 Adders 
3.1.1 Ripple Carry Adder (RCA) 
3.1.2 Carry Save Adder (CSA) 
3.1.3 Carry Bypass Adders (CBA) 
3.1.4 Carry Skip Adder (CKA) 
3.1.5 Carry Look-Ahead Adder (CLA) 
3.1.6 Carry Select Adder (CSA) 
3.1.7 Brent-Kung Adder 
3.2 Multiplication Algorithms 
3.2.1 Multiplier Design Through Coefficient Optimization 
3.2.2 Tree Based Multiplier 
3.2.3 Wallace-Tree Multiplier 
3.2.4 Dadda Multiplier 
3.2.5 Booth Encoding Multiplier 
3.2.6 Braun Multiplier 
3.3 Configurable Multiplier Array 
3.3.1 Simple Chain Array Algorithms 
3.3.2 Array Design with Periphery Multiplexer Per Cell 
3.3.3 Array Design with Periphery Multiplexer Per Array 
3.4 Conclusions 
Chapter 4 Logic Styles Mapping 
4.0 Summary 
4.1 Literature Review for Logic Styles 
4.1.1 Comparison of Logic Styles 
4.2 XOR/XNOR Logic Styles 
4.3 Classic Static CMOS Logic Styles 
4.3.1 Direct Adder Implementation 
4.3.2 Symmetric Adder Implementation 
4.3.3 Multiplexer Based Adder Design 
4.3.4 Transmission Gate Based Adder 
4.3.5 SPL and SPLTV Based Adder 
4.4 Conclusions 
Chapter 5 Scalable Serial/Parallel Multiplication Primitives 
5.0 Summary 
5.1 Configurable Serial/Parallel Multiplication Algorithm 
5.2 Logical Architecture of Proposed Configurable Serial/Parallel Multipliers 
5.3 Layout Implementation of Serial/Parallel Multiplier 
5.3.1 Layout of 1-Bit Serial/Parallel Multiplier 
5.3.2 Register Circuit Layout 
5.3.3 Multiplexer Circuit Layout 
5.3.4 Bit Slice Layout 
5.3.5 4/8/12/16/20/24/28/32-Bit Clocked Serial/Parallel Multiplier Layout 
5.4 Conclusions 
Chapter 6 Scalable Parallel/Parallel Multiplier Design 
6.0 Summary 
6.1 Decision Issues 
6.2 Unsigned Multiplication 
6.2.1 Symmetric Based Multiplier 
6.2.1.1 1-Bit Symmetric Based Multiplier Circuit 
6.2.1.2 2-Bit Symmetric Based Multiplier Circuit 
6.2.1.3 4-Bit Symmetric Based Multiplier Circuit 
6.2.1.4 8-Bit Symmetric Based Multiplier Circuit 
6.2.2 Multiplexer Based Multiplier Circuit 
6.2.2.1 1-Bit Multiplexer Based Multiplier Circuit 
6.2.2.2 2-Bit Multiplexer Based Multiplier Circuit 
6.2.2.3 4-Bit Multiplexer Based Multiplier Circuit 
6.2.2.4 8-Bit Multiplexer Based Multiplier Circuit 
6.2.3 Multiplier Design with Periphery Multiplexer Per Bit 
6.2.3.1 1-Bit Multiplier with Periphery Multiplexer Per Bit 
6.2.3.2 2-Bit Multiplier with Periphery Multiplexer Per Bit 
6.2.3.3 4-Bit Multiplier with Periphery Multiplexer Per Bit 
6.2.3.4 8-Bit Multiplier with Periphery Multiplexer Per Bit 
6.2.4 Multiplier Design with Periphery Multiplexer Per Array 
6.2.4.1 2-Bit Multiplier Circuit with Peripheral Multiplexers Per Array 
6.2.4.2 4-Bit Multiplier Circuit with Peripheral Multiplexers Per Array 
6.2.4.3 8-Bit Multiplier Circuit with Peripheral Multiplexers Per 4-Bit Array 
6.2.5 Transmission Gate (TG) Based Multiplier Circuit 
6.2.5.1 1-Bit TG Based Multiplier Circuit 
6.2.5.2 2-Bit TG Based Multiplier Circuit 
6.2.5.3 4-Bit TG Based Multiplier Circuit 
6.2.5.4 8-Bit TG Based Multiplier Circuit 
6.3 Signed Multiplication 
6.3.1 Signed Multiplexer Based Multipliers 
6.3.1.1 2-Bit 2's Complement Multiplexer Based Multiplier Circuit 
6.3.1.2 4-Bit 2's Complement Multiplexer Based Multiplier Circuit 
6.3.1.3 8-Bit 2's Complement Multiplexer Based Multiplier Circuit 
6.3.2 Signed TG Based Multiplier Circuit 
6.3.2.1 2-Bit 2's Complement TG Based Multiplier Circuit 
6.3.2.2 4-Bit 2's Complement TG Based Multiplier Circuit 
6.3.2.3 8-Bit 2's Complement TG Based Multiplier Circuit 
6.3.3 An Alternative Algorithm for Implementing Signed Multiplication 
6.3.3.1 Multiplier Circuit with NAND Capability 
6.3.3.2 Improved 4-Bit 2's Complement Multiplier Circuit 
6.3.3.3 Improved 8-Bit 2's Complement Multiplier Circuit 
6.4 Observations and Comparison 
6.5 Projection of Technology into 0.09µm (90nm) Process 
6.6 Conclusions 
Chapter 7 Implementation of Configurable ALU Architecture and Future Direction 
7.0 Summary 
7.1 ALU Implementation 
7.1.1 Comparator 
7.1.2 AND-OR Circuit 
7.1.3 ALU Pixel Floor-Plan 
7.2 Future Direction and Implementation Strategy 
7.3 Projection of Technology 
7.4 Conclusions 
Chapter 8 Conclusion 
8.1 Overview 
8.2 Discussions 
8.3 Future directions 
Reference 
Appendix A Parallel/Parallel Multiplier Schematics and Simulation Results 
A.0 Summary 
A.1 Unsigned Multipliers 
A.1.1 Unsigned Symmetric Based Multipliers 
A.1.1.1 1-Bit Symmetric Based Multiplier Circuit 
A.1.1.2 2-Bit Symmetric Based Multiplier Circuit 
A.1.1.3 4-Bit Symmetric Based Multiplier Circuit 
A.1.1.4 8-Bit Symmetric Based Multiplier Circuit 
A.1.2 Unsigned Multiplexer Based Multipliers 
A.1.2.1 1-Bit Multiplexer Based Multiplier Circuit 
A.1.2.2 2-Bit Multiplexer Based Multiplier Circuit 
A.1.2.3 4-Bit Multiplexer Based Multiplier Circuit 
A.1.2.4 8-Bit Multiplexer Based Multiplier Circuit 
A.1.3 Unsigned Multiplexer Based Multiplier Design with Periphery Multiplexers Per Cell 
A.1.3.1 Vertical Multiplexer Circuit 
A.1.3.2 Horizontal Multiplexer Circuit 
A.1.3.3 1-Bit Multiplier with Periphery Multiplexers Per Bit 
A.1.3.4 2-Bit Multiplier with Periphery Multiplexers Per Bit 
A.1.3.5 4-Bit Multiplier with Periphery Multiplexers Per Bit 
A.1.3.6 8-Bit Multiplier with Periphery Multiplexers Per Bit 
A.1.4 Unsigned Multiplexer Based Multiplier Design with Periphery Multiplexers Per Array 
A.1.4.1 2-Bit Multiplier Circuit with Peripheral Multiplexers Per Array 
A.1.4.2 4-Bit Multiplier Circuit with Peripheral Multiplexers Per Array 
A.1.4.3 8-Bit Multiplier Circuit with Peripheral Multiplexers Per 4-Bit Array 
A.1.5 Unsigned Transmission-Gate (TG) Based Multiplier Circuit 
A.1.5.1 1-Bit TG Based Multiplier Circuit 
A.1.5.2 2-Bit TG Based Multiplier Circuit 
A.1.5.3 4-Bit TG Based Multiplier Circuit 
A.1.5.4 8-Bit TG Based Multiplier Circuit 
A.2 Signed Multiplication 
A.2.1 Signed Multiplexer Based Multipliers 
A.2.1.1 2-Bit 2's Complement Multiplier Circuit 
A.2.1.2 4-Bit 2's Complement Multiplier Circuit 
A.2.1.3 8-Bit 2's Complement Multiplier Circuit 
A.2.2 Signed TG Based Multipliers 
A.2.2.1 2-Bit TG Based 2's Complement Multiplier Circuit 
A.2.2.2 TG Based 4-Bit 2's Complement Multiplier Circuit 
A.2.2.3 8-Bit 2's Complement TG Based Multiplier Circuit 
A.2.3 An Alternative Improved Algorithm for Implementing Signed Multiplication 
A.2.3.1 Improved 4-Bit 2's Complement TG Based Multiplier Circuit 
A.2.3.2 Improved 8-Bit 2's Complement Multiplier Circuit 
Appendix B Circuit Names and Descriptions 
List of Tables 
1.1 Technology Road Map 
4.1 Variety of Logic Styles 
4.2 Comparison of XOR Gates 
4.3 Power Dissipation Elements 
4.4 Power-Delay Product with 0.01 pF Load 
5.1 Configuration For Variable Word Length Multiplications up to 32-bit 
5.2 8-Input Multiplexer Behaviour 
6.1 Behavioural Description of (a) AND Gate (b) Full Adder Circuit 
6.2 Behavioural Description of Mux&Mult Circuit 
6.3 Behavioural Description of the Vertical Multiplexer Circuit 
6.4 Behavioural Description of Horizontal Multiplexer Circuit 
6.5 Summary and Comparison of the Basic 1-Bit Multiplier Structures 
6.6 Summary and Comparison of Area for Array of Unsigned Multiplier Structures 
6.7 Summary and Comparison of Area for Signed Multiplication Using 0.13µm, 1.2V Process 
6.8 Summary and Comparison of Area for Unsigned Serial/Parallel and Parallel/Parallel Multiplication Using 0.13µm, 1.2V Process 
6.9 Estimated Area Projection for Multiplier Configurations, from 0.13µm 1.2V to 0.09µm 1V Process Technology 
List of Figures 
1.1 3G and 4G Capabilities 
1.2 An Example of Optical Interconnection Link 
1.3 Trends on Transmission Capacity Technology 
1.4 A Generic Optical Switch Within the Framework of an Optical Network 
1.5 The Physical Architecture for Configurable Optical Switch 
1.6 The Opto-VLSI Processor Structure 
1.7 Cross-Section of Opto-VLSI Chip for Beam Steering 
1.8 Opto-VLSI Processor and Implementation on Silicon 
1.9 Schematic of a Generalised Two-Dimensional Holographic Interconnect 
1.10 A CGH Pattern and its Fourier Transform 
2.1 3-D Soft Chip Platform 
2.2 2-D Architecture for Configurable Array Processor 
2.3 Mapping of Single 2-D Chip into Two Vertically Integrated 3-D Chips 
2.4 3-D Soft-Chip Physical Architecture 
2.5 Indium Bump Process Flow 
2.6 Single 15µm Indium Bump after Reflow 
2.7 Floor-Plan of CAP Chip (Lower Chip) 
2.8 Interconnect Strategy for 3-D Soft Chip 
2.9 Logical Architecture for a Generic 4-Bit ALU 
2.10 System Design Cycle Flow Chart 
3.1 Block Diagram of a Ripple Carry Adder (CRA) 
3.2 Block Diagram of a Carry Save Adder (CSA) 
3.3 4-Bit Carry Save Adder 
3.4 Carry Look-Ahead Generator Cell (CLG) 
3.5 Brent-Kung Carry Look-Ahead Adder 
3.6 Modified Full Adder Cell 
3.7 Power Dissipation of Modified Array Multiplier as a Function of the Supply Voltage (0.35µm CMOS Process Technology) 
3.8 Tree Based Multiplication 
3.9 A 6-Bit Wallace-Tree Multiplier 
3.10 6-Bit Dadda Multiplier 
3.11 3-Bit Braun Multiplier Block Diagram 
3.12 4x4 Chain Array 
3.13 4x4 Array Design with Periphery Multiplexer Per Cell 
3.14 4x4 Array Design with Periphery Multiplexer Per Array 
4.1 (a) XOR/XNOR Logic and (b) CMOS Implementation 
4.2 (c,d) CPL Implementation of XOR/XNOR Circuits 
4.3 DPL Implementation of XOR/XNOR Circuits 
4.4 Implementation of 4 Transistor XOR/XNOR Circuits 
4.5 Groundless XOR/XNOR Gates 
4.6 Dissipated Power for Various XOR Circuit Configurations 
4.7 Full Adder Block Diagram 
4.8 Symmetric Adder Schematic 
4.9 1-Bit Multiplexer Based Full Adder (MUXA1) 
4.10 MUXA2 Design Variation 
4.11 The MUXA3 Design 
4.12 The MUXA4 Design 
4.13 2-Input CMOS Multiplexer Using Pass-Gate CMOS Logic 
4.14 Multiplexer Based Adder 
4.15 1-Bit TG Based Adder Schematic 
4.16 Single Phase Logic (SPL) Based Adder 
4.17 SPL and SPLTV 
4.18 SPL Comparison of Single Voltage SPL and Two Voltage SPL (SPLTV) 
4.19 Power and Delay Characteristics of SPL, SPLTV and CPL Full Adders (0.35µm, 3.3V Process Technology) 
4.20 Power and Delay of a Full Adder Using SPL and CPL for 0.18µm, 1.8V Process Technology 
5.1 Configurable 4/8/16-Bit Serial/Parallel Multiplier 
5.2 8-Bit Array Multiplication Implemented Using 4-Bit Multiplier Primitive 
5.3 16-Bit Array Multiplication Configuration 
5.4 32-Bit Array Multiplication Configuration 
5.5 64-Bit Array Multiplication Configuration 
5.6 16/32/64-Bit Serial/Parallel Multiplier Configuration 
5.7 Logical Architecture for a Configurable 4/8/12/16/20/24/28/32-Bit Serial/Parallel Multiplier 
5.8 1-Bit Multiplier Block Diagram 
5.9 1-Bit Multiplier Layout Using 0.13µm Process and 1.2V Supply 
5.10 1-Bit Multiplier Simulation 
5.11 Register Block Diagram 
5.12 Register Layout Using 0.13µm Process and 1.2V Supply 
5.13 Multiplexer Logical Architecture 
5.14 Multiplexer Layout Using 0.13µm Process and 1.2V Supply 
5.15 8-Input Multiplexer Block Diagram 
5.16 8-Input Multiplexer (a) Schematic, (b) Layout 
5.17 4-Bit Multiplier with Registers and Capability for Word Expansion 
5.18 4-Bit Clocked Serial/Parallel Multiplier Layout in 0.13µm, 1.2V Process 
5.19 Hspice Simulation for 4-Bit Serial/Parallel Multiplier 
5.20 Layout for 4/8/12/16/20/24/28/32-Bit Configurable Serial/Parallel Multiplier in 0.13µm, 1.2V Supply Process 
5.21 Simulation for 4-Bit Multiplication 
5.22 Simulation for 32-Bit Multiplication 
6.1 1-Bit Symmetric Based Multiplier Block Diagram 
6.2 AND Circuit Schematic 
6.3 Symmetric Full Adder Schematic 
6.4 1-Bit Symmetric Based Multiplier (a) Schematic (b) Layout 
6.5 1-Bit Symmetric Based Multiplier Simulation Result 
6.6 Logical Architecture for 2-Bit Symmetric Based Multiplier 
6.7 Layout for 2-Bit Symmetric Based Multiplier 
6.8 Logical Architecture for 4-Bit Symmetric Based Multiplier 
6.9 4-Bit Symmetric Based Multiplier Layout 
6.10 8-Bit Symmetric Based Multiplier (a) Schematic (b) Layout 
6.11 1-Bit Multiplexer Based Multiplier Block Diagram 
6.12 1-Bit Multiplexer Based Multiplier (a) Schematic (b) Layout 
6.13 1-Bit Multiplexer Based Multiplier Simulation Results 
6.14 2-Bit Multiplexer Based Multiplier Layout 
6.15 4-Bit Multiplexer Based Multiplier Layout 
6.16 8-Bit Multiplexer Based Multiplier (a) Schematic (b) Layout 
6.17 1-Bit Multiplier with Periphery Multiplexer Per Bit Block Diagram 
6.18 Vertical Multiplexer Block Diagram 
6.19 Vertical Multiplexer (a) Schematic (b) Layout 
6.20 Vertical Multiplexer Simulation Results 
6.21 Horizontal Multiplexer Block Diagram 
6.22 Horizontal Multiplexer (a) Schematic (b) Layout 
6.23 Horizontal Multiplexer Simulation Results 
6.24 1-Bit Multiplier with Periphery Multiplexer Per Bit (a) Schematic (b) Layout 
6.25 1-Bit Multiplier with Periphery Multiplexer Per Bit Simulation 
6.26 2-Bit Multiplier with Periphery Multiplexer Per Bit Block Diagram 
6.27 2-Bit Multiplier with Periphery Multiplexer Per Bit Layout 
6.28 4-Bit Multiplier with Periphery Multiplexer Per Bit Block Diagram 
6.29 4-Bit Multiplier with Periphery Multiplexer Per Bit Layout 
6.30 8-Bit Multiplier with Periphery Multiplexer Per Bit (a) Schematic (b) Layout 
6.31 2-Bit Multiplier with Peripheral Multiplexers Per Array Block 
6.32 2-Bit Multiplier with Peripheral Multiplexers Per Array (a) Schematic (b) Layout 
6.33 2-Bit Multiplier with Peripheral Multiplexers Per Array Simulation 
6.34 4-Bit Multiplier with Peripheral Multiplexers Per Array 
6.35 4-Bit Multiplier with Peripheral Multiplexers Per Array Layout 
6.36 8-Bit Multiplier with Peripheral Multiplexers Per Array (a) Schematic (b) Layout 
6.37 1-Bit TG Based Multiplier (a) Schematic (b) Layout 
6.38 1-Bit TG Based Multiplier Simulation Results 
6.39 2-Bit TG Based Multiplier Block Diagram 
6.40 2-Bit TG Based Multiplier Layout 
6.41 4-Bit TG Based Multiplier Block Diagram 
6.42 4-Bit TG Based Multiplier Layout 
6.43 8-Bit TG Based Multiplier (a) Schematic (b) Layout 
6.44 2-Bit 2's Complement Multiplexer Based Multiplier Block Diagram 
6.45 Internal Multiplexer (a) Schematic (b) Layout 
6.46 Internal Multiplexer Simulation Results 
6.47 2-Bit 2's Complement Multiplexer Based Multiplier (a) Schematic (b) Layout 
6.48 2-Bit 2's Complement Multiplexer Based Multiplier Simulation 
6.49 4-Bit 2's Complement Multiplexer Based Multiplier Block Diagram 
6.50 4-Bit 2's Complement Multiplexer Based Multiplier Layout 
6.51 8-Bit 2's Complement Multiplexer Based Multiplier (a) Schematic (b) Layout 
6.52 2-Bit 2's Complement TG Based Multiplier Block Diagram 
6.53 2-Bit 2's Complement TG Based Multiplier (a) Schematic (b) Layout 
6.54 2-Bit 2's Complement TG Based Multiplier Simulation Results 
6.55 4-Bit 2's Complement TG Based Multiplier Block Diagram 
6.56 4-Bit 2's Complement TG Based Multiplier Layout 
6.57 8-Bit 2's Complement TG Based Multiplier (a) Schematic (b) Layout 
6.58 Improved Algorithm for Signed Multiplication 
6.59 Inverter (a) Schematic (b) Layout 
6.60 Inverter Simulation Results 
6.61 Multiplexed Half Adder Circuit Block Diagram 
6.62 Multiplexed Half-Adder (a) Schematic (b) Layout 
6.63 Multiplexed Half-Adder Simulation Results 
6.64 Block Diagram of Multiplier Circuit with NAND Capability 
6.65 Multiplier Circuit with NAND Capability (a) Schematic (b) Layout 
6.66 Multiplier Circuit with NAND Capability Simulation Results 
6.67 Improved 4-Bit 2's Complement Multiplier Block Diagram 
6.68 Improved 4-Bit 2's Complement Multiplier (a) Schematic (b) Layout 
6.69 Improved 4-Bit 2's Complement Simulation Results 
6.70 Improved 8-Bit 2's Complement Multiplier (a) Schematic (b) Layout 
6.71 Projection Characteristics for 130nm to 90nm Process 
7.1 Adder-Subtracter Block Diagram 
7.2 Full Adder Block Diagram 
7.3 Full Adder (a) Schematics (b) Layout 
7.4 Full Adder Simulation Result 
7.5 AND-OR (a) Schematics, (b) Layout 
7.6 AND-OR Simulation Result 
7.7 4-Bit Comparator (a) Schematics (b) Layout 
7.8 4-Bit Comparator Simulation Result 
7.9 4-Bit ALU Pixel Floor-Plan (~40µm x 40µm) 
7.10 4-Bit ALU Pixel (~40µm x 40µm) Using 0.13µm Technology 
7.11 Proposed 3-D Soft Chip Floor-Plan 
7.12 Physical Implementation of 3-D Beam Steering Switch 
A.1 AND Circuit Schematic 
A.2 Symmetric Full Adder Circuit Schematic 
A.3 1-Bit Symmetric Based Multiplier (a) Schematic (b) Simulation 
A.4 2-Bit Symmetric Based Multiplier (a) Schematic (b) Simulation 
A.5 4-Bit Symmetric Based Multiplier (a) Schematic (b) Simulation 
A.6 8-Bit Symmetric Based Multiplier (a) Schematic (b) Simulation 
A.7 Multiplexer Based 1-Bit Multiplier (a) Schematic (b) Simulation 
A.8 2-Bit Multiplexer Based Multiplier (a) Schematic (b) Simulation 
A.9 4-Bit Multiplexer Based Multiplier (a) Schematic (b) Simulation 
A.10 8-Bit Multiplexer Based Multiplier (a) Schematic (b) Simulation 
A.11 Vertical Multiplexer (a) Schematic (b) Simulation 
A.12 Horizontal Multiplexer (a) Schematic (b) Simulation 
A.13 1-Bit Multiplier with Periphery Multiplexers Per Bit (a) Schematic (b) Simulation 
A.14 2-Bit Multiplier with Periphery Multiplexers Per Bit (a) Schematic (b) Simulation 
A.15 4-Bit Multiplier with Periphery Multiplexers Per Bit (a) Schematic (b) Simulation 
A.16 8-Bit Multiplier with Periphery Multiplexers Per Bit (a) Schematic (b) Simulation 
A.17 2-Bit Multiplier with Peripheral Multiplexers Per Array (a) Schematic (b) Simulation 
A.18 4-Bit Multiplier Circuit with Peripheral Multiplexers Per Array (a) Schematic (b) Simulation 
A.19 8-Bit Multiplier Circuit with Peripheral Multiplexers Per 4-Bit Array Schematic 
A.20 1-Bit TG Based Multiplier (a) Schematic (b) Simulation 
A.21 2-Bit TG Based Multiplier (a) Schematic (b) Simulation 
A.22 4-Bit TG Based Multiplier (a) Schematic (b) Simulation 
A.23 8-Bit TG Based Multiplier (a) Schematic (b) Simulation 
A.24 Internal Multiplexer (a) Schematic (b) Simulation 
A.25 2-Bit 2's Complement Multiplier (a) Schematic (b) Simulation 
A.26 4-Bit 2's Complement Multiplier (a) Schematic (b) Simulation 
A.27 8-Bit 2's Complement Multiplier (a) Schematic (b) Simulation 
A.28 2-Bit 2's Complement TG Based Multiplier (a) Schematic (b) Simulation 
A.29 4-Bit 2's Complement TG Based Multiplier (a) Schematic (b) Simulation 
A.30 4-Bit 2's Complement TG Based Multiplier (a) Schematic (b) Simulation 
A.31 Inverter (a) Schematic (b) Simulation 
A.32 Multiplexed Half-Adder (a) Schematic (b) Simulation 
A.33 1-Bit 2's Complement TG Based Multiplier (a) Schematic (b) Simulation 
A.34 Improved 4-Bit 2's Complement Multiplier (a) Schematic (b) Simulation 
A.35 Improved 8-Bit 2's Complement Multiplier (a) Schematic (b) Simulation 
Chapter 1 
Integrated Silicon Systems 
1.0 Background 
During the early 1980s, the emergence of VLSI technology provided unimaginable 
opportunities in the realisation of novel silicon CMOS architectures, creating a paradigm 
shift in the formulation and building of new systems. 
Throughout this evolution, the measure of progress has been determined by the number of 
devices per chip, the size of the chip and the process technology used within. The focus has 
been to produce smaller, faster, more reliable and less expensive systems which consume 
less power. It is expected that this trend will have matured somewhere between 2013 and 
2016, as shown in Table 1.1 [3,4,5]. 
Table 1.1 Technology Road Map 
Year          1998  2001  2004  2007  2010  2013  2016 
Feature (nm)   180   130    90    70    50    35    20 
Voltage (V)    1.8   1.2   1.2   0.9   0.6   0.6     ? 
As a result of such progress in technology, multimedia, wired communication, wireless 
communication and networking technologies have become a significant aspect of research. 
Various technological implementations including that of silicon CMOS and the emerging 
photonics arena are constantly being challenged. The emergence of new and complex 
optical and photonics technologies with complex broadband connectivity and processing 
requirements for capture, conversion, compression, decompression, enhancement and 
display of high quality multimedia content place heavy demands on standard 2-D VLSI 
systems. Added requirements such as low power, almost "zero cost", software 
programmability, need for high volume production and low design time, compound the 
problem. The key to overcome these challenges lies in improvements in material and 
manufacturing processes as well as new approaches in architectures and design [6]. 
With the evolution of the Internet the demand for broadband access has reached remarkable 
heights. The current telecommunication technology is capable of providing many forms of 
data communication across fixed links including ISDN connections and the Internet. 
However, the area of mobile or wireless communication still has significant limitations due 
to bandwidth and hardware complexity constraints. The main problem in realising such a 
system is the lack of suitable real-time image/video communication for small portable 
multi-purpose devices that have sufficiently low power consumption and are cost effective. 
Furthermore, the new generation multimedia and telecommunication systems such as the 
3G (3rd Generation), the future 4G (4th Generation) wireless communications networks, and 
the evolving content-rich multimedia services require systems with higher computational 
throughput together with low area, low power dissipation and real-time processing 
capability [7]. This allows typical applications of 3G and 4G, such as terrestrial positioning 
systems, services for smart residential networks and smart tele-health infrastructures, to be 
implemented effectively. 
Figure 1.1 illustrates a generic road map showing the expectation from 3G and future 4G 
technologies. Conventional service providers face an immense challenge in trying to deliver 
new services due to limited availability of offerings as well as effectively higher costs for 
these services. 
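Figure 1.1 3G and 4G Capabilities 
(Axes: versatility of content and user benefits against time. Stages shown: text; picture 
messaging; text and graphics; multimedia messaging services; digital image and input; new 
content types; intelligent communication. Application labels: mobile multimedia; smart 
homes/hospitals; ambient intelligence.) 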
Standard 2-D VLSI technology is no longer able to satisfy these requirements. New and 
more capable technologies are required to cater for the next generation as well as future 
generations of mobile and high performance processing systems and networks. 
The tremendous growth in Internet traffic loads and the continually increasing demand for 
broadband services have generated a need for increased transmission capacity and 
switching system throughput. Fibre optics technology is a promising solution to 
transmission and switching complexities. Some important characteristics of optical 
interconnects are reduced system size, heat dissipation management, eliminated crosstalk 
resulting in parallelism, better interconnection management, speed and higher transmission 
distance coverage. However, this technology will be truly successful only if a number of 
challenges, such as the realisation of an intelligent optical switch, can be overcome. 
As an example, consider the optical interconnect link shown in Figure 1.2. The distance the 
signal can travel before it must be refreshed or amplified is limited to approximately 
100 km. A conventional amplifier converts the light received from the previous segment into 
an electrical signal and, after reshaping and amplifying it, converts it back to light for 
transmission over the next segment. A laser transmits this optical signal to the 
next optical amplifier receiver. 
Figure 1.2 An Example of Optical Interconnection Link 
(Blocks labelled: electrical signal input; lightwave transmitter; electrical signal output.) 
However, after several amplifications the signal-to-noise ratio becomes very small and the 
signal at the receiver end becomes prone to error, making this unsuitable for long distance 
transmission. Also the bandwidth capacity of the fibre itself is very large, transmitting and 
receiving terabits of information per second, hence better multiplexing capabilities are 
needed than what is available at the moment. 
The trends in transmission capacity and the core technologies necessary for the future 
generation Internet, large scale optical networks and core optical devices are illustrated in 
Figure 1.3. 
Figure 1.3 Trends on Transmission Capacity Technology 
(Axes: transmission capacity, marked from 10M to 25T, against year, 1980 to 2010. One curve 
is labelled "Commercialised System and Technology".) 
To appreciate the significance of this research a brief outline of optical network and optical 
switch technology will be provided in the following section. 
1.1 Optical Switching 
The optical switch is the core of an Optical Cross Connect (OXC) which is able to accept 
various optical carrier rates. The system shown in Figure 1.4 consists mainly of Optical 
Switches (OSWs), Wavelength Multiplexers (WMUXs) and Demultiplexers (WDMXs), 
Optical Pre-Amplifiers (Pre-OAs) and Post-Amplifiers (Post-OAs) [9]. Low loss, small loss 
variations, and low crosstalk characteristics are requirements for the optical switch 
necessary for constructing large-capacity OXC nodes. 
Figure 1.4 A Generic Optical Switch Within the Framework of an Optical Network 
(Block labelled: Monitoring & Control Unit.) 
Figure 1.5 shows the concept for a Reconfigurable Optical Switch that uses Computer 
Generated Phase Holograms (CGH) recorded onto a Liquid Crystal (LC) Spatial Light 
Modulator (SLM), to steer the incoming beam to the designated output port [11,12]. 
Figure 1.5 The Physical Architecture for Configurable Optical Switch 
(Elements labelled: Fourier transform lens; glass substrate; Opto-VLSI; refractor.) 
Phase Holograms are grating structures written onto arrays of pixels using liquid crystal 
electro-optic effects, and are designed to give phase-only modulation of the incident light. 
The basic operation of the switch entails the input fibres being arranged in an array from 
which the light beams are launched and collimated through a Fourier lens onto the 
Opto-VLSI chip. The Opto-VLSI acts as a phase-only diffraction grating with a reconfigurable 
pattern, setting the deflection angle of the reflected beam. By controlling the deflection 
angle (calculation of appropriate coefficients), the beam is made to reflect back into a 
second pixel array, where it is diffracted again onto the selected output port [11]. 
Figure 1.6 shows the floor-plan for the Opto-VLSI beam steering processor based on 
Liquid Crystal on Silicon (LCoSi) processor [16]. Pixels with dimensions of 5µm-10µm are 
grouped into blocks that may range from 64x64 to 128x128 pixels. For example, a 64x64 
OXC will consist of two 8x8 pixel blocks, where each pixel block acts on one beam of 
light. Every pixel is assigned one or more memory elements that store a digital value (1 
memory element for 2 phase array, 2 memory elements for 4 phase array and 3 memory 
elements for 8 phase array). A multiplexer selects one of the input voltages and applies it to 
the aluminium mirror which is usually the top metallisation level. The chip is software­
configured and is capable of controlling multiple fibre ports in one compact Opto-VLSI 
module. 
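The relationship between phase levels and memory elements quoted above (1, 2 and 3 elements for 2, 4 and 8 phase levels) is simply the number of bits needed to address one of the multiplexer inputs. A minimal sketch follows; the function names and the example voltage list are illustrative assumptions and are not taken from the chip design.

```python
def memory_elements(phase_levels):
    """Bits needed per pixel to select one of `phase_levels` input voltages.

    Assumes the number of phase levels is a power of two, as in the
    2-, 4- and 8-phase arrays mentioned in the text.
    """
    if phase_levels < 2 or phase_levels & (phase_levels - 1):
        raise ValueError("phase_levels must be a power of two >= 2")
    return phase_levels.bit_length() - 1


def select_voltage(stored_value, input_voltages):
    """Model of the per-pixel multiplexer: the stored digital value picks
    the voltage that is applied to the aluminium mirror."""
    return input_voltages[stored_value]


# Hypothetical 4-level pixel: two memory elements address four voltages.
voltages = [0.0, 0.4, 0.8, 1.2]  # illustrative values only
print(memory_elements(len(voltages)))
print(select_voltage(0b10, voltages))
```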
Figure 1.6 The Opto-VLSI Processor Structure 
(Labels: pixel, 5µm-10µm; 100x100 pixels.) 
1.2 Opto-VLSI Optical Switch Implementation 
For convenience, the Opto-VLSI chip architecture and the related cross-section showing the 
relationship between the silicon and the liquid crystal are illustrated in Figures 1.7 and 1.8 
[19]. 
Figure 1.7 Cross-Section of Opto-VLSI Chip for Beam Steering 
(Layers labelled: glass cover; Indium Tin Oxide (ITO); liquid crystal; Quarter Wave Plate 
(QWP), for polarisation independence; aluminium mirror; VLSI backplane package; glue seal 
ring.) 
Figure 1.8 Opto-VLSI Processor and Implementation on Silicon 
(Labels: Opto-VLSI processor; Opto-VLSI cell structure; VLSI layer; Si; ITO; QWP.) 
1.3 Computer Generated Phase Hologram (CGH) 
Figure 1.9 illustrates the schematic representation of a 2-D holographic interconnect. Any input 
channel from the LxL input plane can be connected to any channel of the LxL output plane. 
Figure 1.9 Schematic of a Generalised 2-Dimensional Holographic Interconnect 
(Elements, from input to output: LxL input channels; LxL collimating lens array; LxL array 
of sub-holograms; transform lens; LxL output channels.) 
The basic concept of a Computer Generated Hologram (CGH) is illustrated in Figure 1.10. 
The CGH design involves searching and optimising for a combination of the pixel states 
that form a desired pattern for optical beam steering. The far field pattern of the CGH is 
determined by the Fourier transform of the CGH. The Opto-VLSI chip allows the CGH to be 
implemented dynamically. 
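The link between a pixel pattern and its far field can be illustrated with a one-dimensional phase-only grating. The sketch below is an assumption-laden illustration rather than the design method of this thesis: it uses a quantised sawtooth (blazed) phase profile, a plain discrete Fourier transform, and hypothetical function names. It shows that changing the grating period moves the dominant diffraction order, which is the steering mechanism described above.

```python
import cmath
import math


def blazed_grating(n_pixels, period, levels):
    """Quantised sawtooth phase profile: one phase state per pixel."""
    step = 2 * math.pi / levels
    return [math.floor((m % period) / period * levels) * step
            for m in range(n_pixels)]


def far_field_intensity(phases):
    """Normalised far-field intensity, taken as |DFT|^2 of exp(j*phase)."""
    n = len(phases)
    field = [cmath.exp(1j * p) for p in phases]  # phase-only modulation
    intensity = []
    for k in range(n):
        s = sum(field[m] * cmath.exp(-2j * math.pi * k * m / n)
                for m in range(n))
        intensity.append(abs(s) ** 2 / n ** 2)
    return intensity


def steered_order(n_pixels, period, levels):
    """Index of the strongest far-field component for a given grating."""
    intensity = far_field_intensity(blazed_grating(n_pixels, period, levels))
    return max(range(n_pixels), key=intensity.__getitem__)


# Halving the spatial frequency of the grating halves the deflection.
print(steered_order(64, 8, 8))   # dominant order for an 8-pixel period
print(steered_order(64, 16, 8))  # dominant order for a 16-pixel period
```

In this toy model the dominant order index equals the number of grating periods across the array, so the stored pixel states alone determine where the beam is sent.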
Figure 1.10 A CGH Pattern and its Fourier Transform 
For efficient calculation of the coefficients, a processor is needed in close 
proximity to the Opto-VLSI optical beam steering chip that can quickly calculate the 
coefficients for the required hologram. In view of the number of existing algorithms and 
those emerging, the processor will be required to handle different word sizes and needs to 
provide for such functions as multiplication, addition, accumulation, and comparison 
operations - typical characteristics for various transforms [22, 23, 24]. 
1.4 Summary 
The discussions thus far highlight the demand being imposed on technology by new 
applications. It is without doubt that in the future photonic networks will create the basis of 
reliable and flexible Intelligent Broadband Internet systems, facilitating access to vast 
number of multimedia services. 
The requirements of such subsystems are low power, small size, capacity for rapid 
calculation and ability to be reconfigured to cope with future new algorithms. The 
motivation has been based on the implementation of an intelligent Opto-VLSI processor for 
photonic networks, this being the future platform for intelligent optical networks. In order 
to steer beams, appropriate coefficient calculations have to be carried through, off-line or 
for future intelligent systems, on-line. The chip needs to handle different word sizes and be 
able to provide for various arithmetic functions and operations to cater for different 
algorithms. 
The emergence of TDST (0.13µm and below) may be considered as an "enabler" of new 
architectures that were previously considered impossible. 3-D Soft-Chip Technology (SCT) 
is one such solution that utilises TDST to provide a new design platform for content-rich 
multimedia, telecommunication, advanced optical systems and networking. It effectively 
manipulates hardware primitives through integration of control and data. 3-D Soft-Chip 
technology creates a new opportunity for system architects to meet the demands imposed 
by various kinds of adaptive or intelligent systems and networks [25, 26]. 
In the next chapter a novel 3-D architecture is outlined that can meet 
the demands of the proposed systems. The approach deviates from conventional 2-D 
architectures to 3-D systems through implementation of configurable primitives. 
Although detailed design of a complete 3-D system is well beyond the scope of this thesis, 
the research in a systematic manner will create an implementation strategy that will provide 
direction for future research and researchers for ultimate realisation of the system. Chapter 
2 presents the concept of 3-D Soft-Chip technology and related 2-D to 3-D mapping. 
Chapter 3 will review logic styles providing a better insight into full custom 
implementation strategy. Chapter 4 will explore the options that are available in realising 
arithmetic primitives having capability for (i) variable word length (ii) programmability and 
(iii) dynamic reconfigurability. Chapter 5 examines the implementation of scalable 
multiplication primitives and the design of a Serial/Parallel multiplier. Scalable 
Parallel/Parallel multiplier design is discussed in Chapter 6, followed in Chapter 7 by the 
proposed implementation strategy for the Configurable ALU Architecture and the direction 
that technology will potentially follow in the future. 
Chapter 2 
3-D Soft-Chip Paradigm 
2.0 Summary 
Developments in Truly Deep Submicron Technology (TDST) and related progress in 
inter-chip vertical connections such as Indium bump have provided new system solutions 
enabling transformation from conventional 2-D architectures to 3-D systems [3, 4, 5, 28]. 
Future generations of 3-D silicon CMOS will incorporate ultra-thin low power circuits, with 
complex interlayer interconnects, into a cube structure. This approach can provide a unique 
opportunity to address numerous challenges encountered in 2-D CMOS. 
2.1 Soft-Chip Technology 
A promising approach proposed in this research is based on 3-D Soft-Chip Technology, 
which entails integration of "Vertical Interconnect" with "Configurable Circuits" through 
Software Mapping. 
2.2 3-D Implementation Strategy 
Implementation of the 3-D Soft-Chip is described in terms of three related domains: 
• VLSI chip with basic primitives arranged in a variety of configurations capable of 
being stacked vertically; 
• Vertical interconnection mechanism through materials such as Indium bump; and 
• Software configuration. 
This concept is depicted in Figure 2.1. 
Figure 2.1 3-D Soft-Chip Platform 
The 3-D Soft-Chip architecture is driven by three factors: 
• Development of new device technology (TDST), which can support new architectures 
with complexities of 100M to 1000M devices arranged as an "Intelligent Sea-of-Pixels" 
array; 
• Development of advanced wafer bonding techniques such as Indium bump and the
more futuristic optical interconnects, to implement vertical interconnection;
• Improving the performance of silicon CMOS systems as devices continue to scale
down in dimensions.
Optimum performance of 3-D Soft-Chip based systems is primarily established by the 
choice of basic primitives and architecture as well as the interconnecting network. For 
example, 2-D superposition of devices into 3-D can decrease interconnect delays by up to 
60% relative to a similar planar architecture [1, 2]. Similarly, clever architectures allow 
new flexibility as well as optimised chip area and reduced power consumption. 
2.3 2-D to 3-D Configurable Array Architectural Transformation 
Figure 2.2 illustrates the architecture for a Configurable Array Processor (CAP). 
Figure 2.2 2-D Architecture for Configurable Array Processor 
(Blocks labelled: PE logic; multiplier; Intelligent Configurable Switch (ICS).) 
The processing elements, namely the Pixel Elements (PE) contain arithmetic and logical 
functional blocks such as adders and multipliers as well as control logic, storage and buffer 
elements, to support variable word-length computations (4-bit to 32-bit, as required for 
image and signal processing tasks in content-rich multimedia applications and for 
computation of coefficients for the CGH). 
The ALU should be configurable, area efficient and provide for a number of arithmetic and 
logical operations. It should also provide support for a relatively broad instruction set. The 
communication strategy enables each ALU to communicate with all adjacent ALUs 
through the Intelligent Configurable Switch (ICS). The ICS serves as a cross-point switch 
as well as a large localised storage. 
Transformation of the planar 2-D architecture of Figure 2.2 into the 3-D architecture is 
illustrated in Figure 2.3. 
Figure 2.3 Mapping of Single 2-D Chip into Two Vertically Integrated 3-D Chips 
The upper chip (ICS) takes the form of a massively parallel cross-point switch as well as 
parallel interconnected buffer memory allowing for very high speed data manipulation 
within the plane. The lower chip (CAP), a highly parallel array of soft programmable 
processors, is capable of carrying out complex processing tasks directly on data stored 
either in the top plane through the Indium bump interconnects [20, 29, 30], or within the 
CAP plane. Each of the processors includes its own embedded storage, along with an ALU 
and instruction decoder. Software programmed instructions are forwarded globally to all 
processors from on-chip RAM. Transforms and other processing tasks may be carried out 
according to embedded software instructions on the highly parallel "Intelligent 
Sea-of-Pixel" array, allowing for very high speed data manipulation and throughput at low 
clock speeds. 
2.4 Indium Bump Vertical Interconnects 
The transformation maps the planar 2-D CAP chip architecture into two vertically 
integrated chips, namely the upper ICS chip and the lower CAP. The two chips are flipped 
and connected through their top metallisation layers with low temperature processing such 
as Indium bumps illustrated in Figure 2.4 [29, 30, 32]. Indium is an ideal material to use as 
interconnect metal. It has excellent adhesion to most metals, including gold and aluminium, 
which is the pad metallisation medium. It also has low melting point which easily facilitates 
the bonding on processed VLSI wafers. Indium provides an excellent mechanical and 
electrical connectivity with very high bandwidth (high speed), and very low 
inductance/capacitance (low power) connection between the two chips. 
Figure 2.4 3-D Soft-Chip Physical Architecture 
(Upper chip labelled: Intelligent Configurable Switch (ICS).) 
The process flow for Indium bump is shown in Figure 2.5. The general order in process 
flow includes: (a) oxidation, (b) aluminium pad patterning, (c) photoresist coating and 
patterning, (d) Ti/Au/In evaporation, (e) lift-off and (f) reflow. 
Figure 2.5 Indium Bump Process Flow 
(Panels (a) to (f), one per process step.) 
A basic structure of an Indium bump having a dimension of 15µm is shown in Figure 2.6 
[32]. 
Figure 2.6 Single 15µm Indium Bump after Reflow 
One of the major targets in electronic packaging is to decrease cost and increase the 
packaging density, while maintaining if not improving the performance and reliability of 
the circuit. Implementing Indium bump interconnects to bond the upper and the lower chips 
goes a significant way towards meeting these requirements. Within the 3-D Soft-Chip, a 
massive array of Pixel Elements (PE) is connected through vertical channels that create 
paths between the upper chip and lower chip under software control. As the two chips are 
flip-bonded and 
packaged, the size of the system is significantly reduced. 
2.5 3-D Soft-Chip Architecture 
A generic floor-plan of the ALU as part of the Intelligent Sea-of-Pixels array architecture 
within the CAP Chip is illustrated in Figure 2.7. 
Figure 2.7 Floor-Plan of CAP Chip (Lower Chip) 
(Generic n-bit pixel processor: ALU comprising adder, multiplier and logic blocks, with 
registers.) 
There are two levels of hierarchy within the CAP architecture to facilitate configuration of 
the ALU's word length. The first level utilises four processors and one ICS and at the 
second level this basic group communicates with immediately adjacent groups. High 
efficiency bus architecture provides the interconnection between the parallel array 
processors for rapid extraction or insertion of data. Figure 2.8 illustrates the proposed 
interconnecting bus architecture. 
Figure 2.8 Interconnect Strategy for 3-D Soft-Chip 
(Labels: 4-bit Indium bump bus interconnect to ICS, 15µm x 15µm; Indium bump 
metallisation pad; internal bus; global bus.) 
Memory in the upper chip (ICS) is directly addressable from the CAP chip, alleviating the 
temporary memory requirements for processing as well as minimising data transfer tasks. 
Addressing on the array level in the system will follow switched bus architecture. 
Establishing the optimum number of bits to associate with basic computational elements 
such as adders and multipliers suitable for word-length expansion is important to realise 
system flexibility without compromising performance. 
2.6 Configurable and Scalable ALU Cell 
The algorithms to be performed by the ALU include various forms of transforms 
and optimisation methods, and make use of the following arithmetic functions: 
• Addition,
• Accumulation,
• Multiplication,
• Multiplication and Addition,
• Comparison.
A requirement of the ALU cell is that it must be run-time configurable and must be able to 
accommodate different input data word lengths. Although the intention has always been to 
create a generic 1-bit primitive that is capable of scaling and of being 
configured to perform a multiplicity of arithmetic functions, our early investigation 
highlighted the significant redundancy and overhead costs associated with embedded 
configuration logic. Therefore a generic 4-bit primitive will be considered as the base for 
design. 
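Word-length expansion from a 4-bit base primitive can be sketched behaviourally as follows. This is only an illustration under stated assumptions: the names `add4` and `add_wide` are hypothetical, the slices are chained by a simple carry link, and no claim is made about how the configuration logic of the actual pixel processor is organised.

```python
def add4(a, b, carry_in=0):
    """Behavioural model of one 4-bit adder primitive.

    Returns (4-bit sum, carry-out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def add_wide(a, b, width):
    """Configure width/4 primitives into one adder of the requested width."""
    if width <= 0 or width % 4:
        raise ValueError("width must be a positive multiple of 4")
    result, carry = 0, 0
    for shift in range(0, width, 4):
        # Each slice handles one nibble and passes its carry to the next.
        nibble, carry = add4(a >> shift, b >> shift, carry)
        result |= nibble << shift
    return result, carry


# The same primitive serves 4-bit, 16-bit and 32-bit word lengths.
print(add_wide(0x9, 0x8, 4))
print(add_wide(0x0FFF, 0x0001, 16))
print(add_wide(0xABCD1234, 0x11111111, 32))
```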
2.6.1 Generic 4-Bit ALU 
The logical architecture for a generic 4-bit ALU, capable of performing several different 
operations is shown in Figure 2.9 [43]. 
Figure 2.9 Logical Architecture for a Generic 4-Bit ALU 
(Blocks labelled: 4-bit parallel input; select inputs; 4-bit AL circuit; 4-bit shift register.) 
The ALU consists of five parts: 
• Arithmetic-Logic circuit (AL), where various functions are implemented through
the control signals. Based on these the device can perform logic or arithmetic
operations on individual bits.
• Shift register, consisting of flip-flops and necessary related logic circuits. Signals
applied to inputs determine the type of operation performed.
• Output registers F1 and F2, used to store the bits being shifted out for later usage.
• External bus for communication with ICS chip.
• Localised internal bus for inter-pixel communication.
The ALU must perform arithmetic functions in accordance with instructions from the upper 
chip (ICS) via Indium bumps. A number of issues have to be addressed in the process of 
designing the ALU. These are presented in the order of their complexity and are 
summarised as follows: 
• Choice of adder design style - identification of a suitable class of logic and basic
primitives that can be replicated;
• Choice of multipliers - serial/parallel approach vs. parallel/parallel approach -
configurability and minimum chip area are the imposed constraints;
• Minimisation of overhead used to support reconfiguration;
• Layout using 0.13µm technology - needed to accommodate minimum chip area
requirements;
• Design of complete ICS and CAP chips and related assembly through Indium
bumps; and
• Compiler design.
2.7 System Design Cycle 
Full custom design, which is the design style used for this research program, requires 
careful consideration of many issues relating to design and layout. The review of literature 
has highlighted significant issues that relate to successful design. These can be defined in 
terms of the system design cycle [2]. The design process for any system component has to be 
performed at the algorithm level, architectural level, logical level and circuit or layout 
level, as shown in Figure 2.10. The approach highlights the steps that will be followed to 
move from the fundamental problem at hand to realisation of the physical layout. 
Figure 2.10 System Design Cycle Flow Chart 
(Flow: arithmetic operation and number system, then algorithm, architecture, floor-plan and 
logic gates.) 
2.8 Conclusions 
The concept of 3-D Soft-Chip Technology provides effective solutions for system 
integration by manipulating the functionality of hardware primitives through vertical 
integration of two 2-D chips. The system can be made highly flexible due to the 
programmable nature of the CAP and efficient due to the vertical interconnects (Indium 
bump) and highly parallel configurable architecture. The architecture is also capable of 
highly complex real-time processing tasks. These features make the system ideal for 
content-rich mobile multimedia type applications and calculation of coefficients for 
holographic beam steering. To realise such a concept, various arithmetic primitives have to 
be implemented, as these are the most important components of the system. Such primitives 
have to incorporate configurability as part of the Sea-of-Pixel reconfigurable array, and be 
designed with minimum dimension and power dissipation. 
Establishing the optimum number of bits to associate with basic computational elements 
such as adders and multipliers suitable for word-length expansion is very important. This 
has to be accomplished to realise system flexibility within primitive components of adders 
and multipliers without compromising performance or substantially increasing overhead or 
bit redundancy during complex computations. Although the ICS and CAP chip 
designs and related fabrication, together with the compiler design, are critical components 
of the 3-D Soft-Chip architecture, their complexity and the time they would require are far 
beyond the scope of this thesis and therefore they will not be addressed in this research. 
However, they provide an interesting and fertile platform for future research as a 
continuation of the research program. Hence, in the remaining chapters of this thesis, 
attention will only be directed to the issues that underpin the final design of the CAP chip, 
such as the options in logic styles and related configurable arithmetic primitives. 
Chapter 3 
Evaluation of Adder and Multiplier Designs 
3.0 Summary 
Various options are available for implementation of arithmetic primitives such as adders 
and multipliers. The goal of this research is to use a single primitive which can be 
configured through simple replication, and realise arithmetic primitives capable of 
configuring into variable word lengths with low dissipation and minimum chip area. The 
task in the following sections is to consider various aspects of these circuits and decide on 
their suitability for architectural and physical mapping. 
This Chapter will study some of the options available in the literature for the design of 
adders and multipliers. 
3.1 Adders [1] 
Addition can be viewed in terms of generate, G[i], and propagate, P[i], signals, where c[i] is 
the carry-out signal from stage i, equal to the carry-in of stage (i+1), hence: 
c[i] = Co[i] = Ci[i+1] 
For an adder the carry-in to the first stage (stage zero) is Ci[0] = c[-1] = 0. Hence, 
G[i] = a[i] x b[i] 
P[i] = a[i] ⊕ b[i] 
c[i] = G[i] + P[i] x c[i-1] 
s[i] = P[i] ⊕ c[i-1] 
This means that a basic primitive can be realised by an Exclusive OR gate. 
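The generate and propagate relations can be checked with a short bit-level model. The sketch below is illustrative only; the function name `gp_add` is an assumption, and the carry into stage zero is taken as 0, as stated above.

```python
def gp_add(a, b, n_bits):
    """Add two n_bits-wide unsigned integers using G[i] and P[i].

    Returns (n_bits-wide sum, carry-out of the last stage)."""
    carry = 0  # carry-in to stage zero
    total = 0
    for i in range(n_bits):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        g = a_i & b_i                  # G[i] = a[i] x b[i]
        p = a_i ^ b_i                  # P[i] = a[i] XOR b[i]
        total |= (p ^ carry) << i      # s[i] = P[i] XOR c[i-1]
        carry = g | (p & carry)        # c[i] = G[i] + P[i] x c[i-1]
    return total, carry


print(gp_add(0b1011, 0b0110, 4))  # 11 + 6 = 17, i.e. sum bits 0001 with carry-out 1
```

Both the propagate signal and the sum bit are produced by an XOR, which is why the Exclusive OR gate can serve as the basic primitive.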
3.1.1 Ripple Carry Adder (RCA) [44, 45, 55] 
Figure 3.1 shows a conventional Ripple Carry Adder (RCA). The delay of an n-bit RCA is 
proportional to n and is determined by the delay associated with the propagation of the 
carry signal through all of the stages. 
Figure 3.1 Block Diagram of a Ripple Carry Adder (RCA) 
The disadvantage of using a RCA is that every stage has to wait to make its carry decision, 
c[i], until the previous stage has calculated c[i-1]. 
3.1.2 Carry-Save Adder (CSA) [1, 45] 
A carry-save adder has inputs ai, bi and si, and output s0, as shown in Figure 3.2. 
Figure 3.2 Block Diagram of a Carry Save Adder (CSA) 
The input ci is the carry from stage (i-1), with ci[0] = 0. The output c0 is the carry out to 
stage (i+1). A 4-bit CSA implementation is shown in Figure 3.3. 
Figure 3.3 4-Bit Carry-Save Adder 
In a CSA the carries are "saved" at each stage. There is thus no carry propagation hence the 
delay of a CSA is constant. At the output of a CSA all the saved carries need to be added to 
all the sums to obtain an n-bit result. In this way the n-bit sum is encoded in output s0. 
By incorporating registers between stages of combinational logic we use pipelining to 
increase the speed. The drawback is increased area (addition of registers) and latency 
(latency is equal to n clock cycles for an n-stage pipeline). It takes a few clock cycles to fill 
the pipeline, but once it is filled, the results emerge each clock cycle. 
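The idea of saving carries rather than propagating them can be sketched with word-level bitwise operations. The following is a behavioural illustration under assumptions (hypothetical function names, unsigned operands, no registers or pipelining modelled): each carry-save step reduces three operands to a sum word and a carry word, and a single carry-propagate addition is needed only at the end.

```python
def carry_save(x, y, z):
    """One carry-save step: three operands in, (sum word, carry word) out.

    Every bit position is computed independently, so no carry ripples."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1  # carries saved, shifted left
    return sum_word, carry_word


def sum_operands(operands):
    """Accumulate many operands in carry-save form, then resolve once."""
    sum_word, carry_word = 0, 0
    for operand in operands:
        sum_word, carry_word = carry_save(sum_word, carry_word, operand)
    # Final step: the saved carries are added to the sums.
    return sum_word + carry_word


print(sum_operands([3, 5, 7, 9]))
```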
3.1.3 Carry Bypass Adders (CBA) [47, 55] 
Carry Bypass Adders (CBA) bypass the carries for a critical set of bits to speed up
computation. For example, to bypass the carries for bits 4-7 of an adder we can compute:
BYPASS = P[4].P[5].P[6].P[7]
A multiplexer can then be used to perform the bypass. For example, c[7] is realised as:
c[7] = (G[7] + P[7] x c[6]) x BYPASS' + c[3] x BYPASS
CBAs compute the carry along two different paths, hence they usually include redundant
logic, resulting in larger chip area utilisation.
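As a check on the bypass expression, the sketch below compares the multiplexed form of c[7] with the carry obtained by plain rippling for an 8-bit adder. It is an illustration under the assumption that G[i] = a[i].b[i], P[i] = a[i] ⊕ b[i] and c[-1] = 0; the helper names are hypothetical.

```python
def ripple_carries(a, b, n, c_in=0):
    """Return (G, P, c) where c[i] is the carry out of bit i."""
    G = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    P = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    c, prev = [], c_in
    for i in range(n):
        prev = G[i] | (P[i] & prev)   # c[i] = G[i] + P[i].c[i-1]
        c.append(prev)
    return G, P, c


def bypass_c7(a, b):
    """c[7] computed through the bypass multiplexer for bits 4-7."""
    G, P, c = ripple_carries(a, b, 8)
    bypass = P[4] & P[5] & P[6] & P[7]
    normal_path = G[7] | (P[7] & c[6])
    # 2:1 multiplexer: select c[3] when all four bits propagate.
    return (normal_path & (1 - bypass)) | (c[3] & bypass)
```

When BYPASS is 1 every bit in the group propagates, so none of them generates, and the carry out of bit 7 equals c[3]; the two paths therefore always agree.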
3.1.4 Carry Skip Adder (CKA) [47, 55] 
In the CKA algorithm the inputs are checked instead of the propagate signals. For
example we can compute:
SKIP = (a[i-1] ⊕ b[i-1]) + (a[i] ⊕ b[i])
Then use a 2:1 multiplexer to select c[i]. Hence:
cSKIP[i] = (G[i] + P[i] x c[i-1]) x SKIP' + c[i-2] x SKIP
Carry skip adders also may include redundant logic since the carry is computed in two 
different ways. 
3.1.5 Carry Look-Ahead Adder (CLA) [1, 45, 47]
As before the carry equation is: 
c[i] = G[i] + P[i] x c[i-1] 
Evaluating this equation recursively for i = 1, with c[-1] = 0:
c[1] = G[1] + P[1] x c[0]
= G[1] + P[1] x (G[0] + P[0] x c[-1])
= G[1] + P[1] x G[0]
This implies that it is possible to "look ahead" by two stages and calculate the carry into the
third stage (bit 2), which is c[1], using only the first-stage inputs (calculating G[0]) and the
second-stage inputs. Expanding further:
c[2] = G[2] + P[2] x G[1] + P[2] x P[1] x G[0]
c[3] = G[3] + P[3] x G[2] + P[3] x P[2] x G[1] + P[3] x P[2] x P[1] x G[0]
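The look-ahead expansion can be checked against the recursive carry equation with a small model. The sketch below is illustrative only: it assumes c[-1] = 0, takes G and P as lists of bits indexed from bit 0, and uses hypothetical function names. The flat expressions are those obtained by unrolling c[i] = G[i] + P[i] x c[i-1] for a 4-bit block.

```python
def lookahead_carries(G, P):
    """Carries of a 4-bit block written as flat sum-of-products."""
    c0 = G[0]
    c1 = G[1] | (P[1] & G[0])
    c2 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0])
    c3 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]))
    return [c0, c1, c2, c3]


def recursive_carries(G, P):
    """Carries from the recursion c[i] = G[i] + P[i].c[i-1], c[-1] = 0."""
    c, prev = [], 0
    for g, p in zip(G, P):
        prev = g | (p & prev)
        c.append(prev)
    return c


def gp_terms(a, b, n=4):
    """Assumed generate and propagate terms for n-bit operands."""
    G = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    P = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    return G, P
```

The number of product terms grows with each carry, which is the growth in complexity and fan-in that limits how far the look-ahead can be extended.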
A 4-bit Carry Look-Ahead Generator Cell (CLG) block is shown in Figure 3.4. 
Figure 3.4 Carry Look-Ahead Generator Cell (CLG) 
The shortfall of this implementation is that as look-ahead progresses further the equations 
become very complex, take longer to calculate, and the logic becomes less regular when 
implemented using cells with a limited number of inputs. Also the fan-in and the fan-out 
become very large. 
3.1.6 Carry Select Adder (CSA) [45, 47]
In this algorithm, a small adder (usually a 4-bit or 8-bit CLA adder) is duplicated, with one
copy for ci = 0 and one for ci = 1. A multiplexer is then used to select the case which is
needed. Carry select adders are fast, however they are not area efficient.
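A behavioural sketch of carry selection is given below. It is an illustration rather than a description of any particular circuit: the block adders are modelled with integer addition, the word length n is assumed to be a multiple of the block size, and the function names are hypothetical.

```python
def block_add(a, b, width, c_in):
    """Model of one small adder block: returns (sum, carry out)."""
    total = a + b + c_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, n=16, block=4):
    """Add two n-bit numbers using duplicated block adders."""
    result, carry = 0, 0
    mask = (1 << block) - 1
    for lo in range(0, n, block):
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        # Both cases are computed in parallel in hardware.
        sum0, cout0 = block_add(a_blk, b_blk, block, 0)
        sum1, cout1 = block_add(a_blk, b_blk, block, 1)
        # The incoming carry only drives the multiplexer.
        s, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= s << lo
    return result, carry
```

The duplication of every block is the source of the area penalty noted above.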
3.1.7 Brent-Kung Adder [47] 
The Brent-Kung adder reduces the delay and increases the regularity of a CLA scheme. A 
4-bit Brent-Kung CLA is illustrated in Figure 3.5.
Figure 3.5 Brent-Kung Carry Look-Ahead Adder
Propagate and carry terms are formed from the inputs to the adder. The output of the
look-ahead logic is the carry bit that, together with the inputs, forms the sum. With the
Brent-Kung adder the delays from the inputs to the outputs are more nearly equal than in other
adders, hence the number of unwanted and unnecessary switching events is reduced, which
in turn leads to a reduction in power dissipation. However, implementation as a configurable
component is not a straightforward design option.
3.2 Multiplication Algorithms 
An n-bit multiplier multiplies two n-bit numbers. Each partial product is added to the other
partial products as well as to the carries from previous stages. The following outlines the
multiplication operation for two 3-bit binary numbers.
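                        a2     a1     a0
                    x   b2     b1     b0
                    --------------------
                      a2b0   a1b0   a0b0
               a2b1   a1b1   a0b1
        a2b2   a1b2   a0b2
----------------------------------------
 P5      P4     P3     P2     P1     P0

Column sums (c1 to c6 denote the carries from previous stages):
P0 = a0b0
P1 = a1b0 + a0b1
P2 = a2b0 + a1b1 + a0b2 + c1
P3 = a2b1 + a1b2 + c2 + c3
P4 = a2b2 + c4 + c5
P5 = c6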
In the following sections various implementations of multiplier architectures and 
algorithms will be briefly outlined highlighting the merits or otherwise for implementation. 
Some classic power reduction strategies used within some of these multiplier
configurations include:
• reduction of supply voltage and clock speed,
• pipelining of operations,
• tuning of input bit-patterns to reduce switching,
• transforming the coefficient binary representation to minimise the computations
needed, and
• re-ordering the sequence of multiply-and-accumulate operations.
3.2.1 Multiplier Design Through Coefficient Optimisation [49] 
This architecture implements multiplication using a set of coefficients. The original 
coefficients are scaled, enabling each multiplication to be partitioned into a collection of 
smaller multiplications in parallel, creating a shorter critical path than the original 
coefficient. The multiplier rows that correspond to multiplications by zero are disabled. 
Hence these do not affect the multiplication's final outcome, which leads to energy savings 
and reduced power dissipation. 
Power dissipation is reduced further by incorporating deactivation circuitry and bypass
logic to omit unnecessary switching of the adder cells. Figure 3.6 illustrates a modified
full-adder cell implemented through coefficient optimisation.
Figure 3.6 Modified Full Adder Cell 
The two multiplexers transfer the partial product terms to bypass the adders when the 
corresponding bits are zero. Tri-state buffers are incorporated at the inputs of the adder 
cells to avoid any switching activity in the cells which are bypassed. 
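The effect of disabling rows that correspond to zero coefficient bits can be illustrated with a simple shift-and-add model. This sketch only counts how many adder rows remain active; it does not model the clustering or scaling of coefficients, and the function name is hypothetical.

```python
def multiply_skipping_zero_rows(a, coeff, n):
    """Multiply a by an n-bit coefficient, bypassing zero rows.

    Returns (product, number of active adder rows). Rows whose
    coefficient bit is 0 contribute nothing and are skipped, which
    stands in for the bypass of the corresponding adder cells.
    """
    product, active_rows = 0, 0
    for i in range(n):
        if (coeff >> i) & 1:
            product += a << i      # row i is active
            active_rows += 1
        # else: row i is bypassed, no switching in its adders
    return product, active_rows
```

With the coefficient 1001 only two of the four rows are active, whereas 1111 keeps all four rows switching.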
Figure 3.7 shows the average power dissipation of the modified array multiplier. The line
labelled "Normal" depicts the average power dissipation of a 12-bit conventional array
multiplier for comparison purposes. As can be observed, the modified array multipliers
generally have lower average power dissipation. When the coefficients are optimised
through partitioning, this architecture can result in significant power savings, being up to
twice as energy efficient as the conventional array structure.
The drawback of this architecture is that the individual circuit elements run slower, and 
circuit performance degrades. Also bypassing certain circuit elements will pose problems 
for configurability of the overall cell. Deactivation circuitry and bypass logic used within 
the cell will increase chip area, as well as contributing to software overhead involved to 
activate such circuitry. 
[Plot: average power dissipation versus Supply Voltage (V), 1.2 V to 3.2 V; curves: Normal, 4 Clusters, 5 Clusters, 6 Clusters]
Figure 3.7 Power Dissipation of Modified Array Multiplier as a Function of the 
Supply Voltage (0.35µm CMOS Process Technology) 
3.2.2 Tree Based Multiplier [1, 47] 
A Tree based arrangement is illustrated in Figure 3.8. 
Figure 3.8 Tree Based Multiplication 
The outputs of each stage are the inputs to the next stage. At each stage there are three
options: add three outputs using a full adder, add two outputs using a half adder, or pass the
outputs directly to the next stage. The objective of the algorithm is to choose one of these
options at each stage so as to maximise the performance of the multiplier.
3.2.3 Wallace-Tree Multiplier [47] 
Wallace-tree multipliers implement tree based multiplication. The top section of the 
multiplier, the carry-save section, consists of 26 adders, (6 of which are half adders), and 
the bottom section of the multiplier, the carry-propagate section, consists of 4 adder cells. 
Figure 3.9 shows a 6-bit Wallace-tree multiplier. 
Figure 3.9 A 6-Bit Wallace-Tree Multiplier 
Working downward from the multiplier inputs, the number of signals to be added at each
stage is compressed. A full adder may be considered as a 3:2 compressor or (3, 2) counter,
i.e. it counts the number of '1's on its inputs. For example, an input of '101' (two '1's) results
in an output of '10' (2). To form P5 in Figure 3.9, 6 summands and 4 carries from the P4
column must be added. These are added in stages 1-7, compressing from 6:3:2:2:3:1:1. The
last carry from column P4 is added in stage 5. The maximum delay through the carry-save
section of Figure 3.9 is 6 adder delays, to which the delay of the 9-input carry-propagate
section of stage 7 must be added.
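The compression of summand columns can be sketched as follows. This is a simplified tree reduction, not a reproduction of the adder arrangement of Figure 3.9: at every stage it applies full adders to groups of three bits, a half adder to a leftover pair, and passes a single leftover bit through. All names are hypothetical.

```python
def full_adder(x, y, z):
    """(3, 2) counter: returns (sum bit, carry bit)."""
    total = x + y + z
    return total & 1, total >> 1


def half_adder(x, y):
    total = x + y
    return total & 1, total >> 1


def reduce_columns(columns):
    """One reduction stage; columns[k] holds the bits of weight 2**k."""
    nxt = [[] for _ in range(len(columns) + 1)]
    for k, col in enumerate(columns):
        i = 0
        while len(col) - i >= 3:                 # full adder on three bits
            s, c = full_adder(*col[i:i + 3])
            nxt[k].append(s)
            nxt[k + 1].append(c)
            i += 3
        if len(col) - i == 2:                    # half adder on two bits
            s, c = half_adder(*col[i:i + 2])
            nxt[k].append(s)
            nxt[k + 1].append(c)
        elif len(col) - i == 1:                  # pass straight through
            nxt[k].append(col[i])
    return nxt


def tree_multiply(a, b, n):
    """Multiply two n-bit numbers by column compression."""
    columns = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    stages = 0
    while max(len(col) for col in columns) > 2:
        columns = reduce_columns(columns)
        stages += 1
    # Final carry-propagate addition of the two remaining rows.
    product = sum(bit << k for k, col in enumerate(columns) for bit in col)
    return product, stages
```

The call full_adder(1, 0, 1) returns a sum bit of 0 and a carry bit of 1, which is the '10' output quoted above for the input '101'.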
The Wallace-tree algorithm is a relatively fast algorithm, however it lacks the ability to be
configured. Configurability implies a degree of uniformity of components and their style of
interaction with other elements within the array. The structure of the design and the element
interactions do not follow a uniform pattern. The presence of half adders means that the
components within the multiplier are not all uniform, hence to make the array configurable,
additional circuitry is needed to create consistency within the array as well as to facilitate
communication between adjacent cells.
3.2.4 Dadda Multiplier [45, 47] 
In a Dadda multiplier, each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, ... outputs, where
each successive limit is 3/2 times the previous one (rounded down to an integer). For
example, a 6-bit Dadda multiplier, illustrated in Figure 3.10, requires 3 stages with 3
adder delays, plus the delay of a 10-bit output carry-propagate section.
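The sequence of stage limits and the resulting number of stages can be generated with a few lines of Python. This is an illustrative sketch with hypothetical function names; it assumes that an n-bit multiplier starts with columns of at most n summands and that one stage is needed for every limit below n.

```python
def dadda_heights(limit):
    """Stage limits 2, 3, 4, 6, 9, ... not exceeding `limit`."""
    heights = [2]
    while heights[-1] * 3 // 2 <= limit:
        # Each limit is 3/2 times the previous one, rounded down.
        heights.append(heights[-1] * 3 // 2)
    return heights


def dadda_stages(n):
    """Number of reduction stages for an n-bit multiplier."""
    return sum(1 for h in dadda_heights(n) if h < n)
```

For n = 6 the limits below 6 are 4, 3 and 2, giving the three stages quoted above.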
Figure 3.10 6-Bit Dadda Multiplier 
Dadda multipliers are usually faster and smaller than Wallace-tree multipliers. The
carry-save section requires 20 adders. The carry-propagate section is a ripple-carry adder (RCA).
Comparing the Dadda and the Wallace-tree multipliers, the CSA component of the Dadda
multiplier is smaller (20 vs. 26 adders), faster (3 adder delays versus 6 adder delays), and
more regular than the CSA of the Wallace-tree multipliers. However the overall speed of
this implementation is approximately the same as the Wallace-tree multiplier. As with the
Wallace-tree multiplier, the major problem of this configuration is the overhead for creating
configurability.
3.2.5 Booth Encoding Multiplier [47, 50] 
We can encode any binary number, B, as a canonical signed digit (CSD) vector, D, where there
is only one CSD vector for any number. To recode a binary number B as a CSD vector D, we
have:
$$D_i = B_i + C_i - 2C_{i+1}$$
where $C_{i+1}$ is the carry from the sum $B_{i+1} + B_i + C_i$, and $C_0 = 0$. As an example, if
B = 011 ($B_2 = 0$, $B_1 = 1$, $B_0 = 1$; decimal 3), then:
$$D_0 = B_0 + C_0 - 2C_1 = 1 + 0 - 2 = -1,$$
$$D_1 = B_1 + C_1 - 2C_2 = 1 + 1 - 2 = 0,$$
$$D_2 = B_2 + C_2 - 2C_3 = 0 + 1 - 0 = 1,$$
so that D = $1\,0\,\bar{1}$ (decimal 4 - 1 = 3), where $\bar{1}$ denotes the digit -1.
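The recoding rule can be sketched in Python as follows. This is an illustration under stated assumptions: B is taken as an unsigned n-bit number, one extra digit position is produced so that the final carry is absorbed, digits are returned least significant first, and the function name is hypothetical.

```python
def csd_recode(b, n):
    """Recode an unsigned n-bit number into signed digits.

    Returns D[0..n] (least significant digit first), each digit
    in {-1, 0, 1}, using D[i] = B[i] + C[i] - 2*C[i+1].
    """
    # Two padding zeros: B[n] and B[n+1] for the last look-ahead.
    bits = [(b >> i) & 1 for i in range(n)] + [0, 0]
    digits, carry = [], 0                    # C[0] = 0
    for i in range(n + 1):
        # C[i+1] is the carry from the sum B[i+1] + B[i] + C[i].
        next_carry = (bits[i + 1] + bits[i] + carry) >> 1
        digits.append(bits[i] + carry - 2 * next_carry)
        carry = next_carry
    return digits
```

For B = 011 the function returns the digits -1, 0, 1, 0 (least significant first), in agreement with the worked example.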
Consider multiplying an 8-bit binary number, A, by B = 00010111 (decimal 16 + 4 + 2 + 1 =
23). It is easier to multiply A by the CSD vector of B, i.e. D = $0\,0\,1\,0\,\bar{1}\,0\,0\,\bar{1}$
(decimal 32 - 8 - 1 = 23, where $\bar{1}$ denotes the digit -1). This requires only three add or
subtract operations: B has a weight of 4 and D has a weight of 3. By using D instead of B we
have reduced the number of partial products by 1. Encoding can be implemented using a radix
other than 2. Suppose B is an (n+1)-digit 2's complement number,
$$B = B_0 + B_1 2 + B_2 2^2 + \dots + B_i 2^i + \dots + B_{n-1} 2^{n-1} - B_n 2^n$$
We can rewrite the expression for B as follows:
$$2B - B = B = -B_0 + (B_0 - B_1)2 + \dots + (B_{i-1} - B_i)2^i + \dots + (B_{n-1} - B_n)2^n$$
$$= (-2B_1 + B_0)2^0 + (-2B_3 + B_2 + B_1)2^2 + \dots + (-2B_i + B_{i-1} + B_{i-2})2^{i-1} + (-2B_{i+2} + B_{i+1} + B_i)2^{i+1} + \dots + (-2B_n + B_{n-1} + B_{n-2})2^{n-1} \qquad \text{(equation 3.1)}$$
As an example consider B = 101001 (decimal 9 - 32 = -23, n = 5):
$$B = 101001 = (-2B_1 + B_0)2^0 + (-2B_3 + B_2 + B_1)2^2 + (-2B_5 + B_4 + B_3)2^4$$
$$= ((-2 \times 0) + 1)2^0 + ((-2 \times 1) + 0 + 0)2^2 + ((-2 \times 1) + 0 + 1)2^4$$
Hence B is encoded as the radix-4 signed-digit vector E = $\bar{1}\,\bar{2}\,1$ (decimal
-16 - 8 + 1 = -23), where a bar denotes a negative digit. To multiply by B encoded as E we only
have to perform a multiplication by 2 (a shift) and three add/subtract operations. Using
equation 3.1 we can encode any number by taking groups of three bits at a time and calculating:
$$E_j = -2B_i + B_{i-1} + B_{i-2}$$
$$E_{j+1} = -2B_{i+2} + B_{i+1} + B_i, \dots$$
where each 3-bit group overlaps by one bit. We pair B with a zero, $B_n \dots B_1 B_0 0$, to match
the first term in equation 3.1. If B has an odd number of bits, then we extend the sign,
$B_n B_n B_{n-1} \dots B_1 B_0 0$. This algorithm reduces the number of partial products by a factor
of two. As a result there is an increase in the speed of the multiplier. The drawback is the
overhead caused by the complex manipulation of the binary digits, especially when dealing with
higher order numbers. The circuitry needed to implement the encoding will add to the
overall chip size. Also, due to the complex nature of the algorithm, configurability may be
problematic.
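The radix-4 grouping can be sketched in Python as shown below. This is a behavioural illustration only: B is passed as a signed Python integer that is assumed to fit in nbits bits of 2's complement, the digits are returned least significant first, and the function names are hypothetical.

```python
def booth_radix4_digits(b, nbits):
    """Radix-4 signed digits of a 2's complement number.

    Each digit is -2*B[i] + B[i-1] + B[i-2], taken over groups of
    three bits that overlap by one bit, with a zero appended below
    B[0]. Digits lie in {-2, -1, 0, 1, 2}.
    """
    if nbits % 2:
        nbits += 1                 # odd length: extend the sign bit
    # bits[0] is the appended zero; bits[k + 1] is B[k].
    # Python's >> on negative integers sign-extends, as required.
    bits = [0] + [(b >> i) & 1 for i in range(nbits)]
    return [-2 * bits[j + 2] + bits[j + 1] + bits[j]
            for j in range(0, nbits, 2)]


def booth_multiply(a, b, nbits):
    """Multiply a by b using one partial product per radix-4 digit."""
    digits = booth_radix4_digits(b, nbits)
    # Digit k has weight 4**k; +/-2 corresponds to a shift of a.
    return sum((d * a) << (2 * k) for k, d in enumerate(digits))
```

For B = 101001 (decimal -23) the digits are 1, -2, -1 (least significant first), so a 6-bit multiplier needs three partial products instead of six.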
3.2.6 Braun Multiplier [47]
The Braun Multiplier computes all the partial products of the bit pairs from the binary 
numbers a and b in parallel, and then adds the appropriate terms using cascaded full adders. 
As an example, the block diagram for a 3-bit Braun multiplier is illustrated in Figure 3.11. 
Figure 3.11 3-Bit Braun Multiplier Block Diagram 
The Braun Multiplier is a promising solution for realising the 3-D soft-chip architecture. The
placement and flow of the components within the structure is relatively uniform, making
this algorithm configurable. One drawback of this design is the absence of a specialised
algorithm to exclude redundancy; however, adding such circuitry to the structure would increase
the chip area, which is not desirable for the purpose of this research. Through careful
selection of the adder design style used to implement the multiplier cells, area optimisation
may be achieved.
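The regular structure of the Braun array can be illustrated with a bit-level model. The sketch below is an assumption-laden illustration rather than the circuit of Figure 3.11: it forms all partial products with AND operations, passes sums diagonally and carries vertically through identical adder rows, and resolves the remaining carries in a final ripple row. The function name is hypothetical.

```python
def braun_multiply(a, b, n):
    """Unsigned n x n array multiplication in the Braun style."""
    # pp[j][i] = a[i] AND b[j], all formed in parallel.
    pp = [[((a >> i) & 1) & ((b >> j) & 1) for i in range(n)]
          for j in range(n)]
    sums = pp[0][:]                 # row 0 needs no adders
    carries = [0] * n
    product = sums[0]               # P0
    for j in range(1, n):           # identical carry-save rows
        new_sums, new_carries = [], []
        for i in range(n):
            upper = sums[i + 1] if i + 1 < n else 0
            total = pp[j][i] + upper + carries[i]   # full adder cell
            new_sums.append(total & 1)
            new_carries.append(total >> 1)
        sums, carries = new_sums, new_carries
        product |= sums[0] << j     # Pj leaves the array
    ripple = 0
    for i in range(n):              # final ripple-carry row
        upper = sums[i + 1] if i + 1 < n else 0
        total = upper + carries[i] + ripple
        product |= (total & 1) << (n + i)
        ripple = total >> 1
    return product
```

Every row uses the same cell and the same neighbour connections, which is the uniformity that makes the structure attractive for a configurable array.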
3.3 Configurable Multiplier Array 
Implementing a configurable style of architecture necessitates: 
• Uniformity of component design styles,
• Uniformity and configurability of array structure and,
• Capability of an array to interact with adjacent arrays within the system.
In the following section, some configurable architectures are presented for binary 
multiplications. 
3.3.1 Simple Chain Array Algorithms [45]
Figure 3.12 illustrates a simple 4x4 chain array multiplier similar to the Braun multiplier 
algorithm. 
Figure 3.12 4x4 Chain Array
In this structure the elements within the array are uniform, but interaction with
similar adjacent arrays needs to be implemented through hard wiring of the inputs and the
outputs.
3.3.2 Array Design with Periphery Multiplexer Per Cell 
Figure 3.13 illustrates a 4x4 array design in which each array element interacts with its
adjacent element through cell periphery multiplexers [51]. The perimeter elements also
interact with adjacent arrays through the same multiplexers. The control signals of the
multiplexers select the flow of information between elements.
Figure 3.13 4x4 Array Design with Periphery Multiplexer Per Cell 
This structure is highly configurable, but the multiplexers increase the power dissipation 
and the silicon area occupied. 
3.3.3 Array Design with Periphery Multiplexer Per Array 
In this design, illustrated in Figure 3.14, multiplexers are placed on the outside periphery of 
the basic array instead of each multiplier cell, to allow communication with adjacent arrays 
[51]. 
Figure 3.14 4x4 Array Design with Periphery Multiplexer Per Array 
This architecture is more area efficient, as it substantially reduces the number of
multiplexers within each array. At the same time the structure maintains its
configurability.
3.4 Conclusions 
This Chapter provided a glimpse of numerous choices that are available for implementation 
of adders and multipliers as part of the "Sea-of-Pixels" architecture for the 3-D chip. 
The multiplier design through coefficient optimisation creates a shorter critical path, 
reduces switching activity, and decreases power dissipation through coefficient clustering 
and partitioning. It also promotes energy saving by disabling redundant paths, 
implementing deactivation circuitry and utilising bypass logic. However the drawback of 
this architecture is that the individual circuit elements run slower, decreasing circuit 
performance. Deactivation circuitry and bypass logic used within the cell will increase chip 
area and software overhead as well as reducing configurability. 
Tree Based Multiplier architectures such as the Wallace-tree algorithm are relatively fast 
algorithms, however the structure of the design does not follow a uniform pattern. The 
interactions are not similar between the elements of the cell, and the presence of half adders 
means that the components within the multiplier will not be uniform. To make the array
configurable, additional circuitry is possibly needed to create consistency within the array
as well as to facilitate communication between adjacent cells.
Dadda multipliers are smaller and more regular than Wallace-tree multipliers. The overall 
speed of Dadda implementation is approximately the same as that of the Wallace-tree, 
however, the major problem of this algorithm also is creating configurability. 
Booth Encoding reduces the number of partial products by a factor of two. As a result there 
is an increase in the area and the speed of the multiplier. The drawback to this algorithm is 
the overhead created by the complex data manipulation. The encoding circuitry will
increase the overall chip size, and the complex nature of the algorithm does not allow
configurability.
The placement and flow of the components within the Braun multiplier architecture is 
relatively uniform, facilitating configurability. Area optimisation may be achieved through 
careful selection of adder design style within the multiplier cells, making the Braun 
multiplication algorithm a promising solution for realising the arithmetic primitives to be 
implemented within the 3-D Soft-Chip architecture. 
The intention here was not to make the literature survey exhaustive but to provide an
insight into the options available and to discard architectures that are too complex, do not
scale readily or consume too much silicon area.
In the next chapter we will look at logic styles that would then facilitate effective 
implementation of configurable primitives. Alternative structures that are promising and 
that can implement basic arithmetic primitives as part of the ALU and ultimately provide 
the foundation for realisation of the 3-D Soft-Chip will be studied. 
Chapter 4 
Logic Styles Mapping 
4.0 Summary 
The differing requirements within various applications have resulted in existence of various 
architectures for implementation of the basic operations such as addition, and 
multiplication. Design requirements focus on a number of factors such as, speed, area, 
power dissipation, and configurability. Invariably the choice of logic style influences the 
above listed factors. Therefore this chapter will provide an outline of logic styles and will 
follow through with the options that are available in the design of basic arithmetic 
primitives that ultimately would provide the foundation for realisation of the 3-D Soft­
Chip. 
4.1 Literature Review for Logic Styles 
Having stipulated the requirements for the configurable primitives namely the need for 
adders and multipliers that can support variable word-length with economical chip area, it 
is now necessary to look at the options that are available in circuit techniques and the 
related choice of logic.
4.1.1 Comparison of Logic Styles 
The literature review has revealed a variety of circuit techniques summarised in Table 4.1. 
The comparison parameters highlight important issues that need to be considered during the 
design process [52]. Table 4.1 illustrates the very fact that there are many options in logic 
styles, each with their own relative merits. 
For example comparing Static CMOS with Dynamic Logic, it can be observed that 
Dynamic Logic is a better performer (+ performance) than the Static CMOS. However, the 
reliability of Dynamic Logic is worse than (- reliability) that of Static CMOS. Similarly 
comparing Static CMOS with DCVSL, it can be seen that DCVSL has a high logic 
flexibility and density (+ reliability, + high logic flexibility). However it lacks performance 
(- performance). 
In summary:
• Dynamic circuits are more efficient in terms of chip area and performance when 
compared to static CMOS. 
• There are appropriate logic styles for each target performance determined by choice
of P, R, E, A and T.
• Complementary Pass Gate (CPG) and Transmission Gate (TG) are compact and 
compatible with static CMOS and have been used effectively. 
• Complementary Pass Gate logic (CPL) is one of the simplest and fastest of the 
circuit families. 
• Testing complexity increases with variations of the dynamic style of logic.
Key to Table 4.1
o = neutral; + = more, - = less
P = Performance, R = Reliability, E = Energy, A =  Area; T = Testability complexity
4.2 XOR/XNOR Logic Styles [64] 
Exclusive-OR (XOR) and exclusive-NOR (XNOR) gates are important in digital circuits 
such as full adder design. The efficiency of the XOR/XNOR gates affects the performance 
of the larger circuits such as configurable word length arithmetic primitives needed in the 
3-D Soft-chip. Therefore it is highly desirable to investigate this logic style for its
suitability for adder implementation.
There are a number of techniques in the design of XOR/XNOR gates using either dual 
networks of pMOS and nMOS transistors or the alternative Pass-Transistor Logic (PTL). In 
PTL the source side of the MOS transistor is connected to an input line instead of the power 
lines. Here only one pass-transistor network (either nMOS or pMOS) is required. 
The behavioural description of XOR and XNOR gates for inputs a and b is expressed as: 
a ⊕ b = (a' x b) + (a x b')
(a ⊕ b)' = (a' x b') + (a x b)
The XOR/XNOR gates use transistors with single-rail inputs. The XNOR gate outputs a
logic '1' when both signals at the input are equal, whereas the XOR gate yields a logic '0'
for the same inputs. Figure 4.1 illustrates a number of variations for the CMOS based
XOR/XNOR logic (XONL).
Table 4.1 Variety of Logic Styles
Logic Styles                                               P  R  E  A  T
Static CMOS (CMOS) [53, 45, 54, 55, 62]                    o  +  o  +  +
Pulsed Static CMOS (PS-CMOS) [54]                          o  +  o  +  +
Dynamic Logic (Dynamic) [45, 53, 54, 61]                   +  -  -  -  -
Single Phase Logic (SPL) [56]                              +  -  -  -  -
Single Phase Logic Two Voltage (SPLTV) [56]                +  -  -  -  -
Complementary Pass Gate Logic (CPL) [1, 45, 61]            +  -  -  +  o
Exclusive OR/Exclusive NOR Logic (XONL) [63, 65]           +  o  -  -  o
Pass Transistor Logic (PTL) [54, 55]                       +  o  -  +  o
Transmission Gate Logic (TG) [1]                           o  +  o  +  o
Differential Cascade Voltage Switch (DCVSL) [45]           -  +  -  +  o
Cascade Non-Threshold Logic (CNTL) [45]                    -  +  -  +  o
Energy Economised Pass Transistor Logic [54, 55]           -  +  +  o  -
Domino Logic (Domino) [45, 57, 58, 59, 60]                 +  -  -  +  -
No-Race Logic (NRL) [45, 69]                               +  -  -  +  -
Charge Recycling Differential Logic (CRDL) [68]            o  +  o  -  -
Figure 4.1 (a) XOR/XNOR Logic and (b) CMOS Implementation 
An extension of the concept in the design of XOR/XNOR circuits is shown in Figure 4.2,
where Complementary Pass-Transistor (CPL) based logic is adopted.
Figure 4.2 (c,d) CPL Implementation of XOR/XNOR Circuits 
A variation of CPL, the Double-Pass Transistor Logic (DPL), shown in Figure 4.3, is also
used to implement XOR/XNOR circuits.
Figure 4.3 DPL Implementation of XOR/XNOR Circuits
The design of 4-transistor XOR/XNOR circuits with single-rail inputs is illustrated in
Figure 4.4. The Vdd in (f) is connected to input a. Since it has no power supply, it is
referred to as the Powerless XOR.
Figure 4.4 Implementation of 4-Transistor XOR/XNOR Circuits 
Figure 4.5 shows the circuit diagram for a modified version of the XOR/XNOR gate. It is
similar to (g) in Figure 4.4, with the difference that the new XNOR gate is a Groundless
XNOR.
Figure 4.5 Groundless XOR/XNOR Gates 
Table 4.2 shows a comparison of the transistor count and the critical path for the circuits of
Figure 4.1 to Figure 4.5. All circuits have single-rail inputs. Those with double-rail inputs
are modified to include added inverters (2 transistors) and transmission gates (2 transistors)
to create the same base for comparison [64].
Table 4.2 Comparison of XOR Gates 
To estimate performance we note that the power dissipation can be described by [1]:
$$P_{dynamic} = \left(\sum_i C_i \times V_{i\_swing} \times P_i\right) \times f_{clk} + i_{sc} \times V_{dd} + \sum i_{leak} \times V_{dd}$$
where
$C_i$ = load capacitance,
$V_{i\_swing}$ = voltage swing,
$P_i$ = probability of a switch,
$f_{clk}$ = clock frequency,
$i_{sc}$ = short-circuit current,
$i_{leak}$ = leakage current (very low, can be omitted) and
$V_{dd}$ = supply voltage.
The voltage swing $V_{i\_swing}$ is the voltage difference between logic '1' and logic '0'. In an
ideal situation, transmission of logic '1' is equal to $V_{dd}$ and transmission of logic '0' is
equal to $V_{ss}$. The voltage swing is then equal to the supply voltage and so a reduction in
supply voltage results in lower power dissipation.
The voltage swing is also reduced when the signals are not fully transmitted. This occurs
when an nMOS transmits a '1' or a pMOS transmits a '0'. Thus having several transistors
connected in series results in a weaker driving capability at the output. This becomes an
important design and reliability issue when technologies of 0.13 µm and below are used and
the supply voltage is small.
Table 4.3 highlights the main contributors to an increase in power dissipation. These
include the short-circuit current isc, and an incomplete voltage swing where a '1' is transmitted
through an nMOS transistor and a '0' is transmitted through a pMOS transistor. The speed
can be evaluated using the critical path in each of the circuits. Only the transistors that
transmit a signal contributing to the output are counted as being part of the critical path.
XOR Circuit   Transistor Count   Critical Path Occurrence
a             12                 4
b             12                 3
c             14                 4
d             9                  4
e             10                 3
f             4                  2
g             4                  1
h             4                  2
Table 4.3 Power Dissipation Elements
XOR Circuit   Incomplete Voltage Swing                  Source of isc
a             none                                      a and b
b             none                                      Output
c             none                                      a and b and Output
d             none                                      a and b and Output
e             none                                      a and b and Output
f             00 (Vss + Vth) and 10 (Vdd - Vth)         a
g             00 (Vss + Vth)                            none
h             00 (Vss + Vth) and 01 (Vdd - Vth)         none
The fastest gate is (g) as it has only 1 critical path that consists of 2 transistors. Circuits (f) 
and (h) are also expected to be fast, while the presence of the feedback loop in (d) will 
cause the circuit to be slow as the result of the voltage swing corrector. A comparison of the 
dissipated power for a load of 0.01 pF is shown in Figure 4.6. 
[Plot: curves for XOR circuits (a) to (h); axis label: Power]
Figure 4.6 Dissipated Power for Various XOR Circuit Configurations 
Both Powerless XOR and Groundless XOR consume approximately 10% less power than 
the design (f), and 3 times less power than Static CMOS XOR designs of (a) and (b). The 
fastest design is (h). The power-delay product is also illustrated in Table 4.4 for 
convenience of referencing. 
Table 4.4 Power-Delay Product with 0.01 pF Load
Power   Delay   Power-Delay
h       g       g
g       f       h
f       h       f
e       e       e
a       a       a
b       c       b
d       b       d
c       d       c
Approach (g) has the lowest power-delay product followed by the Powerless XOR and the
Groundless XOR.
4.3 Classic Static CMOS Logic Styles 
CMOS implementation of the adder circuit involves realising the adder using pMOS and
nMOS transistors. Several approaches will be discussed:
• Direct realisation of the full-adder behavioural description,
• Symmetric adder configuration,
• Multiplexer based adder design,
• Transmission Gate based adder design, and
• SPL adder design.
4.3.1 Direct Adder Implementation 
Figure 4.7 illustrates the functionality of a full adder circuit [1, 2].
Figure 4.7 Full Adder Block Diagram 
The inputs to the circuit are si and ai as well as ci, which is the carry input from the previous
stage (if one exists). so is the sum output from the full adder and co is the carry out to the
next stage. The Boolean equations for the full adder operation are:
so = a'b'ci + a'bci' + ab'ci' + abci
co = bci + ab + aci
While this design is reliable and uniform, it will need 46 transistors. Therefore it consumes 
large chip area for implementation as part of a configurable primitive. 
4.3.2 Symmetric Adder Implementation 
The symmetric adder shown in Figure 4.8 describes so as a function of the output carry co.
The behavioural description is [1, 2]:
co = ab + bci + aci
so = abci + co'(a + b + ci)
Figure 4.8 Symmetric Adder Schematic 
Implementation of the symmetric adder requires a total of 28 transistors, making this a
more efficient configuration than the direct CMOS implementation. The symmetric adder
therefore reduces the transistor count, and hence the silicon area, by almost 40%.
4.3.3 Multiplexer Based Adder Design [66, 67] 
Using multiplexers in adder design creates new design options. In this section
a number of choices from the literature are reviewed. Figure 4.9 shows the architecture of
a 1-bit multiplexer based full adder (MUXA1) using 6 multiplexer gates.
Figure 4.9 1-Bit Multiplexer Based Full Adder (MUXA1)
All internal nodes are connected to the input signals a, b and carry in, ci, which
minimises the switching activity and the short-circuit current at these nodes.
The output nodes, sum so and carry co, are also charged and discharged only through these
input signals, enabling the circuit to recover most of the power dissipated at those nodes.
The delays of the sum so and the carry co are modelled by:
delay(so) = 3 x multiplexer gate critical delay
delay(co) = 2 x multiplexer gate critical delay
There are a number of variations on this approach that aim to improve the delay and
transistor count. Two inverters are included in the circuit MUXA2 shown in Figure 4.10. The co
and so delays of MUXA2 correspond to five multiplexer gates and two inverters.
Figure 4.10 MUXA2 Design Variation
The delay of so is:
delay(so) = (2 x multiplexer gate critical delay) + (2 x inverter critical delay)
while the delay of co is modelled by:
delay(co) = multiplexer gate critical delay + Sel port to Out port delay
The power dissipation of MUXA2 is higher than that of MUXA1, due to the increase in the
switching activity and the short-circuit current of intermediate nodes.
MUXA3 is an improvement on the MUXA2 architecture and uses four multiplexer
gates and two inverters, as illustrated in Figure 4.11.
Figure 4.11 The MUXA3 Design
MUXA3 results in reduced power consumption compared to MUXA2, and the so signal
delay is slightly less than that of MUXA2, but the disadvantage is that the co delay is
higher. The MUXA3 delays are modelled by:
delay(so) = multiplexer gate critical delay + inverter critical delay +
Sel port to Out port delay
delay(co) = multiplexer gate critical delay + Sel port to Out port delay +
inverter critical delay
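The delay expressions for MUXA1, MUXA2 and MUXA3 can be collected into one small model for side-by-side comparison. The sketch below is illustrative: the parameter names and the integer picosecond values in the usage note are hypothetical and are not measurements from the thesis or from [67].

```python
def mux_adder_delays(t_mux, t_inv, t_sel):
    """Return the modelled so and co delays for each adder variant.

    t_mux: multiplexer gate critical delay
    t_inv: inverter critical delay
    t_sel: Sel port to Out port delay
    All three must use the same time unit.
    """
    return {
        "MUXA1": {"so": 3 * t_mux, "co": 2 * t_mux},
        "MUXA2": {"so": 2 * t_mux + 2 * t_inv, "co": t_mux + t_sel},
        "MUXA3": {"so": t_mux + t_inv + t_sel, "co": t_mux + t_sel + t_inv},
    }
```

For example, with assumed values of 100, 40 and 70 (arbitrary units), `mux_adder_delays(100, 40, 70)["MUXA3"]` gives equal so and co delays, because the two MUXA3 expressions contain the same three terms.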
Figure 4.12 shows MUXA4, another modification of MUXA2, which eliminates one of the
inverters of MUXA2 and replaces it with two CMOS transistors. One of these is basically
a CMOS inverter with its nMOS source connected to input signal 'a' instead of Vss, and the
second one has its pMOS connected to input signal 'a' instead of Vdd. This approach results
in a reduction of short-circuit current and hence reduced power dissipation. The delay
characteristic of MUXA4 reveals that the so delay is slightly higher than that of MUXA2,
while the co delay is almost the same.
Figure 4.12 The MUXA4 Design
Although there are several options available for the design of the multiplexer primitives, the
Pass-Gate Logic implementation [1], illustrated in Figure 4.13, provides a promising
solution due to its high speed, relatively low transistor count and potentially highly
optimised layout. This approach is considered for further investigation and is implemented
in the design of adders.
Figure 4.13 2-Input CMOS Multiplexer Using Pass-Gate CMOS Logic
Hence the 2-input CMOS multiplexer of Figure 4.13 is embedded into the adder circuit
diagrams of Figures 4.9 to 4.12 as the multiplexer's core primitive. The resulting
number of transistors for each adder cell is outlined in Table 4.5, together with a summary
of performance for the multiplexer based adders. A comparison is also made with the
conventional 28-transistor CMOS Symmetric adder for a 0.35µm, 3.3V process [67].
The columns 'so Critical Delay' and 'co Critical Delay' indicate the percentage gain
achieved with the multiplexer adder circuits as compared with the 28-transistor Static CMOS
Symmetric implementation. For the column 'Power x Delay Product', the values
indicate the percentage savings made by the multiplexer adder cells over that of CMOS.
Table 4.5 Comparison for Variations of Multiplexer Based Adders
Adder Cell   Transistor   so Critical   co Critical   Power x Delay
             Number       Delay %       Delay %       Product %
MUXA1        30           20            16            0
MUXA2        30           20            23            0
MUXA3        22           24            21            15
MUXA4        32           20            23            18
The regular structure of MUXA1 appears to provide the most regular layout, which
could meet the minimum chip area requirement. This option will be investigated further.
Figure 4.14 illustrates the schematic for the multiplexer based adder derived from MUXA1.
Figure 4.14 Multiplexer Based Adder 
4.3.4 Transmission Gate Based Adder 
An alternative adder architecture is the Transmission Gate (TG) based full adder [1].
The advantage of this design is its significant optimisation of area, due to the lower number
of transistors per adder. Using a TG based design, a 1-bit adder has only 18 transistors,
compared to 28 transistors for the multiplexer based adder. The schematic for the TG based full
adder circuit is illustrated in Figure 4.15.
Figure 4.15 1-Bit TG Based Adder Schematic 
4.3.5 SPL and SPLTV Based Adder 
Simplicity of Single Phase Logic (SPL) style appears to satisfy the requirements for 
minimum silicon chip area. Figure 4.16 shows an adder design based on SPL [56]. 
[Figure labels: pass-transistor network; output buffer; SUM]
Figure 4.16 Single Phase Logic (SPL) Based Adder
Slight modifications to the design of SPL result in the Single Phase Logic Two Voltage
(SPLTV) element. A comparison of single voltage SPL with 2-voltage SPL, i.e.
SPLTV, is shown in Figure 4.17. Note that in single voltage SPL:
Vdd = 3.3V
and in 2-voltage SPL, i.e. SPLTV:
Vdd1 = 3.3V, and
Vdd2 = Vdd - Vtn = 2.5V
Figure 4.17 SPL and SPLTV
From Figure 4.18, the dynamic power dissipation for Single Supply SPL is given by:
$$P_{d1} = \alpha (C_L + C_m) f V_{dd}^2$$
The dynamic power for Two Voltage SPL is:
$$P_{d2} = \alpha C_d (V_{dd} - V_{tn})^2 < P_{d1}$$
The relation highlights that dynamic power dissipation for SPLTV is less than that of
Single Voltage SPL.
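To give a feel for the size of the saving, the sketch below evaluates only the voltage-dependent part of the comparison. It assumes, purely for illustration, that the activity factor, switched capacitance and frequency are the same in both cases, which the thesis does not state; the threshold value of 0.8V is inferred from Vdd2 = Vdd - Vtn = 2.5V with Vdd = 3.3V.

```python
def dynamic_power(alpha, capacitance, frequency, v_swing):
    """Classic dynamic power estimate: alpha * C * f * V^2."""
    return alpha * capacitance * frequency * v_swing ** 2

def spltv_to_spl_power_ratio(vdd=3.3, vtn=0.8):
    """Ratio of dynamic power at the reduced swing (vdd - vtn) to that at vdd.

    Assumes identical alpha, capacitance and frequency, so these cancel.
    """
    return ((vdd - vtn) / vdd) ** 2
```

Under these assumptions the ratio is (2.5/3.3)^2, a little under 0.6, which is consistent with the inequality above.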
Figure 4.18 SPL Comparison of Single Voltage SPL and Two Voltage SPL (SPLTV) 
Due to shorter transitions during switching of SPLTV, its short-circuit power dissipation is
less than that of SPL. The power and delay characteristics of a full adder using a 0.35µm,
3.3V process, where the circuit load is a 100µm metal track followed by 1, 4 or 10 standard
inverters, are illustrated in Figure 4.19 [56]. The performance of SPLTV with Vdd2 = 2.5V and
of a CPL (complementary extension of SPL) based adder is included in the same figure for
comparison.
[Plot: curves labelled SPLTV, SPL and CPL; horizontal axis: power]
Figure 4.19 Power and Delay Characteristics of SPL, SPLTV and CPL Full Adders 
(0.35µm, 3.3V Process Technology) 
The power and delay characteristics of a full adder for a 0.18µm process and Vdd = 1.8V,
with similar circuit loads of a 100µm metal track followed by 1, 4 and 10 standard inverters,
are shown for SPL and CPL in Figure 4.20.
[Plot: curves labelled SPL and CPL; horizontal axis: power]
Figure 4.20 Power and Delay of a Full Adder Using SPL and CPL for 
0.18µm, 1.8V Process Technology 
4.4 Conclusions 
The choice of logic style used to implement the arithmetic primitives is important, as it
affects the efficiency of the overall system. Static CMOS is popular and produces results
that are reliable and widely accepted [1]. On the other hand, XOR/XNOR gates can be
implemented using AND, OR and NOT gates, which can lead to large redundancy of circuit
elements within the system.
Using Pass-Transistor Logic (PTL), only one pass-transistor network (either nMOS or
pMOS) is required, which results in a lower number of transistors and chip area
optimisation. However, the disadvantage of this design is the threshold voltage drop at the
output, which necessitates the use of buffers, adding to the overall circuit power dissipation
and increasing the chip area.
Several variations of the multiplexer based adder architecture using the pass-gate CMOS
multiplexer show that the multiplexer based cells are approximately 20% faster for the
so and co signals than the 28-transistor CMOS Symmetric Adder. The
power-delay product savings can also be up to about 20%. The regular structure of MUXA1
appears to provide a regular layout which could meet the minimum chip area
requirement. This option will be investigated further.
Although power consumption is lower in SPL and CPL than in Static CMOS, and layouts can
be more compact, both logic styles are sensitive to technology scaling without a
corresponding scaling of Vtn. This implies that SPL, SPLTV and CPL are unsuitable logic styles
for scaled technologies such as TDST (0.13µm and below).
Chapter 5 
Scalable Serial/Parallel 
Multiplication Primitives 
5.0 Summary 
Multiplication is the most complex of all computations within the ALU, hence it is the most
complex primitive in the ALU component of the 3-D Soft-Chip. The selection criteria for the
various design options are dictated by the ability for word-length expansion within the
component, configurability, and minimum silicon chip area usage. In the sections to follow
two suitable architectures will be presented. These are: 
• Configurable serial/parallel multiplier, followed by,
• Configurable parallel/parallel multiplier in chapter 6.
Their suitability will be examined for implementation as part of "Sea-of-Pixel" processor 
concept. 
Hardware multiplication uses adders as its basic primitive. Therefore much of the insight
from Chapter 3 and Chapter 4 becomes invaluable in the choice of both the logic style and the
design of the adder and multiplier cells. A very important consideration is the choice of the
size of the m x m basic serial/parallel cells for an input word length of n bits.
5.1 Configurable Serial/Parallel Multiplication Algorithm 
The serial/parallel multiplier algorithm is an option for implementation in the ALU, since
the notion of "serial" components inherently points to reduced silicon chip area [1]. The
following is a summary of Bermak's work [76], on which the modified serial/parallel
multiplier implementation is based. Two unsigned fixed-point numbers represented by m and n
bits, respectively, can be described by:
$$a(m) = a_{m-1} \ldots a_0$$
$$b(n) = b_{n-1} \ldots b_0$$
The double word-length product Q(m+n) is:
$$Q(m+n) = a(m)\,b(n) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$$
$$Q(m+n) = \sum_{k=0}^{m+n-1} q_k 2^k$$
where
$$q_k = \sum_{i=0}^{k} a_i b_{k-i}$$
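The double sum and the column-sum form q_k can both be checked numerically against ordinary integer multiplication. In this sketch the helper names are illustrative, and bits outside the word (a_i for i >= m, b_j for j >= n) are taken as zero, which is an assumption needed to evaluate q_k for large k.

```python
def to_bits(value, width):
    """Little-endian bit list: index i holds the coefficient of 2**i."""
    return [(value >> i) & 1 for i in range(width)]

def product_by_double_sum(a, b, m, n):
    """Evaluate sum_i sum_j a_i * b_j * 2**(i + j)."""
    a_bits, b_bits = to_bits(a, m), to_bits(b, n)
    return sum((a_bits[i] * b_bits[j]) << (i + j)
               for i in range(m) for j in range(n))

def column_sums(a, b, m, n):
    """Return q_k = sum_{i=0..k} a_i * b_{k-i} for k = 0 .. m+n-1."""
    a_bits, b_bits = to_bits(a, m), to_bits(b, n)
    q = []
    for k in range(m + n):
        q.append(sum(a_bits[i] * b_bits[k - i]
                     for i in range(k + 1)
                     if i < m and k - i < n))  # out-of-range bits are zero
    return q
```

Note that each q_k is a column sum that may exceed 1, so the q_k are not yet the final product bits; carries still have to be propagated.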
Using 2's complement format for representing signed numbers, A(m) and B(n) can be 
written as: 
$$A(m) = -a_{m-1} 2^{m-1} + \sum_{i=0}^{m-2} a_i 2^i$$
$$B(n) = -b_{n-1} 2^{n-1} + \sum_{j=0}^{n-2} b_j 2^j$$
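The 2's complement expression gives the most significant bit a negative weight and every other bit its usual positive weight. A small illustrative sketch (the function name and the little-endian bit order are assumptions):

```python
def twos_complement_value(bits):
    """Value of a 2's complement word given little-endian bits.

    bits[i] is the coefficient a_i; the last entry is the sign bit a_{m-1}.
    """
    m = len(bits)
    magnitude = sum(bits[i] << i for i in range(m - 1))
    return -(bits[m - 1] << (m - 1)) + magnitude
```

For a 4-bit word, `[1, 1, 1, 1]` evaluates to -1 and `[0, 0, 0, 1]` to -8.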
Various word length multiplications can be configured using the basic 1-bit primitive, the
design of which will be discussed later in this chapter. Figure 5.1 illustrates the 4/8/16-bit
configurable serial/parallel multiplier [76].
Figure 5.1 Configurable 4/8/16-Bit Serial/Parallel Multiplier
The architecture configurations for higher order word lengths, namely 16-bit through to
64-bit arrays derived from the basic 4-bit primitive, are illustrated in Figures 5.2, 5.3, 5.4 and 5.5.
[Figure labels: 4-bit architecture; adder; multiplier logic; R1; R2]
Figure 5.2 8-Bit Array Multiplication Using 4-Bit Multiplier Primitive 
Figure 5.3 16-Bit Array Multiplier Configuration 
Figure 5.4 32-Bit Array Multiplier Configuration 
Figure 5.5 64-Bit Array Multiplier Configuration 
Subsequent mapping into a 16/32/64-bit configuration is shown in Figure 5.6 [76]. 
Figure 5.6 16/32/64-Bit Serial/Parallel Multiplier Configuration 
5.2 Logical Architecture of Proposed Configurable Serial/Parallel Multipliers 
Figure 5.7 provides a modified approach in mapping the basic 4, 8, 12, 16, 20, 24, 28, and 
32-bit serial/parallel multiplier primitive as part of a configurable array. The multiplier
computes the product of 2 unsigned binary numbers. The architecture includes related
multiplexer and input/output registers for storage of data.
In the sections to follow, the steps in the design and layout of the multiplier using 0.13µm
process technology with a 1.2V supply will be presented. To implement the circuit layout,
extraction, and simulation for each building block, most or all of these steps are followed as
applicable:
• The Boolean equation for the building block is specified, F = ...
• To implement the function F, F' is realised.
• F' constructs the pull-down network. The pull-up network is the complement of
the pull-down network.
• The layout for the circuit is realised.
• Each basic cell is custom made, for example if a multiplier consists of an AND
circuit and a full-adder, instead of placing these cells side by side, they have
been integrated, through merging of common nodes in adjacent active regions
and stacking diffusion regions, to create a multiplier cell, hence optimising
silicon area.
• Where it is appropriate, individual component layouts are pitch matched to the
adjacent cells within the array. This means that the height of the component is
increased to be inline with other cells within the row. In this way power buses
run uniformly across various components within the array.
• The optimised layout is realised using 0.13µm, 1.2V process technology,
implemented on Mentor Graphics IC Design Platform.
• The Design Rule Check (DRC) is carried out, using the Mentor Graphics DRC
tool, ensuring the layout does not violate any design rules for the specified
process technology.
• The spice netlist for the layout is produced through extraction of the layout,
using the Mentor Graphics IC Extract environment.
• The acquired spice file is modified according to the given spice model, by
appending introductory library source, environmental specifications, output
information, power components and power sources.
• The spice file is run using hspice, and the simulation result displayed in the
Awaves window.
• For circuits implementing large multiplication, the simulation result shown at
the end of each section may be a sample of the complete simulation obtained, as
the volume of the display data is very large.
• The name given to each circuit within the Mentor Design environment is
indicated within each section in italics.
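The pull-down and pull-up step in the flow above can be illustrated with a small switch-level model. This is a sketch under stated assumptions, not the thesis tool flow: networks are written as nested tuples (a hypothetical representation), nMOS devices conduct on a 1 input and pMOS devices on a 0 input, and the pull-up network is obtained as the series/parallel dual of the pull-down network.

```python
from itertools import product

def dual(network):
    """Swap series and parallel connections to obtain the dual network."""
    if isinstance(network, str):          # a single transistor gate input
        return network
    kind, *parts = network
    swapped = "parallel" if kind == "series" else "series"
    return (swapped, *[dual(p) for p in parts])

def conducts(network, inputs, on_value):
    """True if the network forms a conducting path for the given inputs."""
    if isinstance(network, str):
        return inputs[network] == on_value
    kind, *parts = network
    paths = [conducts(p, inputs, on_value) for p in parts]
    return all(paths) if kind == "series" else any(paths)

def static_cmos_output(pull_down, inputs):
    """Output of a complementary gate whose pull-down realises F'."""
    down = conducts(pull_down, inputs, 1)        # nMOS network to ground
    up = conducts(dual(pull_down), inputs, 0)    # pMOS network to supply
    assert up != down, "exactly one network must conduct"
    return 1 if up else 0

def and_gate(a, b):
    """F = a.b: realise F' (a NAND) and follow it with an inverter."""
    nand = static_cmos_output(("series", "a", "b"), {"a": a, "b": b})
    return 1 - nand
```

The internal assertion reflects the defining property of a complementary static gate: for every input pattern either the pull-up or the pull-down network conducts, never both.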
Figure 5.7 Logical Architecture for a Configurable 4/8/12/16/20/24/28/32-Bit
Serial/Parallel Multiplier
The logical architecture of a serial/parallel multiplier is illustrated in Figure 5.7. The 
operation of this serial/parallel multiplier can be described as follows: 
• The task of the column of multiplexers labelled M1 is to input the signal bin into the
4-bit registers.
• If r = 0, then bin comes from the right (one of the bits of the binary number), and if
r = 1, bin is input from the lower stages of the array, which means that more than
one 4-bit serial/parallel multiplier is utilised to implement higher order
multiplications.
• 4-bit registers need to be placed before each 4-bit serial/parallel multiplier to allow 4
clock cycles for the input to be transferred.
• The task of the multiplexers under column M2, with control signals q, is to transfer
bin from the right hand adjacent 4-bit register to the 4-bit serial multiplier (if q = 0),
or transfer bin from the multiplier above (when q = 1), which occurs if higher order
multiplication is implemented.
• The task of the multiplexers under column M3, with control signals s, is to transfer
partial products from the 4-bit serial multiplier above (when s = 1), which occurs if
higher order multiplication is implemented.
• The multiplexers under M4 are 8-input multiplexers needed to implement 4, 8, 12,
16, 20, 24, 28, and 32-bit multiplication.
• The 8-bit registers are needed so that completed results can be transferred out of the
4-bit multiplier.
• The configurable nature of this circuit allows it to be connected within an array of
serial multiplier cells. This also means that depending on the control signal status of
the multiplexers it can perform 4/8/12/16/20/24/28/32-bit multiplication. Table 5.1
outlines the various combinations of the control signals necessary to implement
various length multiplications.
Table 5.1 Configuration for Variable Word-Length Multiplications up to 32-Bit 
Control   4x4   8x8   12x12   16x16   20x20   24x24   28x28   32x32
r0        0     1     1       1       1       1       1       1
q0        0     0     0       0       0       0       0       0
m0n0p0    000   001   010     011     100     101     110     111
s0        0     0     0       0       0       0       0       0
r1        0     0     1       1       1       1       1       1
q1        0     1     1       1       1       1       1       1
m1n1p1    000   001   001     001     001     001     001     001
s1        0     1     1       1       1       1       1       1
r2        0     1     0       1       1       1       1       1
q2        0     0     1       1       1       1       1       1
m2n2p2    000   000   001     001     001     001     001     001
s2        0     0     1       1       1       1       1       1
r3        0     0     0       0       1       1       1       1
q3        0     1     0       1       1       1       1       1
m3n3p3    000   001   000     001     001     001     001     001
s3        0     1     0       1       1       1       1       1
r4        0     1     1       1       0       1       1       1
q4        0     0     0       0       1       1       1       1
m4n4p4    000   000   010     011     001     001     001     001
s4        0     0     0       0       1       1       1       1
r5        0     0     1       1       0       0       1       1
q5        0     1     1       1       0       1       1       1
m5n5p5    000   001   001     001     000     001     001     001
s5        0     1     1       1       0       1       1       1
r6        0     1     0       1       0       0       0       1
q6        0     0     1       1       0       0       1       1
m6n6p6    000   001   001     001     000     000     001     001
s6        0     0     1       1       0       0       1       1
r7        0     0     0       0       0       0       0       0
q7        0     1     0       1       0       0       0       1
m7n7p7    000   001   000     001     000     000     000     001
s7        0     1     0       1       0       0       0       1
5.3 Layout Implementation of Serial/Parallel Multiplier 
The serial/parallel multiplier is implemented using CSA architecture style, with TG based 
adders to create multiplier cells. This decision was made due to the area efficiency of TG 
based adders. The layout of configurable serial/parallel multiplier is realised using 0.13 µm 
process technology and 1.2V voltage supply. 
5.3.1 Layout of 1-Bit Serial/Parallel Multiplier 
The 1-bit serial/parallel multiplier block of Figure 5.8, SerMult1bit, is composed of an
AND gate and a TG based full-adder circuit.
Figure 5.8 1-Bit Multiplier Block Diagram 
The 1-bit multiplier adds the partial product of two bits from two binary numbers, a and b,
to the sum of the previous stage and the carry from the previous clock cycle. It produces a
sum for the next stage and a carry for the next clock cycle. The bin input is transferred to the
adjacent multiplier cell through a register. Another register saves the co of the previous
cycle for the ci of the next cycle. Here a Carry Save Adder, described in Chapter 2, is used as
the most suitable option.
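The cell behaviour described above can be exercised with a small cycle-level simulation. The sketch below is one plausible reading of the description, not the thesis netlist: the multiplicand a is applied in parallel (one bit per cell), b is shifted in serially through one register per cell, the sum passes combinationally from cell to cell within a clock cycle, and each cell keeps its own carry register. The function name and the choice of m + n cycles are assumptions of this sketch.

```python
def serial_parallel_multiply(a, b, m, n):
    """Multiply an m-bit a (parallel) by an n-bit b (serial, LSB first)."""
    a_bits = [(a >> j) & 1 for j in range(m)]
    b_regs = [0] * m       # b_regs[j]: the b bit seen by cell j this cycle
    carries = [0] * m      # per-cell carry register (co saved for next ci)
    product = 0
    for t in range(m + n):
        b_in = (b >> t) & 1 if t < n else 0   # zeros flush the pipeline
        b_regs = [b_in] + b_regs[:-1]         # b moves on by one cell
        s = 0                                 # sum input of the first cell
        for j in range(m):
            total = (a_bits[j] & b_regs[j]) + s + carries[j]
            s, carries[j] = total & 1, total >> 1
        product |= s << t                     # last cell emits product bit t
    return product
```

In this arrangement every cell works on partial-product bits of the same weight in a given cycle, so the sum leaving the last cell in cycle t is bit t of the product, while the saved carries are added one cycle later at the next weight.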
The optimised layout together with the simulations of the output using 0.13 µm process and 
1.2V supply, are shown in Figures 5.9 and 5.10 respectively. The dimension for the 1-bit 
multiplier is 5.0µmx4.5µm which results in a silicon utilisation of 22.5µm2.
Figure 5.9 1-Bit Multiplier Layout Using 0.13µm Process and 1.2V Supply 
The simulation result for the circuit is illustrated in Figure 5.10.
Figure 5.10 1-Bit Multiplier Simulation 
5.3.2 Register Circuit Layout 
The block diagram of the register circuit, Reg, shown in Figure 5.11, illustrates the
synchronous nature of the circuit.
Figure 5.11 Register Block Diagram 
The register is reset at clr = 1. Furthermore, it samples or reads data at clk = 0, and writes or
outputs data at clk = 1. The optimised layout of the register circuit, with dimensions of
4.84µm x 3.94µm, is shown in Figure 5.12.
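The stated behaviour (reset on clr = 1, sample while clk = 0, output while clk = 1) matches a two-stage, master/slave style register. The following behavioural sketch is an assumption about the structure, written only to make the timing concrete; the class and attribute names are hypothetical.

```python
class Register:
    """Behavioural model of a 1-bit register with clear."""

    def __init__(self):
        self.master = 0   # value sampled while clk = 0
        self.out = 0      # value presented at the output

    def step(self, clk, data_in, clr=0):
        """Apply one set of input levels and return the output."""
        if clr == 1:
            self.master = 0
            self.out = 0
        elif clk == 0:
            self.master = data_in      # read phase
        else:
            self.out = self.master     # write phase
        return self.out
```

A value presented while clk = 0 therefore appears at the output only after clk rises to 1.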
Figure 5.12 Register Layout Using 0.13µm Process and 1.2V Supply 
5.3.3 Multiplexer Circuit Layout 
The multiplexer circuit, Muxl 2, facilitates communication between each 4-bit 
serial/parallel multiplier within the array and transfers data such as partial multiplication 
results and bin. Figure 5.13 illustrates the logical architecture of the multiplexer.
Figure 5.13 Multiplexer Logical Architecture 
The multiplexer output is determined by the value of the control signal cntrl. If cntrl = 0,
then 'a' appears at the output, otherwise 'b' is transferred to the output. The optimised
layout of the multiplexer circuit, with dimensions of 2.4µm x 4.02µm, is shown in Figure 5.14.
Figure 5.14 Multiplexer Layout Using 0.13µm Process and 1.2V Supply 
Extending the approach used for the multiplexer of Figure 5.14 to an 8-input
multiplexer creates the circuit 8IPMux, the block diagram of which is shown in
Figure 5.15.
Figure 5.15 8-Input Multiplexer Block Diagram 
Here one of the 8 inputs is selected to be transferred to the output, based on the logic values
of the control signals m, n and p. The following truth table describes the behaviour of the
multiplexer.
Table 5.2 8-Input Multiplexer Behaviour 
m n p Out 
0 0 0 a 
0 0 1 b 
0 1 0 c 
0 1 1 d 
1 0 0 e 
1 0 1 f 
1 1 0 g 
1 1 1 h 
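The truth table can be reproduced by composing 2-input multiplexers. The sketch below uses the 2-input convention stated earlier (cntrl = 0 selects 'a', otherwise 'b'); arranging seven 2-input multiplexers as a three-level tree, with m as the most significant select, is an assumption made for illustration and is not necessarily how 8IPMux is laid out.

```python
def mux2(a, b, cntrl):
    """2-input multiplexer: cntrl = 0 passes a, otherwise b."""
    return a if cntrl == 0 else b

def mux8(inputs, m, n, p):
    """8-input multiplexer built as a tree of 2-input multiplexers.

    inputs is the sequence (a, b, c, d, e, f, g, h) of Table 5.2.
    """
    a, b, c, d, e, f, g, h = inputs
    lower = mux2(mux2(a, b, p), mux2(c, d, p), n)   # selected when m = 0
    upper = mux2(mux2(e, f, p), mux2(g, h, p), n)   # selected when m = 1
    return mux2(lower, upper, m)
```

For instance, `mux8("abcdefgh", 1, 0, 1)` returns `"f"`, matching the row m n p = 1 0 1.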
The schematic and the optimised layout of the circuit are illustrated in Figure 5.16. The
dimensions of the layout are 10.04µm x 4.05µm.
(a) (b) 
Figure 5.16 8-Input Multiplexer (a) Schematic, (b) Layout 
5.3.4 Bit Slice Layout 
A bit slice of the configurable serial/parallel multiplier for 4-bit word lengths through to 32-
bits is shown in Figure 5.17. The serial/parallel multiplier is composed of: 
• a 4-bit register, Reg4bit,
• a 4-bit serial/parallel multiplier SerMult4bit,
• an 8-bit register Reg8bit,
• 3 two-input multiplexers and
• an 8-input multiplexer, 8IPMux, for data transfer between cells.
Figure 5.17 4-Bit Multiplier with Registers and Capability for Word Expansion 
The functionality of the 4-bit multiplier with registers and capability for word expansion of 
up to 32-bits is as follows: 
• The multiplexer circuit M1 chooses the serial data bin to be input into the circuit
if the mode of operation is based on 4-bit multiplication, otherwise the bin signal
is received from lower cells implementing higher-bit multiplications.
• The b signal is transferred into the 4-bit register, and 4 clock cycles later it reaches
the input of M2.
• M2 decides whether the b signal will be transferred to the 4-bit multiplier cell for
4-bit multiplication or the b signal from the upper level is input into the multiplier for
higher-bit multiplication.
• M3 is responsible for transferring partial products from the cells above in the case of
higher-bit multiplications.
• The partial product is computed using the 4-bit serial/parallel multiplier.
• The 8-input multiplexer then decides whether this partial product or other partial
products from other rows within the array will be transferred to the 8-bit register as
an output.
The optimised layout of the 4-bit serial/parallel multiplier with embedded registers and
multiplexers is shown in Figure 5.18. The layout has dimensions of 135.6µm x 5.2µm,
resulting in a silicon area utilisation of 704µm2. The Hspice simulation for the circuit is
shown in Figure 5.19.
Figure 5.18 4-Bit Clocked Serial/Parallel Multiplier Layout in 0.13µm, 1.2V Process 
Figure 5.19 4-Bit Clocked Serial/Parallel Multiplier Simulation 
5.3.5 4/8/12/16/20/24/28/32-Bit Clocked Serial/Parallel Multiplier Layout 
The configurable 4/8/12/16/20/24/28/32-bit multiplier circuit of Figure 5.7, SM32bit, is
composed of a 1x8 array of the 4-bit serial/parallel multiplier circuit, SM4bitWR. The
optimised layout of the multiplier circuit is illustrated in Figure 5.20 and has dimensions of
148.2µm x 56.1µm, corresponding to a silicon area utilisation of 8314µm2.
Figure 5.20 Layout for 4/8/12/16/20/24/28/32-Bit Configurable Serial/Parallel
Multiplier in 0.13µm, 1.2V Supply Process
Simulation for a sample 4-bit data is shown in Figure 5.21. Figure 5.22 illustrates the 32-bit
sample.
Figure 5.21 Simulation for 4-Bit Multiplication 
Figure 5.22 Simulation for 32-Bit Multiplication
5.4 Conclusions
The serial/parallel multiplier is an option for implementation in the ALU, since the notion of
"serial" arithmetic points to reduced silicon chip area. The 4/8/12/16/20/24/28/32-bit
multiplier circuit is configurable, which means that, depending on the control signal status of
the multiplexers, it can perform 4/8/12/16/20/24/28/32-bit multiplication. It consumes a small
silicon area of 8314µm2 using 0.13µm process technology and a 1.2V supply voltage.
The main drawback of the serial/parallel multiplier approach is that it consumes 4n clock
cycles, where n is the word length, before results are made available. For large word lengths
this can be a problem. However, it is a possible
option for implementation in the 3-D Soft-Chip.
In the next Chapter we compare this approach with the parallel/parallel architecture. 
Chapter 6 
Scalable Parallel/Parallel 
Multiplier Design 
6.0 Summary 
Multiplication can be carried out using various options. The adder implementation within
each multiplier contributes significantly to the performance of the circuit. Adder
architecture choices include CSA, CBA, the Brent-Kung adder, the Transmission Gate (TG)
based adder and Symmetric adders.
In this Chapter a number of parallel/parallel multipliers using several approaches such as 
Symmetric based, Multiplexer based and TG based configurations are implemented in 
0.13µm technology. Each configuration is considered according to the advantage it will 
provide for the construct of the ALU. 
As with the serial/parallel design, the full custom layout was implemented, extracted, and
simulated for each building block, following the design flow of Section 5.2.
6.1 Decision Issues 
Several issues have to be considered in deciding between various parallel/parallel multiplier 
architectures: 
• The layout structure of the Wallace-tree multiplier is more suited to full-custom
layout, as it can more easily be made into a square shape; however, it is slightly larger
than a Dadda multiplier.
• Any of the parallel multiplier architectures may be pipelined. We may also use a
varied pipelined approach that tailors the register locations to the size of the
multiplier.
• Although power dissipation is reduced using the tree based structures, these
configurations are less regular than an array multiplier. Regularity is an essential
characteristic as it facilitates configurability.
The main objectives of this thesis are area minimisation and configurability, hence the
architecture styles used to implement the parallel/parallel multiplier are (i) Symmetric
based, (ii) Multiplexer based and (iii) TG based configurations. Each configuration is
chosen for its advantages and as a base for comparison.
Another important factor in this design is deciding upon the size of the basic array.
Potential options are a 1x1, 2x2 or 4x4 array for each of the basic blocks. A
1x1 array can implement all the different combinations of multiplications but will result in
significant area overhead and wastage of silicon due to redundancy. A 2x2 basic cell could
also realise the majority of multiplication combinations. However, the same disadvantage of
the 1x1 configuration applies here. Thus a basic cell consisting of a 4x4 array
appears to be the most suitable choice for building larger blocks. In this way, most
multiplication configurations can be implemented while maintaining the requirement for an
optimum area.
The complete set of design schematics and simulation results for each of the designs in the
sections to follow is included in Appendix A.
6.2 Unsigned Multiplication 
In this section several options in the design of unsigned multipliers are considered. 
6.2.1 Symmetric Based Multiplier 
The objective is to design various word length binary multipliers based on Symmetric or 
Mirror Adder algorithm. 
6.2.1.1 1-Bit Symmetric Based Multiplier Circuit 
Figure 6.1 illustrates the block diagram of a 1-bit Symmetric based multiplier circuit. The 
multiplier structure consists of an AND circuit and a symmetric full-adder. 
Figure 6.1 1-Bit Symmetric Based Multiplier Block Diagram
The behavioural description of the AND gate and the full-adder circuit are outlined in Table 
6.1. 
Table 6.1 Behavioural Description of (a) AND Gate (b) Full Adder Circuit
(a)
a   b   out (a AND b)
0   0   0
0   1   0
1   0   0
1   1   1
(b)
si  ai  ci  co  so
0   0   0   0   0
0   0   1   0   1
0   1   0   0   1
0   1   1   1   0
1   0   0   0   1
1   0   1   1   0
1   1   0   1   0
1   1   1   1   1
The Boolean equation for the AND gate is: 
F=a.b 
To implement the function F, realise F': 
F' = (a.b)' 
This means that the final circuit realising F will have an inverter at the output. F'
constructs the pull-down network, and the pull-up network is the complement of the
pull-down network, as illustrated in Figure 6.2.
Figure 6.2 AND Circuit Schematic 
In the full adder circuit, si and ai are the inputs, ci is the carry input from the previous stage
(if one exists), so is the sum output from the full adder and co is the carry out to the next
stage. The Boolean equation for the full adder operation is:
so = a'b'ci + a'bci' + ab'ci' + abci      (equation 6.1)
co = bci + ab + aci
The symmetric adder describes so as a function of the output carry co. Hence the
behavioural description is:
co = ab + bci + aci
so = abci + co'(a + b + ci)
The symmetric adder requires a total of 28 transistors, compared with the direct CMOS
implementation of equation 6.1, which consists of 46 transistors. This means the
symmetric option has the potential to save silicon area by almost 40%. To implement co,
first realise co':
co' = (ab + bci + aci)'
co' constructs the pull-down network. The pull-up network is the complement of the
pull-down network. However, the characteristic of the symmetric adder is that the pull-up and
the pull-down circuits are identical. To implement the function so, first realise so':
so' = (abci + co'(a + b + ci))'
so' constructs the pull-down network. The pull-up network of the mirror adder is identical
to the pull-down network. The symmetric adder schematic is shown in Figure 6.3.
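The identical pull-up and pull-down topologies are possible because both expressions are self-dual: complementing every input complements the result, so swapping series and parallel connections yields the same function. The sketch below checks this property exhaustively; treating co' as an independent fourth input x of the sum stage is a modelling assumption, and the function names are illustrative.

```python
from itertools import product

def carry_expr(a, b, ci):
    """co = ab + bci + aci"""
    return (a & b) | (b & ci) | (a & ci)

def sum_expr(a, b, ci, x):
    """so = abci + x(a + b + ci), where x stands for the co' input."""
    return (a & b & ci) | (x & (a | b | ci))

def is_self_dual(f, arity):
    """True if f(x1', ..., xn') equals the complement of f(x1, ..., xn)."""
    return all(f(*[1 - v for v in xs]) == 1 - f(*xs)
               for xs in product((0, 1), repeat=arity))
```

A plain 2-input AND, by contrast, is not self-dual, which is why its pull-up network (parallel) differs from its pull-down network (series).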
Figure 6.3 Symmetric Full Adder Schematic 
Combining the AND circuit and the symmetric full adder circuit, we obtain the 1-bit multiplier
circuit SymMultiplier. This multiplier adds the partial product of two bits from the two
binary numbers a and b to the sum of the previous stage si and the carry from the previous
stage ci. It produces a sum so and a carry co for the next stage. The schematic and the
optimised layout, with dimensions of 6.4µmx4.8µm in a 0.13µm process, for the 1-bit
symmetric based multiplier are illustrated in Figure 6.4.
(a) (b) 
Figure 6.4 1-Bit Symmetric Based Multiplier (a) Schematic (b) Layout 
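The behaviour of the 1-bit multiplier cell described above can be summarised in a few lines. This is a behavioural sketch only, with a hypothetical function name; it models the logic of the cell and not the transistor-level circuit.

```python
def sym_multiplier_cell(a, b, si, ci):
    """Add the partial product (a AND b) to the incoming sum and carry.

    All arguments are single bits (0 or 1). Returns the pair (so, co).
    """
    partial_product = a & b
    total = partial_product + si + ci   # value between 0 and 3
    so = total & 1                      # sum output
    co = total >> 1                     # carry output
    return so, co


print(sym_multiplier_cell(1, 1, 1, 0))
```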
The circuit layout is then extracted to produce the SPICE netlist used for simulation. The
simulation result for the 1-bit multiplier circuit is illustrated in Figure 6.5.
Figure 6.5 1-Bit Symmetric Based Multiplier Simulation Result 
6.2.1.2 2-Bit Symmetric Based Multiplier Circuit 
The 1-bit multiplier block, SymMultiplier, can be configured in a 2x2 array to create a 2-bit 
multiplier, SymMult2bit. P0 is the result of the multiplication. The logical architecture for 
the 2-bit multiplier is illustrated in Figure 6.6.
Figure 6.6 Logical Architecture for 2-Bit Symmetric Based Multiplier 
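The way 1-bit cells combine into an array can be sketched behaviourally. The wiring below is an assumption made for illustration (carries ripple along each row, and each row passes its sums to the next row shifted by one column); it is not taken from the schematic, and the function names are hypothetical.

```python
def multiplier_cell(a, b, si, ci):
    """One array cell: add the partial product a AND b to si and ci."""
    total = (a & b) + si + ci
    return total & 1, total >> 1  # (so, co)


def array_multiply(a, b, n):
    """Unsigned n x n multiplication on an n-by-n grid of cells.

    Assumed wiring: carries ripple along a row, the lowest sum bit of a row
    is a finished product bit, and the remaining sums plus the final carry
    feed the next row shifted one column towards the LSB.
    """
    sums_in = [0] * n              # si inputs of the first row are tied to 0
    product_bits = []
    for i in range(n):             # one row per bit of b
        b_bit = (b >> i) & 1
        carry = 0                  # ci of the first cell in the row
        sums_out = []
        for j in range(n):         # one cell per bit of a
            a_bit = (a >> j) & 1
            so, carry = multiplier_cell(a_bit, b_bit, sums_in[j], carry)
            sums_out.append(so)
        product_bits.append(sums_out[0])
        sums_in = sums_out[1:] + [carry]
    product_bits.extend(sums_in)   # upper half of the product
    return sum(bit << k for k, bit in enumerate(product_bits))


print(array_multiply(3, 3, 2))     # 2-bit case
print(array_multiply(13, 11, 4))   # 4-bit case
```

The same function covers the 2x2 and 4x4 arrays by changing n.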
The optimised layout of the 2-bit symmetric based multiplier circuit having dimensions of 
12.9µmx10.2µm is also shown in Figure 6.7. 
Figure 6.7 Layout for 2-Bit Symmetric Based Multiplier 
6.2.1.3 4-Bit Symmetric Based Multiplier Circuit 
The 4-bit multiplier block, SymMult4bit, consists of a 4x4 array of the 1-bit multiplier 
circuit SymMultiplier. Figure 6.8 illustrates the logical architecture for this circuit. 
Figure 6.8 Logical Architecture for 4-Bit Symmetric Based Multiplier 
The corresponding layout of the 4x4 multiplier circuit with dimensions of 25.7µmx21.3µm 
is illustrated in Figure 6.9. 
Figure 6.9 4-Bit Symmetric Based Multiplier Layout 
6.2.1.4 8-Bit Symmetric Based Multiplier Circuit 
The 4-bit multiplier block, SymMult4bit, can be configured in a 2x2 array to create an 8-bit
multiplier, SymMult8bit. The schematic and the optimised layout having dimensions of
51.1µmx42.6µm for the circuit are shown in Figure 6.10. The area utilised is approximately
2177µm2.
(a) (b) 
Figure 6.10 8-Bit Symmetric Based Multiplier (a) Schematic (b) Layout 
6.2.2 Multiplexer Based Multiplier Circuit 
An alternative to the symmetric architecture is a multiplexer based design. Circuits
implemented in this way occupy less area: the 8-bit layout of section 6.2.2.4 covers
2042µm2, compared with 2177µm2 for the symmetric design.
6.2.2.1 1-Bit Multiplexer Based Multiplier Circuit 
The 1-bit multiplexer based multiplier block, MuxMult, is composed of an AND circuit and 
a multiplexer based full-adder circuit, as shown in Figure 6.11. 
Figure 6.11 1-Bit Multiplexer Based Multiplier Block Diagram 
The 1-bit multiplier adds the partial product of two bits from the two binary numbers a and 
b, to the sum of the previous stage Si and the carry from the previous stage Ci. It produces a 
sum So and a carry Co for the next stage. The schematic and the optimised layout with 
dimensions of 6.1µmx4.4µm for the circuit are illustrated in Figure 6.12.
(a) (b) 
Figure 6.12 1-Bit Multiplexer Based Multiplier (a) Schematic (b) Layout 
The simulation result for the 1-bit multiplier circuit is shown in Figure 6.13. 
Figure 6.13 1-Bit Multiplexer Based Multiplier Simulation Results 
6.2.2.2 2-Bit Multiplexer Based Multiplier Circuit 
The 2-bit multiplexer based multiplier block, MuxMult2bit, is composed of a 2x2 array of
MuxMult circuits. The optimised layout of the 2-bit multiplier circuit, having dimensions of
12.28µmx10.11µm, follows in Figure 6.14.
Figure 6.14 2-Bit Multiplexer Based Multiplier Layout 
6.2.2.3 4-Bit Multiplexer Based Multiplier Circuit 
The 4-bit multiplexer based multiplier block, MuxMult4bit, is composed of a 4x4 array of
MuxMult circuits. As discussed earlier, larger circuit arrays will be constructed based on this
circuit. The optimised layout of the 4-bit circuit with dimensions of 25.6µmx20.4µm is
illustrated in Figure 6.15.
Figure 6.15 4-Bit Multiplexer Based Multiplier Layout 
6.2.2.4 8-Bit Multiplexer Based Multiplier Circuit 
The 8-bit multiplexer based multiplier block, MuxMult8bit, is composed of a 2x2 array of
MuxMult4bit circuits. The schematic and the optimised layout, having dimensions of
50.3µmx40.6µm corresponding to a chip area of 2042µm2, are illustrated in Figure 6.16.
(a) (b) 
Figure 6.16 8-Bit Multiplexer Based Multiplier (a) Schematic (b) Layout 
6.2.3 Multiplier Design with Periphery Multiplexer Per Bit 
The multiplier design with a periphery multiplexer per bit, shown in Figure 6.17, allocates
3 multiplexers to each 1-bit multiplier to allow communication with adjacent 1-bit
multipliers within the array [62]. For this algorithm, multiplexer based multipliers are
used, as the transistor count from their schematics suggests that they occupy less chip
space than the symmetric design. The 1-bit multiplier with periphery multiplexer
per bit, Mux&Mult, is composed of a 1-bit multiplier circuit (MuxMult) surrounded by a
vertical multiplexer (Mux) at the top and two horizontal multiplexers (Mux2) at the side of
the multiplier circuit.
Figure 6.17 1-Bit Multiplier with Periphery Multiplexer Per Bit Block Diagram 
The behavioural description of the Mux&Mult circuit is given in Table 6.2.
Table 6.2 Behavioural Description of Mux&Mult Circuit
cntrv cntrh out Coo Sot 
0 0 0 co b 
1 si 0 co 
The vertical multiplexer Mux (Figure 6.18) is designed to lie at the top of the multiplier
block.
Figure 6.18 Vertical Multiplexer Block Diagram 
The inputs to the multiplexer are 0 and a, which are transferred to the output based on the
control signal cntrv, as outlined in Table 6.3.
Table 6.3 Behavioural Description of the Vertical Multiplexer Circuit 
s Out
0 0 
1 a 
The schematic and the optimised layout for the vertical multiplexer circuit, having
dimensions of 3.65µmx1.97µm, are illustrated in Figure 6.19.
(a) (b) 
Figure 6.19 Vertical Multiplexer (a) Schematic (b) Layout 
The simulation result for this circuit is illustrated in Figure 6.20. 
Figure 6.20 Vertical multiplexer simulation results 
The horizontal multiplexer Mux2 (Figure 6.21) is designed to be placed horizontally at the
left boundary of the multiplier block.
Figure 6.21 Horizontal Multiplexer Block Diagram 
As shown in Figure 6.21, the inputs to the multiplexer are a and b, which are transferred to
the output based on the control signal cntrh, as described in Table 6.4.
Table 6.4 Behavioural Description of Horizontal Multiplexer Circuit
s Out
0 b
1 a
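The two behavioural tables (6.3 and 6.4) translate directly into selection functions. The sketch below uses hypothetical function names and treats the select input s of each table as the control signal cntrv or cntrh respectively.

```python
def vertical_mux(a, cntrv):
    """Table 6.3: output 0 when cntrv = 0, pass input a when cntrv = 1."""
    return a if cntrv else 0


def horizontal_mux(a, b, cntrh):
    """Table 6.4: pass input b when cntrh = 0, pass input a when cntrh = 1."""
    return a if cntrh else b


# Print every row of both tables.
for cntr in (0, 1):
    print(cntr, vertical_mux(1, cntr), horizontal_mux(1, 0, cntr))
```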
The schematic and the optimised layout with dimensions of 1.94µmx4.59µm for the
horizontal multiplexer circuit are illustrated in Figure 6.22.
(a) (b) 
Figure 6.22 Horizontal Multiplexer (a) Schematic (b) Layout 
The circuit layout has been pitch matched to make the height of the cell equal to that of the 
adjacent multiplier circuit. This is so that the power buses running across the arrays follow 
a uniform and straight path. The simulation result for the multiplexer circuit is illustrated in 
Figure 6.23. 
Figure 6.23 Horizontal Multiplexer Simulation Results 
6.2.3.1 1-Bit Multiplier with Periphery Multiplexer Per Bit 
Based on previous sections, the 1-bit multiplier with periphery multiplexer per bit can be 
designed. The schematic and the optimised layout of the circuit with dimensions of 
8.82µmx6.81µm are illustrated in Figure 6.24. 
(a) (b) 
Figure 6.24 1-Bit Multiplier with Periphery Multiplexer Per Bit (a) Schematic (b) 
Layout 
The simulation result for the 1-bit multiplier circuit follows in Figure 6.25. 
Figure 6.25 1-Bit Multiplier with Periphery Multiplexer Per Bit Simulation 
6.2.3.2 2-Bit Multiplier with Periphery Multiplexer Per Bit 
The 2-bit multiplier, Mux&Mult2bit, is composed of a 2x2 array of Mux&Mult circuits, as
shown in Figure 6.26.
Figure 6.26 2-Bit Multiplier with Periphery Multiplexer Per Bit Block Diagram 
The optimised layout of the multiplier circuit having a dimension of 17.9µmx14.5µm is 
illustrated in Figure 6.27. 
Figure 6.27 2-Bit Multiplier with Periphery Multiplexer Per Bit Layout 
6.2.3.3 4-Bit Multiplier with Periphery Multiplexer Per Bit 
The 4-bit multiplier block, Mux&Mult4bit, is composed of a 4x4 array of Mux&Mult
circuits. Larger arrays will be constructed based on this circuit. The block diagram of this
circuit is illustrated in Figure 6.28.
Figure 6.28 4-Bit Multiplier with Periphery Multiplexer Per Bit Block Diagram 
The optimised layout of the multiplier circuit with dimensions of 35.5µmx29.2µm is 
illustrated in Figure 6.29. 
Figure 6.29 4-Bit Multiplier with Periphery Multiplexer Per Bit Layout 
6.2.3.4 8-Bit Multiplier with Periphery Multiplexer Per Bit
The 8-bit multiplier block, Mux&Mult8bit, is composed of a 2x2 array of Mux&Mult4bit
circuits. The schematic and the optimised layout of the circuit, having dimensions of
70.7µmx59.6µm corresponding to a chip area of 4214µm2, are illustrated in Figure 6.30.
(a) (b) 
Figure 6.30 8-Bit Multiplier with Periphery Multiplexer Per Bit (a) Schematic (b) 
Layout 
6.2.4 Multiplier Design with Periphery Multiplexer Per Array 
This algorithm implements multiplier arrays with multiplexers on the outside periphery of
the basic array instead of around each multiplier cell. This allows communication with adjacent
arrays [62]. The architecture is implemented by placing 1 multiplexer for each column and
2 multiplexers for each row of multipliers within the basic array.
As discussed before, the basic array size on which larger configurations will be built
is a 4x4 array; however, for comparison and completeness, the 2x2 array will also be
realised. In this design, multiplexer based multipliers will be used, as they appear to occupy
less chip space than the symmetric design.
6.2.4.1 2-Bit Multiplier Circuit with Peripheral Multiplexers Per Array 
The 2-bit multiplier, PerMuxMult2bit, is composed of a 2x2 array of MuxMult circuit, 
surrounded by periphery multiplexers, to enable communication with adjacent arrays, as 
illustrated in the block diagram of Figure 6.31. 
Figure 6.31 2-Bit Multiplier with Peripheral Multiplexers Per Array Block Diagram
The schematic and the optimised layout for the circuit are shown in Figure 6.32. The design
has dimensions of 18.3µmx12.0µm, occupying a chip area of 219µm2.
(a) (b)
Figure 6.32 2-Bit Multiplier with Peripheral Multiplexers Per Array (a) Schematic
(b) Layout
The simulation result for the 2-bit multiplier circuit is as outlined in Figure 6.33. 
Figure 6.33 2-Bit Multiplier with Peripheral Multiplexers Per Array Simulation 
6.2.4.2 4-Bit Multiplier Circuit with Peripheral Multiplexers Per Array 
The 4-bit multiplier, PerMuxMult4bit, is composed of a 4x4 array of MuxMult circuits with
array periphery multiplexers. The floor-plan of this circuit is illustrated in Figure 6.34.
Figure 6.34 4-Bit Multiplier with Peripheral Multiplexers Per Array 
The optimised layout of the multiplier, having dimensions of 31.7µmx22.0µm, is
illustrated in Figure 6.35.
Figure 6.35 4-Bit Multiplier with Peripheral Multiplexers Per Array Layout 
6.2.4.3 8-Bit Multiplier Circuit with Peripheral Multiplexers Per 4-Bit Array 
The schematic and layout for an 8-bit multiplier with peripheral multiplexers per 4-bit array
block, PerMuxMult8bit, are shown in Figure 6.36. This is composed of a 2x2 array of
PerMuxMult4bit circuits. The design has dimensions of 63.5µmx44.1µm with an area of
2800µm2.
(a) (b) 
Figure 6.36 8-Bit Multiplier with Peripheral Multiplexers Per Array (a) Schematic 
(b) Layout
6.2.5 Transmission Gate (TG) Based Multiplier Circuit 
An alternative multiplier architecture uses a Transmission Gate (TG) based full adder
to create the multiplier circuit [1]. The advantage of this algorithm is a significant
reduction in area due to the lower number of transistors in the multiplier. Using the TG based
design, a 1-bit adder has only 18 transistors, compared to 28 transistors for the multiplexer
based adder.
6.2.5.1 1-Bit TG Based Multiplier Circuit 
The schematic and the optimised layout for the TG based 1-bit multiplier circuit, TgMult,
are shown in Figure 6.37. The 1-bit TG based multiplier has dimensions of 
5.72µmx3.97µm and occupies a chip area of 22.7µm2.
(a) (b) 
Figure 6.37 1-Bit TG Based Multiplier (a) Schematic (b) Layout 
The simulation result for the 1-bit multiplier circuit is illustrated in Figure 6.38. 
Figure 6.38 1-Bit TG Based Multiplier Simulation Results 
6.2.5.2 2-Bit TG Based Multiplier Circuit 
The 2-bit multiplier block, TgMult2bit, is composed of a 2x2 array of TgMult circuits.
Peripheral multiplexers are used to enable communication with the neighbouring cells. The
block diagram for this circuit is shown in Figure 6.39.
Figure 6.39 2-Bit TG Based Multiplier Block Diagram 
The optimised layout of the circuit, with dimensions of 15.3µmx10.5µm, is illustrated
in Figure 6.40.
Figure 6.40 2-Bit TG Based Multiplier Layout 
6.2.5.3 4-Bit TG Based Multiplier Circuit
The 4-bit multiplier block, TgMult4bit, is composed of a 4x4 array of TgMult circuits
and peripheral multiplexers, the block diagram of which is illustrated in Figure 6.41. Larger
arrays will be constructed based on this circuit.
Figure 6.41 4-Bit TG Based Multiplier Block Diagram 
The optimised layout of the multiplier circuit, with dimensions of
26.6µmx19.0µm, is shown in Figure 6.42.
Figure 6.42 4-Bit TG Based Multiplier Layout 
6.2.5.4 8-Bit TG Based Multiplier Circuit 
The 8-bit multiplier block, TgMult8bit, is composed of a 2x2 array of TgMult4bit circuits.
The schematic and the optimised layout of the circuit are illustrated in Figure 6.43. The
design has dimensions of 53.0µmx39.2µm.
(a) (b) 
Figure 6.43 8-Bit TG Based Multiplier (a) Schematic (b) Layout 
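The quoted layout dimensions of the five unsigned 8-bit designs can be turned into areas for a side-by-side comparison. The sketch below only multiplies the dimensions stated in sections 6.2.1.4 to 6.2.5.4; the dictionary and function names are hypothetical, and the TG figure is computed here from its dimensions because the text quotes no area for it. The comparison is not strictly like for like, since some designs include peripheral multiplexers and others do not.

```python
# Layout dimensions in micrometres, copied from the 8-bit sections above.
LAYOUTS_8BIT = {
    "SymMult8bit": (51.1, 42.6),
    "MuxMult8bit": (50.3, 40.6),
    "Mux&Mult8bit": (70.7, 59.6),
    "PerMuxMult8bit": (63.5, 44.1),
    "TgMult8bit": (53.0, 39.2),
}


def layout_area(name):
    """Bounding-box area of a layout in square micrometres."""
    width, height = LAYOUTS_8BIT[name]
    return width * height


def relative_to(name, reference):
    """Area of one layout as a multiple of a reference layout."""
    return layout_area(name) / layout_area(reference)


for name in LAYOUTS_8BIT:
    print(f"{name:15s} {layout_area(name):7.0f} um^2  "
          f"{relative_to(name, 'SymMult8bit'):.2f} x SymMult8bit")
```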
6.3 Signed Multiplication 
Thus far the multipliers have been designed on the assumption that the inputs are unsigned
binary data. Signed multiplication takes into consideration the possibility that the inputs
may be either positive or negative numbers. Signed numbers can be represented in various
formats such as sign-magnitude or two's complement (also written as 2's complement). A
negative number -a is represented in 2's complement format as:
|a|' + 1
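As a small worked illustration of the |a|' + 1 rule, the sketch below inverts the magnitude within a fixed word length and adds one. The function name and the explicit word-length parameter are assumptions made for the example.

```python
def negative_pattern(magnitude, bits):
    """Bit pattern of -magnitude in a word of the given length.

    Implements |a|' + 1: invert every bit of the magnitude, then add 1.
    """
    mask = (1 << bits) - 1
    inverted = (magnitude & mask) ^ mask   # bitwise complement within the word
    return (inverted + 1) & mask


# -3 in a 4-bit word: 0011 -> 1100 -> 1101
print(format(negative_pattern(3, 4), "04b"))
```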
The multiplication algorithm for 2's complement integers is based on the same principle
used for unsigned numbers, with these differences:
• Theoretically, for unsigned numbers we assume each partial product to be padded
with leading 0's, whereas in the case of signed multiplication each partial product is
padded with copies of its MSB (leading 1's when the partial product is negative).
• The last partial product is obtained by multiplying b(n-1) by the 2's complement
representation of 'a', and finally representing the partial product with the MSB
padded by one bit (the partial products above it have to be padded accordingly).
• Hence the MSB of the result (P5 in the following example) indicates the sign of the
multiplication.
As an example, the multiplication algorithm for a 3x3 multiplication, with operands
a = a2 a1 a0 and b = b2 b1 b0, is as follows. Each output bit is the sum of the entries
in its row:
Output  b0 row        b1 row        b2 row
P5      a2b0 (pad)    a2b1 (pad)    ((a2)'+c1)b2 (pad)
P4      a2b0 (pad)    a2b1 (pad)    ((a2)'+c1)b2
P3      a2b0 (pad)    a2b1          ((a1)'+c0)b2
P2      a2b0          a1b1          ((a0)'+1)b2
P1      a1b0          a0b1
P0      a0b0
Here c0 and c1 are potential carries generated from the addition of 1 to the LSB of 'a'
during the conversion of 'a' to its 2's complement equivalent. Other carries generated
by the addition of the partial products are not shown in the above representation, but must
be propagated as outlined in the calculations of section 3.2.
In the following sections, for the purpose of comparison, multiplexer based and TG based
multipliers will be used to realise the signed multiplier arrays.
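The padded partial-product scheme of the 3x3 example can be modelled arithmetically. The sketch below is an illustration under stated assumptions rather than a description of the hardware: operands are Python integers in the n-bit 2's complement range, every partial product is sign-extended to the 2n-bit result width, and the row for b(n-1) uses the 2's complement of a. The function name is hypothetical.

```python
def signed_multiply(a, b, n):
    """Multiply two n-bit 2's complement integers into a 2n-bit result."""
    width = 2 * n
    mask = (1 << width) - 1
    a_ext = a & mask           # 'a' padded to 2n bits with copies of its sign bit
    neg_a_ext = (-a) & mask    # 2's complement of 'a', padded the same way
    acc = 0
    for i in range(n):         # one partial product per bit of b
        if (b >> i) & 1:
            # The last row (the sign bit of b) uses the negated operand.
            row = neg_a_ext if i == n - 1 else a_ext
            acc = (acc + (row << i)) & mask
    # Reinterpret the 2n-bit pattern as a signed value.
    return acc - (1 << width) if acc >> (width - 1) else acc


print(signed_multiply(-3, 2, 3))
print(signed_multiply(-4, -4, 3))
```

The most significant bit of the accumulated pattern plays the role of P5 in the example: it is set exactly when the product is negative.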
6.3.1 Signed Multiplexer Based Multipliers 
6.3.1.1 2-Bit 2's Complement Multiplexer Based Multiplier Circuit
The 2-bit 2's complement multiplier block, Mul2Comp2bit, is composed of a 3x2 array of
MuxMult circuits. The extra column of multipliers is necessary to cater for the final partial
product, where the MSB is padded by one bit. The output of this column indicates the sign
of the multiplication, as explained previously. Perimeter multiplexers are used within the
basic array for communication with adjacent arrays. As before, the basic array size
implements 4x4 multiplication (hence a 5x4 array size); however, 2x2 multiplication will
also be designed.
The last partial product is calculated by multiplying b1 with the 2's complement of a. This is
implemented by putting a series of multiplexers, Mux3, above the last row of the multiplier
array, and one multiplexer to the right of this row. The two inputs to Mux3 are a and a'.
The control signal cntri of these multiplexers chooses whether a (cntri = 0) or a' (cntri = 1)
is fed into the multiplier row below.
If the control signal of the right hand side multiplexer cntra = 0, then ci = co of the previous
stage, which implies that signed multiplication is not occurring within this array. However, if
cntra = 1, then ci = bN-1 (bN-1 is always 1 if the binary number b is negative), which means
that this array implements signed multiplication.
We need to bear in mind that this array may be a single array, or it may be one of many
arrays within a larger circuit. The functionality of the array configuration is as follows:
• If this array is the only component of the circuit, or is located in the bottom row of
a larger array, then the last row of multipliers must compute the last partial product, i.e.
the multiplication of b(n-1) by the 2's complement of a, hence cntri = 1.
• If this array is located anywhere else within the circuit, then the last row of multipliers
does not compute the last partial product, hence cntri = 0.
• Sis is the signed sum input coming from above cells.
The logical floor-plan of this architecture is displayed in Figure 6.44.
Figure 6.44 2-Bit 2's Complement Multiplexer Based Multiplier Block Diagram
The internal multiplexer Mux3 is designed similarly to Mux and Mux2. The
schematic and the optimised layout for the circuit Mux3 are illustrated in Figure 6.45. The
multiplexer layout has dimensions of 3.65µmx1.97µm.
(a) (b) 
Figure 6.45 Internal Multiplexer (a) Schematic (b) Layout 
The simulation result for the multiplexer circuit is shown in Figure 6.46. 
Figure 6.46 Internal Multiplexer Simulation Results 
Based on the block diagram of Figure 6.44, the schematic and the corresponding optimised
layout, having dimensions of 24.8µmx14.4µm, for the 2-bit 2's complement circuit are
shown in Figure 6.47.
(a) (b) 
Figure 6.47 2-Bit 2's Complement Multiplexer Based Multiplier (a) Schematic (b) 
Layout 
The simulation result for the multiplier circuit follows in Figure 6.48. 
Figure 6.48 2-Bit 2's Complement Multiplexer Based Multiplier Simulation
6.3.1.2 4-Bit 2's Complement Multiplexer Based Multiplier Circuit 
The 4-bit 2's complement multiplier block, Mul2Compt4bit, is composed of a 5x4 array of
Mult2Comp circuit, and peripheral as well as internal multiplexers. The logical floor-plan 
of this circuit is shown in Figure 6.49. Larger arrays will be constructed based on this 
architecture. 
Figure 6.49 4-Bit 2's Complement Multiplexer Based Multiplier Block Diagram
The optimised layout of the circuit having dimensions of 38.7µmx24.4µm is shown in 
Figure 6.50. 
Figure 6.50 4-Bit 2's Complement Multiplexer Based Multiplier Layout 
6.3.1.3 8-Bit 2's Complement Multiplexer Based Multiplier Circuit 
The 8-bit multiplier block, Mul2Compt8bit, is composed of a 2x2 array of Mul2Compt4bit
circuit. The schematic and the optimised layout of the circuit having dimensions of 
77.4µmx49.3µm with silicon coverage of 3816µm2 are illustrated in Figure 6.51. 
(a) (b) 
Figure 6.51 8-Bit 2's Complement Multiplexer Based Multiplier (a) Schematic (b) 
Layout 
6.3.2 Signed TG Based Multiplier Circuit 
Using the same algorithm as discussed in section 6.3.1, the equivalent signed multiplication 
is implemented using TG based multiplier components, and 2's complement representation. 
6.3.2.1 2-Bit 2's Complement TG Based Multiplier Circuit 
Figure 6.52 illustrates the block diagram for the circuit TgComp2bit, a 3x2 array of
TgMult multiplier circuits, together with perimeter and internal multiplexers.
Figure 6.52 2-Bit 2's Complement TG based Multiplier Block Diagram 
The schematic and the optimised layout of the circuit are shown in Figure 6.53. The
layout has dimensions of 21.0µmx13.2µm.
(a) (b) 
Figure 6.53 2-Bit 2's Complement TG Based Multiplier (a) Schematic (b) Layout 
The simulation result for the 2-bit signed multiplier circuit follows in Figure 6.54.
Figure 6.54 2-Bit 2's Complement TG Based Multiplier Simulation Results 
6.3.2.2 4-Bit 2's Complement TG Based Multiplier Circuit 
The 4-bit 2's complement TG based multiplier block, TgComp4bit, is composed of a 5x4
array of TgMult circuits, together with peripheral and internal multiplexers. The block diagram
of this circuit can be found in Figure 6.55. Larger arrays will be constructed based on this circuit.
Figure 6.55 4-Bit 2's Complement TG Based Multiplier Block Diagram 
The optimised layout of the multiplier circuit, with dimensions of 32.4µmx21.7µm, is
illustrated in Figure 6.56.
Figure 6.56 4-Bit 2's Complement TG Based Multiplier Layout
6.3.2.3 8-Bit 2's Complement TG Based Multiplier Circuit 
The 8-bit multiplier block, TgComp8bit, is composed of a 2x2 array of TgComp4bit circuit. 
The schematic and the optimised layout of the circuit are shown in Figure 6.57. The design 
layout has dimensions of 64.2µmx43.9µm, occupying a chip area of 2818µm2.
(a) (b) 
Figure 6.57 8-Bit 2's Complement TG Based Multiplier (a) Schematic (b) Layout 
6.3.3 An Alternative Algorithm for Implementing Signed Multiplication 
An alternative algorithm for obtaining signed multiplication is available [63]. The 
algorithm for computing multiplications for two 2's complement binary numbers is based 
on the following equation: 
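a                                             a3       a2       a1       a0
b                                             b3       b2       b1       b0
b0 row                                        (a3b0)'  a2b0     a1b0     a0b0
b1 row                               (a3b1)'  a2b1     a1b1     a0b1
b2 row                      (a3b2)'  a2b2     a1b2     a0b2
b3 row             a3b3     (a2b3)'  (a1b3)'  (a0b3)'
Constant  1        0        0        1        0        0        0        0
Output    P7       P6       P5       P4       P3       P2       P1       P0
equation 6.2
In the constant row, the 1 under P7 is at the MSB position of the output and the 1 under P4
is at the output's ((n/2)+1)th position.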
This algorithm is as follows:
• The MSB of each row of partial products except the last is inverted.
• In the last row of the partial products every bit except the MSB is inverted.
• A constant is added in the last row (shaded grey). The constant is a binary number
with value '1' for the bit aligned to the ((n/2)+1)th position of the output result, and
another '1' aligned to the MSB of the output result, where n is the number of output
bits. For example, for a 4x4 multiplication the output will be 8 bits, hence the ones are
located in the last and the (8/2)+1 = 5th position, i.e. the constant is 10010000.
Similarly, for an 8x8 multiplication the constant would be 1000000100000000, with
ones in the (16/2)+1 = 9th and the last positions.
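The three rules above can be checked numerically. The sketch below is an arithmetic model of equation 6.2 under stated assumptions, not a description of the circuit: here n denotes the operand width (so the output has 2n bits and the constant has ones at bit positions n and 2n-1, counting from 0), and the function names are hypothetical.

```python
def correction_constant(n):
    """Constant added for n-bit operands, as a binary string of 2n digits."""
    return format((1 << n) | (1 << (2 * n - 1)), "b")


def inverted_product_multiply(a, b, n):
    """Multiply two n-bit 2's complement integers using equation 6.2."""
    mask = (1 << (2 * n)) - 1
    acc = (1 << n) | (1 << (2 * n - 1))        # the added constant
    for i in range(n):                         # row index (bit of b)
        for j in range(n):                     # column index (bit of a)
            bit = ((a >> j) & 1) & ((b >> i) & 1)
            # Invert the MSB of every row except the last, and every bit
            # of the last row except its MSB.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            acc += bit << (i + j)
    acc &= mask
    # Reinterpret the 2n-bit pattern as a signed value.
    return acc - (1 << (2 * n)) if acc >> (2 * n - 1) else acc


print(correction_constant(4))
print(correction_constant(8))
print(inverted_product_multiply(-7, 5, 4))
```

No sign extension of the partial products is needed in this form, which is why the extra column of multipliers can be dropped.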
The following block diagram implements the algorithm proposed for a 4x4 multiplication. 
Figure 6.58 Improved Algorithm for Signed Multiplication 
The logical design of Figure 6.58 is implemented in the following manner:
• Multiplexers 0, 1 and 2 receive input from the cells above if present (cntrv = 1);
otherwise they output 0's (cntrv = 0).
• Mux3 first inverts its 'a' input if cntrv = 1, the reason for which will become apparent
shortly.
• The cells TgMult are 1-bit TG based multipliers similar to our multiplier cells of
section 6.71. In our example of equation 6.2, these calculate all normal partial products,
which are shaded blue.
• The cells TgNandMultMux are 1-bit TG based multiplier cells integrated with a NAND
gate and a multiplexer with control signal cntrh.
• If cntrh = 0 then the multiplier computes (axb)+si+ci (unsigned computation).
• If cntrh = 1 the cell computes (axb)'+si+ci for a signed calculation.
• The TgNandMultMux circuit is used for perimeter cells only. Hence in equation 6.2 it
computes the MSB of each partial product row, for all rows, and all partial products in
the last row (shaded yellow).
• To add the 1 of the constant at position (n/2)+1, Mux4 is used. If the control signals are
cntrv = 1 and cntrh = 0, then '1' needs to be added. This means that this cell is the top
left cell corresponding to bit (n/2)+1. The addition can be accomplished by negating co,
hence the output of Mux4 = co'; otherwise the output = co.
• Mux 5 to 11 act in the same way as the perimeter multiplexers used before. To add the
last '1' of the constant, Mux 12 incorporates a multiplexer and an inverter into one circuit,
since adding a one to a bit produces the same result as negating that bit. This is why
Mux3 inverts its input, as it comes from the output of Mux12 in the cell above.
This algorithm significantly reduces the chip area, as the additional column of multipliers
needed to represent the sign of the result is no longer required: a 4x4 multiplication needs
a 4x4 array of multipliers (16 cells) instead of the 5x4 array (20 cells) of the previous
architecture.
The designs in the following sections are implemented using TG based multipliers, as they
are the most area efficient compared to the symmetric based and the multiplexer based
algorithms. The following sections outline the design process for the components
necessary to implement the improved TG based signed multiplication.
Inverter cells are needed to realise this algorithm. The schematic and the optimised layout 
for the inverter circuit are illustrated in Figure 6.59. 
(a) (b) 
Figure 6.59 Inverter (a) Schematic (b) Layout 
The simulation result for the circuit follows in Figure 6.60. 
Figure 6.60 Inverter Simulation Results 
Also needed is the multiplexed half adder circuit, which consists of an inverter, an AND cell
and a multiplexer Mux4, as shown in Figure 6.61.
Figure 6.61 Multiplexed Half Adder Circuit Block Diagram 
The schematic and the optimised layout for the multiplexed half-adder circuit, having
dimensions of 8.4µmx1.9µm, are illustrated in Figure 6.62.
(a) (b) 
Figure 6.62 Multiplexed Half-Adder (a) Schematic (b) Layout 
The simulation result for the circuit is shown in Figure 6.63. 
Figure 6.63 Multiplexed Half-Adder Simulation Results 
6.3.3.1 Multiplier Circuit with NAND Capability 
The circuit TgNandMultMux is a 1-bit TG based multiplier (TgMult) integrated with an
AND/NAND gate as well as a multiplexer with control signal cntrh, shown in Figure 6.64.
If cntrh = 0 then the multiplier computes (axb)+si+ci. However, if cntrh = 1 the cell
computes (axb)'+si+ci. This circuit is used for perimeter cells within the basic array of each
configuration.
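The two operating modes of this cell can be expressed as a short behavioural model. The sketch below uses a hypothetical function name and models only the logic, not the TG implementation.

```python
def tg_nand_mult_mux(a, b, si, ci, cntrh):
    """1-bit multiplier cell with selectable AND/NAND partial product.

    cntrh = 0: computes (a AND b) + si + ci   (unsigned mode)
    cntrh = 1: computes (a AND b)' + si + ci  (signed mode)
    Returns the pair (so, co).
    """
    partial_product = a & b
    if cntrh:
        partial_product ^= 1       # NAND instead of AND
    total = partial_product + si + ci
    return total & 1, total >> 1


print(tg_nand_mult_mux(1, 1, 0, 0, 0))
print(tg_nand_mult_mux(1, 1, 0, 0, 1))
```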
Figure 6.64 Block Diagram of Multiplier Circuit with NAND Capability 
The schematic and the optimised layout for the circuit are illustrated in Figure 6.65. The 
design has dimensions of 6.84µmx4.21µm. 
Figure 6.65 Multiplier Circuit with NAND Capability (a) Schematic (b) Layout 
The simulation result for the 1-bit multiplier circuit can be found in Figure 6.66.
Figure 6.66 Multiplier Circuit with NAND Capability Simulation Results 
6.3.3.2 Improved 4-Bit 2's Complement Multiplier Circuit 
The improved 4-bit 2's complement TG based multiplier block, TgCompNand4bit, is
composed of a 3x3 array of 1-bit TG based multipliers (TgMult), a row and a column of
multiplier circuits with NAND capability (TgNandMultMux), and peripheral multiplexers,
as shown in Figure 6.67. Larger arrays will be constructed based on this circuit.
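The arrangement of a 3x3 AND core with a NAND-capable row and column resembles the modified Baugh-Wooley scheme for 2's complement array multiplication. The text does not name the algorithm in this section, so the following Python sketch is an assumption: perimeter partial products that involve exactly one sign bit are complemented, the corner product is left as an AND, and two correction constants are added. The function name is hypothetical.

```python
def signed_array_multiply(a: int, b: int, n: int = 4) -> int:
    """Multiply two n-bit 2's complement integers using AND/NAND partial products."""
    mask = (1 << n) - 1
    ua, ub = a & mask, b & mask                   # raw n-bit patterns
    abit = [(ua >> i) & 1 for i in range(n)]
    bbit = [(ub >> j) & 1 for j in range(n)]

    acc = 0
    for i in range(n):
        for j in range(n):
            pp = abit[i] & bbit[j]
            # Perimeter cells (exactly one operand bit is a sign bit) use NAND.
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            acc += pp << (i + j)
    acc += (1 << n) + (1 << (2 * n - 1))          # Baugh-Wooley correction constants
    acc &= (1 << (2 * n)) - 1                     # keep 2n product bits

    # Reinterpret the 2n-bit pattern as a signed value.
    return acc - (1 << (2 * n)) if acc >> (2 * n - 1) else acc


print(signed_array_multiply(-3, 5))
print(signed_array_multiply(-8, -8))
```

Under these assumptions the model agrees with ordinary integer multiplication for every pair of 4-bit signed operands, which is a quick way to check the NAND placement.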
[Block diagram labels: control signals cntrh and cntrv; inputs a3 to a0, b0 to b3 and ci0 to ci3; product outputs P0 to P6.]
Figure 6.67 Improved 4-Bit 2's Complement Multiplier Block Diagram 
The schematic and the optimised layout of the circuit are illustrated in Figure 6.68. The
layout has dimensions of 30.7µmx19.3µm.
Figure 6.68 Improved 4-Bit 2's Complement Multiplier (a) Schematic (b) Layout 
The simulation result for the circuit is shown in Figure 6.69. 
Figure 6.69 Improved 4-Bit 2's Complement Simulation Results 
6.3.3.3 Improved 8-Bit 2's Complement Multiplier Circuit 
The 8-bit multiplier block, TgCN8, is composed of a 2x2 array of TgCompNand4bit circuits.
The schematic and the optimised layout of the circuit are shown in Figure 6.70. The design
has dimensions of 62.1µmx39.4µm, and occupies a chip area of 2447µm².
Figure 6.70 Improved 8-Bit 2's Complement Multiplier (a) Schematic (b) Layout 
6.4 Observations and Comparison 
Table 6.5 shows a comparison between the silicon utilization of the basic 1-bit multiplier
structures for the three multipliers studied.
Table 6.5 Summary and Comparison of the Basic 1-Bit Multiplier Structures
Circuit                         Transistor count   Dimensions (µm)   Area (µm²)
Symmetric based multiplier      34                 6.4x4.8           30.72
Multiplexer based multiplier    34                 6.1x4.4           26.84
TG based multiplier             24                 5.72x3.97         22.3
From Table 6.5, it can be observed that the most area efficient circuit configuration for the
basic 1-bit multiplier, and the one with the fewest transistors, is the TG based multiplier,
followed by the multiplexer based and finally the symmetric based multipliers.
Table 6.6 outlines a summary and comparison between all the circuit configurations
designed for unsigned multiplication. These are arrays made up of the basic 1-bit
multipliers outlined in Table 6.5. The overhead associated with the peripheral circuitry
needed for cell communication results in increased chip area. Blank entries (marked -) imply that the
circuit in that specific word length is not realizable.
Table 6.6 Summary and Comparison of Area for Array of Unsigned Multiplier
Structures
Unsigned Multiplier Configuration                                    1-Bit     2-Bit     4-Bit     8-Bit
                                                                     Area      Area      Area      Area
                                                                     (µm²)     (µm²)     (µm²)     (µm²)
Symmetric based multiplier                                           30.7      131.6     547.4     2176.9
Multiplexer based multiplier                                         26.8      124.0     522.2     2042.2
Multiplexer based multiplier with periphery multiplexer per cell     59.9      259.5     1036.6    4213.7
Multiplexer based multiplier with periphery multiplexer per array    -         219.6     697.4     2800.4
TG based multiplier with periphery multiplexer per array             22.3      160.65    505.4     2077.6
Analysing Table 6.6 for the 8-bit configuration, the following observations can be made for
the unsigned multiplier designs:
• The most area efficient multiplier configuration with periphery multiplexers is the TG based
multiplier, with a reduction of approximately 723µm² compared to its multiplexer based
counterpart (2077.6µm² against 2800.4µm²).
• The multiplexer based multiplier with array periphery multiplexers is larger than the
symmetric based multiplier. The increase in area is due to the overhead penalty of the
peripheral interconnection circuitry.
• Interestingly, the TG based multiplier with periphery multiplexer per array is
smaller than the symmetric based multiplier by almost 100µm², even though
the latter has no periphery circuitry.
Table 6.7 highlights a summary of all the signed multiplier circuit configurations. As with 
the previous table, blank entries imply that the circuit in that specific word length is not 
implementable. 
Table 6.7 Summary and Comparison of Area for Signed Multiplication Using
0.13µm, 1.2V Process
Signed Multiplier Configuration                  1-Bit     2-Bit     4-Bit     8-Bit
                                                 Area      Area      Area      Area
                                                 (µm²)     (µm²)     (µm²)     (µm²)
Multiplexer based 2's Complement multiplier      -         357.12    944.3     3815.8
TG based 2's Complement multiplier               -         277.2     703.0     2818.38
Improved TG based 2's Complement multiplier      -         -         592.5     2446.74
From Table 6.7, it can be seen that:
• Using the original algorithm [62], the TG based 2's complement multiplier is more
area efficient than its multiplexer based counterpart.
• The improved TG based 2's complement multiplier [63] is the most area efficient
signed multiplier. In the 8-bit configuration it gives a reduction of approximately 1370µm²
compared to the multiplexer based 2's complement multiplier, and a reduction of approximately
370µm² compared to the TG based 2's complement multiplier.
Referring to the design of the serial/parallel multiplier in chapter 6, it is appropriate now to
make a comparison between the serial/parallel and parallel/parallel configurations, relative
to chip area. Table 6.8 outlines a comparison of area for the unsigned serial/parallel and
parallel/parallel multiplication using 0.13µm, 1.2V process.
Table 6.8 Summary and Comparison of Area for Unsigned Serial/Parallel and
Parallel/Parallel Multiplication Using 0.13µm, 1.2V Process
Multiplier Configuration   Circuit           Dimensions (µm)   Area (µm²)
Serial/Parallel            SM32bit           148.2x14.0        2074.8
Parallel/Parallel          SymMult8          51.1x42.6         2176.9
                           MuxMult8bit       50.3x40.6         2042.2
                           Mult&Mux8bit      70.7x59.6         4213.7
                           PerMuxMult8bit    63.5x44.1         2800.4
                           TgMult8bit        53.0x39.2         2077.6
From Table 6.8, it can be observed that:
• The 8-bit parallel/parallel TG based multiplier and the 8-bit serial/parallel multiplier
are almost equal in terms of area efficiency (2077.6µm² and 2074.8µm² respectively).
• The TG based configuration is faster than the serial/parallel multiplier because, for the
latter, its serial structure means that many clock cycles have to pass before a result is
produced.
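The area differences quoted in this section can be re-derived from the 8-bit columns of Tables 6.6 to 6.8. The short Python sketch below does only that arithmetic; the dictionary keys and the helper name are labels invented for the example, and the numbers are copied from the tables.

```python
# 8-bit areas in µm², copied from Tables 6.6, 6.7 and 6.8.
unsigned_8bit = {
    "symmetric": 2176.9,
    "mux": 2042.2,
    "mux_per_cell": 4213.7,
    "mux_per_array": 2800.4,
    "tg_per_array": 2077.6,
    "serial_parallel": 2074.8,
}
signed_8bit = {
    "mux_2c": 3815.8,
    "tg_2c": 2818.38,
    "tg_2c_improved": 2446.74,
}


def saving(areas: dict, larger: str, smaller: str) -> float:
    """Area saved (µm²) by choosing `smaller` instead of `larger`."""
    return round(areas[larger] - areas[smaller], 2)


print(saving(unsigned_8bit, "mux_per_array", "tg_per_array"))    # 722.8
print(saving(unsigned_8bit, "symmetric", "tg_per_array"))        # 99.3
print(saving(signed_8bit, "mux_2c", "tg_2c_improved"))           # 1369.06
print(saving(signed_8bit, "tg_2c", "tg_2c_improved"))            # 371.64
print(saving(unsigned_8bit, "tg_per_array", "serial_parallel"))  # 2.8
```

The last line shows why the TG based parallel/parallel design and the serial/parallel design are described as almost equal in area: they differ by under 3µm².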
6.5 Projection of Technology into 0.09µm (90nm) Process 
Projecting design dimensions identifies the future utilization of various technologies and
forms a roadmap that forecasts the advances and functionality of future process technologies.
To project a design dimension into another technology, the following formula can be used:
$T_2 = (w_2/w_1) \times T_1$ (equation 6.3)
where $T_1$ = current technology, $T_2$ = projected technology,
$w_1$ = current chip dimension/area, $w_2$ = projected chip dimension/area.
Rearranged for the projected value, $w_2 = (T_2/T_1) \times w_1$.
The design dimensions using the 0.13µm (130nm), 1.2V process technology may be
projected to the 0.09µm (90nm), 1V technology. Using equation 6.3, the new dimensions
are outlined in Table 6.9.
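Equation 6.3, solved for the projected value, amounts to a single scaling step. A minimal Python sketch follows; the function and parameter names are assumptions made for the example. The exact ratio 90/130 is about 0.692, whereas most of the projected areas listed in Table 6.9 are consistent with a factor of about 0.7 (for example 2176.9 x 0.7 = 1523.8), which suggests the tabulated values were computed with a rounded ratio.

```python
def project(w1: float, t1_nm: float = 130.0, t2_nm: float = 90.0) -> float:
    """Scale a dimension or area figure w1 from technology t1 to t2 (equation 6.3 rearranged)."""
    return w1 * t2_nm / t1_nm


exact = project(2176.9)      # 8x8 symmetric based multiplier, exact 90/130 ratio
rounded = 2176.9 * 0.7       # same entry with the ratio rounded to 0.7
print(round(exact, 1), round(rounded, 1))
```

The sketch applies the ratio linearly, as equation 6.3 does, to whichever quantity is supplied.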
Table 6.9 Estimated Area Projection for Multiplier Configurations, from 0.13µm
1.2V to 0.09µm 1V Process Technology
Configuration / Bit Length                  0.13µm Tech Area (µm²)   0.09µm Tech Area Projection (µm²)
(a) Symmetric based multiplier
    1x1                                     30.7                     21.5
    2x2                                     131.6                    91.7
    4x4                                     547.4                    382.9
    8x8                                     2176.9                   1523.8
(b) Multiplexer based multiplier
    1x1                                     26.8                     18.8
    2x2                                     124.0                    86.8
    4x4                                     522.2                    365.5
    8x8                                     2042.2                   1429.5
(c) Multiplexer based multiplier with periphery multiplexer per cell
    1x1                                     59.9                     41.9
    2x2                                     259.5                    181.6
    4x4                                     1036.6                   725.6
    8x8                                     4213.7                   2949.6
(d) Multiplexer based multiplier with periphery multiplexer per array
    1x1                                     -                        -
    2x2                                     219.6                    153.7
    4x4                                     697.4                    488.2
    8x8                                     2800.4                   1960.3
(e) TG based multiplier with periphery multiplexer per array
    1x1                                     22.3                     15.6
    2x2                                     160.65                   112.5
    4x4                                     505.4                    353.8
    8x8                                     2077.6                   1454.3
(f) Multiplexer based 2's complement multiplier
    1x1                                     -                        -
    2x2                                     357.12                   249.9
    4x4                                     944.3                    661.0
    8x8                                     3815.8                   2671.0
(g) TG based 2's complement multiplier
    1x1                                     -                        -
    2x2                                     277.2                    194.0
    4x4                                     703.0                    492.1
    8x8                                     2818.38                  1972.9
(h) Improved TG based 2's complement multiplier
    1x1                                     -                        -
    2x2                                     -                        -
    4x4                                     592.5                    414.8
    8x8                                     2446.74                  1712.7
(i) Serial/Parallel multiplier
    2x2                                     519.5                    363.6
    4x4                                     1037.3                   726.1
    8x8                                     2074.8                   1452.4
Figure 6.71 outlines the projection characteristics indicated in Table 6.9. Each graph,
(a) to (i), corresponds to the configuration entry with the same label in the table.
[Figure 6.71, panels (a) to (i): area (µm², axis from 0 to 5000) plotted against bit configuration (2, 4, 8) for each configuration in Table 6.9. Key: 0.13µm technology and 0.09µm technology.]
Figure 6.71 Projection Characteristics for 130nm to 90nm Process 
From Table 6.9 and Figure 6.71, projection from the 130nm 1.2V technology into the 90nm
1V technology will result in reduced area consumption and improved silicon utilisation.
6.6 Conclusions 
In this chapter, various implementations of the parallel/parallel multiplier configuration were
realised using the basic multiplier designs of chapters 3 and 5. Comparing the configurable 8-bit
designs, the most area efficient was the TG based multiplier. Despite its
overhead penalty, it was still smaller than the symmetric multiplier, which was
implemented only through hardwiring rather than configurability.
Chapter 7 
Implementation of Configurable ALU 
Architecture and Future Direction 
7.0 Summary 
In this chapter, a revised floor-plan for the ALU is proposed, with the expectation that 
future research in the area will facilitate implementation of the 3-D Soft Chip. Silicon chip 
area projections to 90nm technology are also made, implying that the physical realisation is 
an option, particularly with scaled technologies (TDSM). 
7.1 ALU Implementation 
The ALU proposed in Chapter 2 requires additional primitives such as configurable 
compare primitive, storage, as well as other types of control logic. 
7.1.1 Comparator
The comparator within the ALU, performs crucial functions and is necessary for correct 
processing of the system [62]. It consists of: 
• An adder-subtracter circuit
• An AND-OR circuit
Figure 7.1 illustrates the block diagram for a 4-bit adder-subtracter circuit.
[Block diagram: four FullAdder cells with inputs a3 b3 to a0 b0 and outputs s3 to s0, with carry input Cin.]
Figure 7.1 Adder-Subtracter Block Diagram
The configurable adder-subtracter circuit consists of full adder circuits, top multiplexers
which allow b or b' to be input into the full adder, and a right hand side multiplexer which
places the adder-subtracter in the addition or the subtraction mode. The functionality is as
follows:
• For the top multiplexers, if cntra = 0, then the output of the multiplexer is bi, hence
the cell acts as an adder.
• If cntra = 1, then the multiplexer output is bi', hence the cell is in subtraction mode.
This acts as the first step in converting a binary number into its 2's complement
representation, which is to invert the bits.
• If cntra = 1, i.e. the adder-subtracter is in subtraction mode, then the right hand
multiplexer outputs cntra, i.e. '1', into the full adder. This acts as the second step in
converting a binary number into its 2's complement representation, which is to add
1.
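The two-step 2's complement conversion described in the list above can be illustrated with a bit-level Python model. It is a sketch under one assumption not stated in the text: in addition mode the carry input is taken as 0. The function name is hypothetical.

```python
def add_sub(a: int, b: int, cntra: int, n: int = 4) -> tuple[int, int]:
    """Ripple n-bit adder-subtracter: a + b when cntra = 0, a - b when cntra = 1."""
    carry = cntra                      # right hand multiplexer feeds cntra as carry-in (assumed 0 when adding)
    result = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ cntra    # top multiplexer: passes bi or its inverse
        total = ai + bi + carry        # 1-bit full adder
        result |= (total & 1) << i
        carry = total >> 1
    return result, carry               # (n-bit result, carry out)


print(add_sub(5, 3, cntra=0))
print(add_sub(5, 3, cntra=1))
print(add_sub(3, 5, cntra=1))
```

In subtraction mode a carry out of 1 indicates that no borrow occurred (a is greater than or equal to b), while a carry out of 0 indicates that the 4-bit result has wrapped around.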
The 1-bit full adder circuit, FullAdd, consists of a TG based full adder and a top
multiplexer. The inputs to the full adder circuit are ai, bi and Ci, and the outputs are So and Co. Figure
7.2 illustrates the block diagram for this circuit.
[Block diagram: FullAdder cell with carry input Ci and carry output Co.]
Figure 7.2 Full Adder Block Diagram
The schematic and the optimised layout of the circuit are shown in Figure 7.3. The layout
has dimensions of 6.74µmx3.91µm.
Figure 7.3 Full Adder (a) Schematic (b) Layout 
The simulation result for the circuit follows in Figure 7.4. 
Figure 7.4 Full Adder Simulation Result
7.1.2 AND-OR Circuit 
The AND-OR block, AndOr, is composed of a 4 input OR circuit, the output of which is 
ANDed with another input Ci. The schematic and the optimised layout of the circuit are 
shown in Figure 7.5. The design layout dimensions are 4.26µmx3.9µm. 
Figure 7.5 AND-OR (a) Schematic, (b) Layout 
Figure 7.6 illustrates the simulation result for the circuit. 
Figure 7.6 AND-OR Simulation Result 
The 4-bit comparator circuit, Comparator, is composed of 4 FullAdd circuits, 4
multiplexers on top of the full adders, an AndOr circuit, and one multiplexer on the right
hand side of the circuit selecting the mode of operation. It takes two 4-bit numbers a and b as
inputs and compares them. If a < b then output = 1, else output = 0. The schematic and the
optimised layout of the circuit are shown in Figure 7.7. The design layout dimensions are
30.9µmx4.3µm.
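The text does not give the internal wiring of the comparator, so the following Python sketch shows only one reading that is consistent with the components listed: the adder-subtracter forms b - a in subtraction mode, and the AndOr block ANDs the carry out with the OR of the four difference bits. Unsigned operands, the operand order and both function names are assumptions.

```python
def and_or(x3: int, x2: int, x1: int, x0: int, ci: int) -> int:
    """AndOr block: 4-input OR whose output is ANDed with ci."""
    return (x3 | x2 | x1 | x0) & ci


def less_than(a: int, b: int) -> int:
    """Return 1 if a < b for unsigned 4-bit operands (assumed wiring)."""
    total = (b & 0xF) + ((a & 0xF) ^ 0xF) + 1     # b - a computed as b + a' + 1
    diff, carry = total & 0xF, total >> 4
    bits = [(diff >> i) & 1 for i in range(4)]
    # carry = 1 means b >= a; a non-zero difference rules out equality.
    return and_or(bits[3], bits[2], bits[1], bits[0], carry)


print(less_than(3, 9), less_than(9, 3), less_than(7, 7))
```

Under this reading the output is 1 only when the subtraction produces no borrow and the difference is non-zero, which matches the stated behaviour of output = 1 when a < b.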
Figure 7.7 4-Bit Comparator (a) Schematic (b) Layout 
The simulation result for the circuit can be found in Figure 7.8. 
Figure 7.8 4-Bit Comparator Simulation Result 
7.1.3 ALU Pixel Floor-Plan 
The original ALU floor-plan, suggested in chapter 2, is modified to include additional
elements such as configurable compare primitives and control logic, as well as adder-subtracter,
multiplier, and register components. The exact nature and functionality of the
control logic circuitry needs to be defined based on the ICS configuration and
specifications, identified as part of future research. Hence only an approximate estimate
is made for the required area of the design layout. Figure 7.9 illustrates the modified
ALU configuration with the expected dimensions of 40µmx40µm, using the 0.13µm
technology.
[Floor-plan blocks: 4-Bit Multiplier, Reg/Mem, 4-Bit +/-, 4-Bit Comparator, Control Logic; dimension labels 10µm, 30µm and 19µm.]
Figure 7.9 4-Bit ALU Pixel Floor-Plan (~40µmx40µm)
[Layout blocks: 4-Bit Multiplier, 4-Bit Add/Sub, Control Logic; 40µm on each side.]
Figure 7.10 4-Bit ALU Pixel (~40µmx40µm) Using 0.13µm Technology
Figure 7.10 illustrates the layout configuration for the proposed 4-bit ALU pixel. It consists
of a 4-bit TG based 2's complement multiplier, a 4-bit adder/subtracter, a 4-bit comparator
circuit, the control logic unit, and the register, which is integrated with the Register/Memory
(Reg/Mem) block. The ALU will occupy approximately 1600µm² of the chip area.
7.2 Future Direction and Implementation Strategy
The proposed strategy for implementation of the Sea-of-Pixel Elements (PE) for the 3-D
chip is illustrated in Figure 7.11.
[Floor-plan labels: internal bus, Indium bump, pad; dimension labels 40µm and 20µm.]
Figure 7.11 Proposed 3-D Soft Chip Floor-Plan
This shows the construct for a 4-bit bus, vertically connected to the top ICS chip through
the Indium bump interconnection, which would allow 4-bit data to be loaded into each PE
of the CAP chip. The physical dimensions of the Indium bump limit
the number of connection points per 40µmx40µm Pixel Element.
The effective silicon pitch needed for each PE cell is:
40µm (PE size) + 20µm (gap between 2 PEs) = 60µm
This means that within present Indium bump technology limitations, a 10mmx10mm chip
can support about 25,000 4-bit configurable ALUs (i.e. (10,000/60)² PEs). This includes
allocation of 1mm on two corresponding sides of the chip for interconnecting pads between
the 3-D Soft Chip and the Opto-VLSI beam processor, as shown in Figure 7.12.
Since the proposed Opto-VLSI beam processor is likely to have dimensions of
20mmx20mm, based on the above calculations approximately 100,000 4-bit ALUs can be
accommodated. Therefore significant processing capacity can be expected for multimedia
applications and for the coefficient computation required for hologram generation.
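The PE count estimate can be reproduced with a few lines of Python. The 60µm pitch comes from the text; how the 1mm pad allowance is subtracted is not specified, so the pad parameters below are assumptions, as are the function and parameter names. Whole PEs are counted with integer division.

```python
def pe_count(die_w_um: int, die_h_um: int, pe_um: int = 40, gap_um: int = 20,
             pad_w_um: int = 0, pad_h_um: int = 0) -> int:
    """Number of whole PEs that fit on a die at a pitch of pe_um + gap_um."""
    pitch = pe_um + gap_um                     # 60µm per PE in the text
    cols = (die_w_um - pad_w_um) // pitch      # pad allowance handling is an assumption
    rows = (die_h_um - pad_h_um) // pitch
    return cols * rows


print(pe_count(10_000, 10_000))                  # no pad allowance
print(pe_count(10_000, 10_000, pad_w_um=1_000))  # 1mm reserved along one dimension
print(pe_count(20_000, 20_000, pad_w_um=1_000))
```

With no allowance a 10mmx10mm die holds a little over 27,500 PEs; reserving 1mm along one dimension gives a figure close to the 25,000 quoted, and the 20mmx20mm case lands slightly above 100,000.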
[Figure labels: liquid crystal, glass cover, interconnection for data transfer between the 3-D Soft-Chip and the Opto-VLSI beam steering processor, aluminium package, Indium bump.]
Figure 7.12 Physical Implementation of 3-D Beam Steering Switch 
7.3 Projection of Technology 
The estimated 4-bit pixel dimensions using the 0.13µm (130nm), 1.2V process technology
may be projected to the 0.09µm (90nm), 1V technology. The new pixel dimension will be:
$w_2 = (90 \times 40)/130 \approx 28\,\mu\text{m}$
Similarly, the dimensions for the 4-bit pixel may be projected to a smaller technology. To
support a 10µmx10µm configurable 4-bit ALU, the enabling technology is:
$(w_2/w_1) \times 130\,\text{nm} = (10/40) \times 130\,\text{nm} = 32.5\,\text{nm} \Rightarrow 35\,\text{nm}$
According to the technology roadmap discussed in chapter 1, the 35nm process technology
is likely to be available around 2013, which implies that there is a new possibility for
implementation of the intelligent "Sea-of-Pixels" architecture in TDSM in the near future.
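The second calculation runs the projection in reverse: given a target pixel size, it finds the process node that would be needed. A minimal Python sketch, with hypothetical function and parameter names and the same linear scaling assumption as the text:

```python
def required_node_nm(target_um: float, current_um: float = 40.0,
                     current_node_nm: float = 130.0) -> float:
    """Process node needed to shrink a pixel from current_um to target_um (linear scaling)."""
    return target_um / current_um * current_node_nm


print(required_node_nm(10.0))   # 32.5
print(required_node_nm(20.0))   # 65.0
```

A 10µm pixel gives 32.5nm, which the text rounds to the 35nm node; the 20µm case is shown only as a further illustration of the same formula.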
7.4 Conclusions
This research program provided a very successful outcome, highlighting that it is possible 
to accommodate 100,000 processing elements each having dimensions of 40µmx40µm 
within a die size of 20mmx20mm. This is consistent with the projected dimension of the 
Opto-VLSI beam steering chips. This implies that the utilisation of the proposed arithmetic 
primitives can facilitate realisation of 3-D configurable array as part of intelligent optical 
beam steering systems. 
Further increase in the number of pixels/area has to wait for either: 
• a decrease in the size of indium bump, or
• beyond 2010-2013 when 50nm-35nm technology becomes available, or
• both of the above.
Detailed implementation of the bus construct, the control, the memory design and the compiler
remains an exciting frontier for future research.
Chapter 8 
Conclusion 
8.1 Overview 
Recent advances in technology have resulted in mobile multimedia,
networking, wired and wireless communications, and optical and photonics based
technologies that require complex broadband connectivity and processing capabilities for
the capture, conversion, compression, decompression, enhancement and display of high quality
multimedia content. These requirements place heavy demands on standard 2-D VLSI systems. Solutions
are needed to cater for increasing technological requirements such as the need for systems 
with compact size, high speed, low power consumption and reliability. Other challenges 
include realising low price, low design time and high volume production systems that can 
be controlled through flexible soft programmable architectures capable of supporting a 
number of different standards and roles. The continuous demand for an increasing number of
devices per chip has resulted in rapid scaling of process technologies from Small-Scale-Integration
(SSI) to Large-Scale-Integration (LSI), Very-Large-Scale-Integration (VLSI)
and finally Ultra-Large-Scale-Integration (ULSI) within the next phase of this evolution.
Meeting these technological challenges necessitates material and manufacturing process
improvements as well as new architectural and design approaches, since existing
technological implementations, including silicon CMOS and the emerging photonics
arena, are constantly being challenged.
A promising solution proposed in this research, to meet the demands imposed by various 
systems and networks for content-rich multimedia, telecommunication, advanced optical 
systems and networking requirements, is based on 3-D Soft-Chip Technology. This entails 
vertical integration of two ultra-thin 2-D low power circuits, namely "Configurable Array 
Processing" (CAP) circuit and "Intelligent Configurable Switch" (ICS), through fast 
vertical intelligent interconnects (Indium bump), and software mapping. It effectively 
manipulates hardware primitives through vertical integration of control and data and can be 
made highly flexible due to the programmable nature of the CAP chip, as well as highly 
efficient due to the fast Indium bump interconnects and the parallel configurable 
architecture. 
The realisation of the 3-D Soft-Chip Technology architecture is enabled through the emergence
of Truly Deep Submicron Technology, TDST (0.13µm and below), and the progress in
vertical connection technology, facilitating transformation from conventional 2-D 
architectures to 3-D systems to address numerous challenges encountered in 2-D CMOS. 
This implementation makes the system ideal for content-rich mobile multimedia type 
applications and calculation of coefficients for holographic beam steering. 
The ability to improve the performance capability of 3-D Soft-Chip system is of crucial 
importance for a successful system. Optimum performance of the system is primarily 
established by the efficiency of each individual component, as well as the network as a 
whole entity. Hence the choice of basic primitives, architectures and the interconnecting 
network is very important. 
The algorithms to be performed by the 3-D Soft-Chip ALU include various forms
of transforms and optimisation methods, and make use of arithmetic functions such as
addition, multiplication, accumulation and comparison. Important issues that need to be
addressed in designing the ALU include the choice of primitive adder and multiplier design
style, the choice of process technology used to implement the designs, configurability, area
optimisation, interconnect structures and compiler design.
This research is based on implementation of configurable arithmetic primitives such as 
adders and multipliers, suitable to the 3-D Soft-Chip ALU architecture. The goal of this 
research is to use a single primitive having low power dissipation and minimum chip area 
utilisation, which can be configured through simple replication to realise variable word 
lengths as part of the array processing.
8.2 Discussions 
For each implementation, the design cycle for implementing system components involves 
identifying and translating binary data and relative arithmetic operations into design 
algorithm behavioural description, followed by the architectural floor-plan based on this 
description, and finally realising the transistor implementation of the floor-plan. Physical 
implementation of each design involves realising each layout through physical merging, 
stacking and pitch matching of cells, to create custom made optimised layouts connected 
through uniform bus interconnects. The optimised layouts are realised using 0.13µm, 1.2V 
process technology, implemented on Mentor Graphics IC design environment, followed by 
extraction and simulation, to ensure correct functioning. 
Numerous choices are available for implementation of the adder and multiplier primitives. 
Their suitability is examined based on area utilisation and the ability for word length 
expansion (configurability), as part of the "Sea-of-Pixels" architecture for the 3-D chip. The 
choice of logic style used to implement the arithmetic primitives is also important as it 
affects the efficiency of the overall system. In summary: 
• Dynamic circuits use less chip area and are faster compared to static CMOS, but
their complexity increases with variations in logic style, leading to a lack of
configurability.
• Static CMOS is popular and produces reliable and widely accepted results.
• Complementary Pass Logic (CPL) and Transmission Gate (TG) logic are simple,
compact, fast, and compatible with static CMOS and can be used effectively.
The choice of adder design is also critical as these are a major component of the multiplier 
circuit. Suitable choice of adder design creates an efficient multiplier configuration and 
leads to a successful ALU design. In summary: 
• XOR and XNOR based adder designs are efficient, but can lead to large area
redundancy.
• SPL and SPL TV circuits result in chip area optimisation, however they are sensitive
to the threshold voltage lowering associated with scaled technology. Additional buffer
circuitry is needed to alleviate this problem, causing an increase in the overall
circuit power dissipation and chip area utilisation, hence they are unsuitable logic
styles for scaled technologies such as TDST.
• Multiplexer based adder architectures, are fast and have regular architectural
structure providing configurability and area optimisation.
Multiplication is one of the most complex computations within the ALU, hence a logic 
style that would facilitate effective implementation of multiplier primitives will greatly 
increase the efficiency of the ALU. Some algorithms available are: 
• Multiplier design through coefficient optimisation which is fast, has low power
dissipation due to coefficient clustering, disabling of redundant paths, deactivation
circuitry and bypass logic. However circuit elements run slower and additional
circuitry used within the cell increases chip area and software overhead as well as
reducing uniformity and configurability.
• Tree based multiplier architectures are relatively fast, however the structure of the
multiplier elements are not uniform, reducing configurability.
• The structure of the tree based multipliers are suited to full-custom layout with
minimised power dissipation, however the design is large, slow and irregular.
• The Braun multiplier architecture is relatively uniform, allowing configurability.
The area can be optimised through suitable choice of adder design, making the
Braun multiplier a potential choice for the multiplier primitive design.
• Array multipliers are also fast and regular facilitating configurability.
In this thesis two main multiplier configurations are presented. These are: 
• Configurable serial/parallel multiplier,
Serial configuration implies reduced silicon chip area, hence this architecture is
implemented using the Carry Save Adder architectural style and area efficient TG
based adders to create multiplier cells. The multiplier circuit designed is highly
configurable, and can perform 4/8/12/16/20/24/28/32-bit multiplications. It
consumes a small silicon area of 8314µm² using the 0.13µm, 1.2V process
technology.
The main drawback of the serial/parallel multiplier approach is that it consumes 4n
clock cycles (n = word length) before
results are made available. For large word lengths this can be a problem. However it
is a possible option for implementation in the 3-D Soft-Chip architecture.
• Configurable parallel/parallel multiplier,
This configuration involves using arrays of 1-bit multipliers designed using
symmetric based, multiplexer based and TG based architectures, (implemented for
comparison purposes), utilising parallel/parallel array structure. Circuits are
designed for signed (using 2's complement representation) and unsigned
multiplications. Design implementations using multiplexer based multipliers result
in considerable reduction in circuit area, compared to those with symmetric based
multipliers. Furthermore configurations designed with TG based multipliers are the
most area efficient of all other styles.
Establishing the number of bits to associate with the basic primitive, in a way that suits
word-length expansion, is very important. It gives the primitive components enough
flexibility to implement most combinations of
multiplications without compromising performance or substantially increasing overhead
and bit redundancy during complex computations. Thus the option of having a basic cell of
a 4x4 array appears to be the most suitable choice for building larger blocks. In this way, most
multiplication configurations can be implemented while maintaining the requirement for an
optimum area.
Comparing all the multiplier primitives designed in the course of this research we can 
conclude that: 
• For unsigned multiplication:
The most area efficient circuit configuration with the least number of transistors for
the basic 1-bit multiplier is the TG based multiplier. Comparing the 8-bit
configurations with periphery multiplexers, the most area efficient multiplier is also the
TG based multiplier, with an area reduction of approximately 723µm² compared to its
multiplexer based counterpart.
The 8-bit parallel/parallel TG based multiplier and the 8-bit serial/parallel multiplier are
almost equal in terms of area efficiency. However the TG based configuration is
faster than the serial/parallel multiplier.
• For signed multiplication:
The improved TG based 2's complement multiplier is the most area efficient signed
multiplier, with a reduction of approximately 1370µm² compared to the multiplexer
based 2's complement multiplier, and a reduction of approximately 370µm²
compared to the normal TG based 2's complement multiplier.
The proposed ALU floor-plan includes additional elements such as configurable compare
primitives, control logic, adder-subtracters, multipliers, and register components. The exact
nature and functionality of the control logic circuitry has to be defined as part of future
research, based on the ICS chip specifications. In this thesis, only an approximate
estimate of 40µmx40µm is made for the dimensions of the ALU layout
implemented using the 0.13µm, 1.2V technology. Further detailed implementation of
the internal and the external bus structure, the control circuitry, the memory design and the
compiler also needs to be conducted as part of future research.
8.3 Future directions 
Projecting design dimensions identifies how various technologies may be used in the future, 
creating a roadmap for the functionality of these process technologies. Such a projection 
can forecast potential silicon optimisation values, based on the technology to be used. 
The implementation of the proposed arithmetic primitives can facilitate the realisation of a 
3-D configurable array as part of intelligent optical beam steering systems, since it is 
possible to accommodate 100,000 processing elements, each having dimensions of 
40 µm x 40 µm, within a die size of 20 mm x 20 mm. This is significant because these 
dimensions are consistent with the projected specification of the Opto-VLSI beam steering 
chip. 
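The packing estimate above can be checked with a short calculation. The sketch below is
illustrative only: the function names are hypothetical, and it assumes an ideal square grid
with no area reserved for pads, routing channels or indium bump pitch.

```python
def max_elements(die_side_mm: int, element_side_um: int) -> int:
    """Upper bound on square elements that tile a square die (ideal grid assumed)."""
    die_side_um = die_side_mm * 1000
    per_side = die_side_um // element_side_um
    return per_side * per_side


def area_utilisation(count: int, die_side_mm: int, element_side_um: int) -> float:
    """Fraction of the die area occupied by `count` elements."""
    die_area_um2 = (die_side_mm * 1000) ** 2
    return count * element_side_um ** 2 / die_area_um2


if __name__ == "__main__":
    print(max_elements(20, 40))               # 250000
    print(area_utilisation(100_000, 20, 40))  # about 0.4
```

Under these assumptions, 100,000 processing elements occupy roughly 40% of the
20 mm x 20 mm die, leaving the remainder for interconnect and supporting circuitry.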
Implementing a higher density of pixels requires either further improvements in indium 
bump interconnect technology, where a decrease in the size of the indium bump allows 
more interconnections per pixel, or the use of smaller process technologies such as the 
50nm or the 35nm technologies. 
Detailed design and implementation of the JCS chip, the CAP chip, the bus configuration, 
the control circuit, the memory and the compiler, together with the related system 
fabrication processes, remain an interesting future research project. 
1 12 
Reference 
[ I ]  K. Eshraghian, N. Weste, "Principles of CMOS VLSI Design, A System
Perspective", Reading, MA: Addison 2°d edition, Addison Wesley, 1993.
[2] K. Eshraghian, "Arithmetic VLSI", personal notes.
[3] "The International Technology Roadmap for Semiconductors - 1999 Edition".
[4] "Semiconductor Industry Association (SIA), International Roadmap for
Semiconductors 2001 edition, TX: International SEMATECH, 2001 ",
http:public.itrs.net.
[5 ] "IEEE Circuits & Devices magazine", March 2002, pp 28-41. 
[6] J. Hutchby, G. Bourianoff, V. Zhrirnov, J. Brewer, "Extending the Road Beyond
CMOS", IEEE Circuits and Devices Magazine, March 2002, pp. 28-41.
[7] H. Nakatsuka, "The New Frontier Created by High-Bandwidth Digital Video
Systems and Services", ISSCC Digest of Technical Papers, pp. 16-19 February
1999.
[8] E. Suhir, "The Future of Microelectronics and Photonics, and the Role of
Mechanical, Materials and Reliability Engineering", Outline of a key-note talk,
MicroMat 2000, April 17-19, 2000, Berlin, Germany.
[9] S. Kuroyanagi, T. Nishi, K. Miyazaki, T. Chujo, A. Hakata, "Optical Path Cross­
Connect System for Large-Capacity and Highly Reliable Optical Transport
Networks", Infrastructure Summit (Networks and Systems), 2000.
[10] N. Collings, W.A. Crossland, P.J. Ayliffe, D.G. Vass, I. Underwood,
"Evolutionary Development of Advanced Liquid Crystal Spatial Light
Modulators", Applied Optics, Volume 28, number 22, November 1989.
[11 ]  W.A. Crossland, I. G. Manolis, ect. ,"Beam Steering Optical Switches Using
LCOS: The 'ROSES' Demonstrator", IEEE/LEOS Summer Topical Meeting, 
Electronics Enhanced Optics, Miami, FL, USA, July 2000, pp. 29-30. 
[12] W.A. crossland, T.D. Wilkinson, I.G. Manolis, M.M. Redmond, A.B. Davey,
"Telecommunications Application of LCOS Devices", Proceeding 9th 
International Topical Meeting on Optics of Liquid Crystals, OLC2001 , Napoli, 
Italy, October 2001. 
113 
[ 13] H. Fang, H.J. Pfleiderer, K. Eshraghian, "Phase-Only Hologram Design of the
Free-Space Optical Crossbar Switch System", Proceedings, The world
Multiconference of Systemics, Cybernetics and Informatics, SCI 2003, pp 196-
201, Orlando, Florida, July 27-30, 2003.
[14] S. Ahderom, M. Raisi,. K. Lo, K. Alameh, R. Mavaddat, "Applications of Liquid
Crystal Spatial Light Modulators in Optical Communications", Proceedings, IEEE
5th International Conference on High-speed Networks and Multimedia
Communications, HSNMC'02, pp. 239-242, Korea, July 2002.
[ 15] A.M. Rassau, T.C.B. Yu, S.W. Lachowicz, H.N. Cheung, K. Eshraghian, W.A.
Crossland, T.D. Wilkinson, "Wavelet Transform Architecture for the Smart
Pixel Mobile Multimedia Communicator", Proceedings, International
Conference on Computational Engineering and Systems Applications, Tunisia,
pp. 455-459, April 1998.
[ 16] P. Berthele, B. Fracasso, J.L. de Bougrenet de la Tocnaye, "Design and
Characterization of a Liquid-Crystal Spatial Light Modulator for a Polarization­
Insensitive Optical Switch", Journal of Applied Optics, Volume 37, 1998.
[ 17] H. Fang, "Technical Report - The Design of 8-Phase Dynamic Computer­
Generated-Holograms Using FLC for Optical Interconnect", Department of
Electronic Engineering, Ulm University, Ulm, Germany, 30 January 2003.
[18] H. Nguyen, S. Lachowicz, K. Eshraghian, "Adaptive Beam Control for
Intelligent Reconfigurable Holographic Optical Switch", 7th World
Multiconference on Systemics, Cybernetics and Informatics, SCI2003, Orlando,
Florida, U.S.A., pp. 202-207, July 27-30.
[ 19] S. Ahderom, M. Raisi, K.E. Alameh, K. Eshraghian, "Adaptive WDM Equalizer
using Opto-VLSI Beam Processing", IEEE Photonics Technological Letters,
volume 15, no 1 1, pp 1603-1605.
[20] S. Lachowicz, A. Rassau, G. Alagoda, K. Eshraghian, M.0. Lee, S.M. Lee,
"Image Capture Using Integrated 3D Soft-Chip Technology", Proceedings, 5th
International Conference on High-Speed Networks and Multimedia
Communications, HSNMC'02, pp. 19-23, July 3-5, 2002, Jeju Island, Korea.
(invited paper).
[2 1] S.M. Lee, S. Lachowicz, D. Lucas, A. Rassau, K. Eshraghian, M.M.O. Lee, K. 
Alameh, "A Novel Design of Beam Steering n-Phase OPTO-ULSI Processor for 
IIPS", 2nd IEEE International Workshop on Electronic Design, Test and 
Applications, Perth, Western Australia, January 28-30, 2004, pp. 395-399. 
1 14 
[22] M.P. Drunes, R.J. Dowling, P. Mckee, D. Wood, "Efficient Optical Elements to 
Generate Intensity Weighted Spot Arrays: Design and Fabrication", Applied 
Optics, vol 30, 1991, 2685-2691. 
[23] M.A. Seldowitz, J.P. Allebach, D. W. Sweeney, "Synthesis of Digital 
Holograms by Direct Binary Search", Applied Optics, vol 26, 1987, pp. 2788-
2798. 
[24] Y. Guoguang, "The Performance Analysis of the Genetic Algorithm for the 
Optimum Design of Diffractive Optical Elements and its Comparison to the 
Simulated Annealing", Applied Optics, vol 111, 2000, pp. 133-137. 
[25] M. Weiser, "The Computer of the 2181 Century", IEEE Pervasive Computing, 
Volume 1, number 1, pp 18-25, January 2002. 
[26] G. Lafruit et al., "3D Computational Graceful Degradation," Proceedings of 
ISCAS - Workshop and Exhibition on MPEG-4, pp. III-547 to III-550, May 
2000. 
[27] V.O.K. Li, X. Qiu, "Personal Communication Systems (PCS)", Proceedings of 
IEEE, Volume 83, number 9, pp. 1210-1243, September 1995. 
[28] J.A. Hutchby, et al., "Extending the Road Beyond CMOS", IEEE Circuits & 
Devices Magazine, March 2002. 
[29] S. Eshraghian, S. Lachowicz, K. Eshraghian, "3-D Vertically Integrated 
Configurable Soft-Chip with Terabit Computational Bandwidth for Image and Data Processing", Proceedings, 10th International Conference Mixed Design of 
Integrated Circuits and Systems MIXDES'2003, Lodz, Poland, June 26-28, 
2003, pp. 143-148. 
[30] S.J. Banerjee, M. Souri, et al., "3-D ICs: A Novel Chip Design for Improving 
Deep-Submicron Interconnect Performance and System-on-Chip Integration," 
Proceedings, IEEE, volume 89, pp. 602-633, 2001. 
[31] J.A. Davis, R. Venkatasan, et al., "Interconnect Limits on Gigascale Integration 
(GSI) in the 21st Century," Proceedings IEEE, volume 89, pp 305-324, 2001. 
[32] A. Rassau, K. Eshraghian , S. Eshraghian S. Lachowicz, A. Ehrhardt, Y. 
Nemirovsky, R. Ginosar R., "3-D Soft-Chip Configurable Array Processor for 
Multimedia and Communications", Proceedings, 5th International Conference 
on High-Speed Networks and Multimedia Communications HSNMC'02 July 3-
5, 2002, Jeju Island, Korea. (keynote paper), pp. 17. 
[33] A.M. Rassau, G. Alagoda, D. Lucas, J. Austin-Crowe, K. Eshraghian, 
"Massively parallel Intelligent Pixel Implementation of a Zerotree Entropy 
Video Codec for Multimedia Communications", Proceedings of the IFIP 
International Conference on VLSI, Lisbon, Portugal, December 1999. 
1 15 
[34] A. Rassau, G. Alagoda, A. Ehrhardt, S. Lachowicz, and K. Eshraghian, "Design
Methodology for a 3D SoftChip Video Processing Architecture". 6th World
Multiconference on Systemics, Cybernetics and Informatics (SCI2002), Orlando,
Florida, U.S.A., July 14- 17, 2002.pp. 324-329.
[35] S. Eshraghian, S. Lachowicz, and K. Eshraghian, "Ultra High Bandwidth Image
and Data Processing using 3-D Vertically Integrated Architectures", 7th World
Multiconference on Systemics, Cybernetics and Informatics (SCI2003),
Orlando, Florida, U.S.A., July 27-30, pp. 189-195.
[36] S.F. Al-Sarawi, D. Abbott, P.D. Franzon, "A Review of 3-D Packaging
Technology," IEEE Transactions on Computers, Packaging, and Manufacturing
Technology - Part B, volume 2 1, number 1, pp. 2- 14, Feb. 1998.
[37] M. Bschorr, R. Geisler, R. Hacker, M. Kicherer, T. Kumpf, C. Layer, W.
Schlecker, "Truly Configurable ALU Array' ,  Technical Report Working Group
- Principles of Optical Networking and Design Lecture Series, University of
Ulm, Germany, July 2002.
[38] K. Eshraghian,S. Lachowicz, S. Eshraghian, A. Osseiran, "Networked
Teletesting Facility for Integrated Systems", Semiconductor Equipment and
Materials International Conference, SEMICON'2003, Singapore, August 12-14,
accepted for publication, (conference deferred due to SARS outbreak).
[39] S. Lachowicz, , G. Alagoda, A. Rassau, K. Eshraghian H.J. Pfleiderer, "Image
Capture for an Integrated 3-D Soft-Chip Processor". 6th World Multiconference
on Systemics, Cybernetics and Informatics (SCI2002), Orlando, Florida, U.S.A.,
July 14-17, 2002.pp. 319-323.
[ 40] K. Eshraghian, S. Lachowicz, and S. Eshraghian, "The Networked Tele-Test
Facility for Integrated Systems in Australia". Proceedings, 9th International
Conference Mixed Design of Integrated Circuits and Systems MIXDES'2002,
Wroclaw, Poland, June 20-22, 2002.pp. 689-694.
[4 1] K. Eshraghian, S. Lachowicz, and S. Eshraghian, "Australian National 
Networked Tele-Test Facility for Integrated Systems". Electronics and 
Structures for MEMS II, Neil W. Bergmann, Editor, Proceedings of SPIE 
Volume 459 1, pp. 22-27. 
[42] K. Eshraghian, S. Lachowicz, , G. Alagoda, K. Ang, "Architectural Mappings
for Multimedia Smart-Pixel Arrays", Proceedings, IEEE International Workshop
on Design, Test and Applications, Dubrovnik, Croatia, June 10-12, 1998, pp. 33-
35.
[43] S. Dunlop, "Computer Organization and Design, The Hardware/Software
Interface", Patterson and Hessessy Publishers, USA, 1999.
1 16 
[44] A. Rothermel, "Integrated Systems", Lecture notes, Department of Electronics
Engineering, Ulm University, Ulm, Germany, Chapters 3.
[45] J. Rabaey, M. Pedram, "Low Power Design Methodologies", Kluwer Academic
Publishers, 1996, USA.
[46] H.R. Srinivas, K.K. Parhi, "A Fast VLSI Adder Architecture", IEEE Journal of
Solid State Circuits, volume 7, number 5, pp 761-767, May 1992.
[47] M. Smith, "Application-Specific Integrated Circuits", Addison Wesley
Longman, Inc. Addison-Wesley, VLSI Design Series, 1997.
[48] G. Yang, S.O. Jung, S.H. Kim, S.M. Kang, "A Low-Power 2. 1GHz 32-bit Carry
Look-Ahead Adder Using Dual Path All-N-Logic," submitted to International
Symposium on Low Power Electronics and Design, 2002.
[49] S. Hong, S. Kim, M.C. Papaefthymiou, W.E. Stark. "Low Power Parallel
Multiplier Design for DSP Applications Through Coefficient Optimization",
12th IEEE International ASIC/SOC Conference, September 1999.
[50] A.D. Booth, "A Signed Binary Multiplication Technique", Quarterly Journal on
Mechanics and Applied Mathematics, volume 4, pp. 236-240, 195 1.
[5 1] "Configurable ALU', Technical Report, Department of Microelectronics, Ulm 
University, Germany, January 2003. 
[52] G .. M. Blair, "Designing Low-Power Digital CMOS", IEE Electronics &
Communication Engineering Journal , volume 6, number 5, pp. 229-236, Oct
1994.
[53] R. Zimmermann, W. Fichtner, " Low-Power Logic Styles: CMOS Verses Pass­
Transistor Logic," IEEE J. of Solid-State Circuits, volume 32, number 7, pp.
1079- 1090, July 1997.
[54] C. Kim, S.O. Jung, K.H. Baek S.M. Kang, "Parallel Dynamic Logic with Speed­
enhanced Skewed Static Logic," International Symposium on Circuits and
Systems, pp.756-759, May. 2000.
[55] Text for ISO/IEC FDIS 14 496- 1 Systems, ISO/IEC JTC 1/SC29/WG1 1  N2501,
Nov. 1998.
[56] L. Seed, "SPL Circuit Techniques I Results for Low Power SPL Investigations",
Lecture Slides, University of Sheffield, Electronic Systems Group, May200 1.
[57] S.O. Jung, K.W. Kim, S.M. Kang "Low Swing Voltage Clock Domino Logic
Incorporating Dual Supply and Dual Threshold Voltages", Proceedings, Design
Automation Conference, 2002.
1 17 
[58] S.O. Jung, K.W. Kim, S.M. Kang "Timing Analysis for Skew-Tolerant High­
Speed Domino Logic Incorporating Dual-Keeper Structure", Proceedings, 
Computer Society International Symposium on VLSI, 2002. 
[59] S.O. Jung, K.W. Kim, and S.M. Kang "Dual Threshold Voltage Domino Logic 
Synthesis for High Performance with Delay and Power Constraint" Design, Automation and Test in Europe, pp.260-265, 2002. 
[60] S.M. Yoo, S.M. Kang, "Improved Domino Structures Effective for High 
Performance Design," Electronics Letters, volume35, number5, pp367-368, March 1999. 
[61] W. Nebel, J. Mermet, " Low Power Design in Deep Submicron Electronics", Kluwer Academic Publishers, 1996, USA. 
[62] P. Chandrakasan, S. Sheng, R. W. Brodersen, "Low-Power CMOS Digital 
Design" IEEE Journal of Solid State Circuits Volume 27, number 4, pp. 4 73-483. 
[63] J. Wang, S. Fang, W. Feng, "New Efficient Designs for XOR and XNOR Functions on the Transistor Level", IEEE Journal of Solid State Circuits Volume 
29, number 7, pp. 780-786. 
[64] H.T. Bui, A.K. Al-Sheraidah, Y. Wang., "New 4-Transistor XOR and XNOR 
Designs", Proceedings of 2nd IEEE Asia Pacific Conference on ASICs, Cheju Island, Korea, pp. 25-28, Aug 2000. 
[65] H.T. Bui, A.K. Al-Sheraidah, Y. Wang, "Design and Analysis of IO-transistor Full Adders Using Novel XOR-XNOR Gates", Technical Report. Florida 
Atlantic University, October 1999. 
[66] B.A. Alhalabi, A. Al-Sheraidah, "A Novel Low Power Multiplexer-Based Full 
Adder Cell", The 8th IEEE International Conference on Electronic Circuits and 
Systems 2001, September 2-5 2001 ,  Malta. 
[67] B.A. Alhalabi, A. Al-Sheraidah, H.T. Bui, "Five New High Performance 
Multiplexer-Based I-Bit Full Adder Cells", The 8th IEEE International 
Conference on Electronic Circuits and Systems 2001, September 2-5 2001, 
Malta. 
[68] S.O Jung and S.M. Kang, "Modular Charge Recycling Pass Transistor Logic 
(MCRPL)", Electronics Letters, volume 36, number 5, pp.404-405, 2000. 
[69] S.M. Yoo, S.M. Kang, "No-race Charge-Recycling Differential Logic(NCDL)" 
pp.202-205, IEEE 9th Great Lakes Symposium on VLSI, March 1999. 
118 
[70] S.M. Yoo and S.M. Kang, "CMOS Pass-Gate No-race Charge-Recycling
Logic(CPNCL)," International Symposium on Circuits and Systems, May 1999.
[7 1] U. Ko, P. Balasara, W. Lee, "Low-Power Design Techniques for High­
Performance CMOS Adders" IEEE Transactions on VLSI Systems, volume 3, 
number 2, pp.327-333, June 1995 . 
[72] D. Radhakrishnan, "Low-Voltage Low-Power CMOS Full Adder", IEE
Proceedings on Circuits, Devices and Systems, volume 148, number 1 ,  pp. 19-
24, February 2001.
[73] A.A. Fayed, M .. A. Bayoumi, "A Low Power 10- Transistor Full Adder Cell for
.Embedded Architectures," in Proceedings IEEE International Symposium on
Circuits and Systems, volume 4, pp. 226-229, Sydney, Australia, May 2001.
[74] H.T. Bui, Y. Wang, Y. Jiang, "Design and Analysis of Low-Power I O-Transistor
Full Adders using Novel XOR-XNOR Gates," IEEE Transactions on Circuits
and Systems-II: Analog and Digital Signal Processing, volume 49, number 1, pp.
25-30, January 2002.
[75] H. Zhuang, H. Hu, " A New Design of the CMOS Full Adder," IEEE J. of Solid­
State Circuits, volume 27, number 5, pp. 840-844, May 1992.
[76] A. Bermak D. Martinez J.L. Noullet, "High Density 16/8/4-bit Configurable
Multiplier", Proceedings, IEE Circuits Devices and Systems, volume 144,
number 5, pp 272-276, 1997.
[77] A. Hwang "Computer Arithmetic: Principles, Architecture and Design", John
Wiley & Sons, Inc., Publishers, New York, N. Y. 1979.
[78] I. Koren, "Computer Arithmetic Algorithms", 2nd Edition, A. K. Peters, Natick,
MA, 2002.
[79] A. Avizienis, "Signed-Digit Number Representations for Fast Parallel
Arithmetic", IRE Transactions on Electronic Computers, volume EC-10, pp.
389-400, September 1961.
[80] G .. M. Blair, "The Equivalence of Twos-Complement Addition and the
Conversion of Redundant-Binary to Twos-Complement Numbers", IEEE
Transactions Circuits & Systems I, 1998.
[8 1] C.E. Kozirakis et al. "Scalable Processors in the Billion Transistors Era: IRAM",
IEEE Computers Sept 1997, pp 75 -78. 
1 19 
Appendix A 
Scalable Parallel/Parallel Multiplier 
Schematics and Simulation Results 
A.0 Summary
The following sections contain the schematic diagrams and the simulation results for all the 
designs implemented in Chapter 6. 
A.1 Unsigned Multipliers
This class of multipliers assumes that the binary inputs to the system are unsigned values. 
A.1.1 Unsigned Symmetric Based Multipliers
A.1.1.1 1-Bit Symmetric Based Multiplier Circuit
The 1-bit symmetric based multiplier consists of an AND circuit and a full adder circuit. 
The schematic for the AND circuit is illustrated in Figure A.1. 
Figure A.1 AND Circuit Schematic 
The schematic for the full adder circuit is shown in Figure A.2. 
Figure A.2 Symmetric Full Adder Circuit Schematic 
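The behaviour of a 1-bit cell built from an AND gate and a full adder, and of an array of
such cells, can be sketched as follows. This is a behavioural model only: the function
names and the simple row-by-row carry ripple are assumptions made for illustration and do
not reproduce the transistor-level schematics of this appendix.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Return (sum, carry_out) for three input bits."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out


def multiplier_cell(x: int, y: int, sum_in: int, carry_in: int) -> tuple[int, int]:
    """1-bit cell: AND the operand bits, then add the result to the incoming sum."""
    return full_adder(x & y, sum_in, carry_in)


def array_multiply(a: int, b: int, n: int) -> int:
    """Unsigned n x n multiplication using n*n cells (assumed ripple wiring)."""
    acc = [0] * (2 * n)  # running product bits, least significant first
    for j in range(n):   # one row of cells per multiplier bit
        carry = 0
        for i in range(n):
            acc[i + j], carry = multiplier_cell(
                (a >> i) & 1, (b >> j) & 1, acc[i + j], carry
            )
        acc[j + n] = carry  # the row carry lands in a bit not yet written
    return sum(bit << k for k, bit in enumerate(acc))


if __name__ == "__main__":
    print(array_multiply(13, 11, 4) == 13 * 11)
```

Larger arrays follow by increasing n, which mirrors the 2-bit, 4-bit and 8-bit
configurations presented in the following sections.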
The schematic and the simulation results for the 1-bit symmetric based multiplier are 
illustrated in Figure A.3. 
(b) 
Figure A.3 1-Bit Symmetric Based Multiplier (a) Schematic (b) Simulation 
A.1.1.2 2-Bit Symmetric Based Multiplier Circuit
The schematic and the simulation results for the 2-bit symmetric based multiplier are 
illustrated in Figure A.4. 
(a) (b) 
Figure A.4 2-Bit Symmetric Based Multiplier (a) Schematic (b) Simulation 
A.1.1.3 4-Bit Symmetric Based Multiplier Circuit
The schematic diagram and a sample of the simulation results for the 4-bit symmetric based 
multiplier design are illustrated in Figure A.5. 
(a) (b) 
Figure A.5 4-Bit Symmetric Based Multiplier (a) Schematic (b) Simulation 
A.1.1.4 8-Bit Symmetric Based Multiplier Circuit
The schematic and a sample of the simulation results for the 8-bit symmetric based 
multiplier circuit are shown in Figure A.6. 
(a) (b) 
Figure A.6 8-Bit Symmetric Based Multiplier (a) Schematic (b) Simulation 
A.1.2 Unsigned Multiplexer Based Multipliers
A.1.2.1 1-Bit Multiplexer Based Multiplier Circuit
The schematic and the simulation results for the 1-bit multiplexer multiplier circuit are 
illustrated in Figure A.7. 
(a) (b) 
Figure A.7 Multiplexer based 1-Bit multiplier (a) Schematic (b) Simulation 
A.1.2.2 2-Bit Multiplexer Based Multiplier Circuit
The schematic and the simulation results for this circuit are illustrated in Figure A.8. 
(a) (b) 
Figure A.8 2-Bit Multiplexer based Multiplier (a) Schematic (b) Simulation 
A.1.2.3 4-Bit Multiplexer Based Multiplier Circuit
The schematic and a sample of the simulation results for the 4-bit multiplexer based 
multiplier are illustrated in Figure A.9. 
(a) (b) 
Figure A.9 4-Bit Multiplexer Based Multiplier (a) Schematic (b) Simulation 
A.1.2.4 8-Bit Multiplexer Based Multiplier Circuit
The schematic and a sample of the simulation results for this circuit are shown in Figure 
A.10.
(a) (b) 
Figure A.10 8-Bit Multiplexer Based Multiplier (a) Schematic (b) Simulation 
A.1.3 Unsigned Multiplexer Based Multiplier Design with Periphery Multiplexers Per Cell
This category of design involves placing periphery multiplexers around each 1-bit 
multiplier, allowing communication between multiplier cells. Each cell requires one 
vertical and two horizontal periphery multiplexers. The multipliers are all multiplexer 
based. 
A.1.3.1 Vertical Multiplexer Circuit
The schematic and the simulation results for the vertical multiplexer circuit are illustrated 
in Figure A.11. 
(a) (b) 
Figure A.11 Vertical Multiplexer (a) Schematic (b) Simulation 
A.1.3.2 Horizontal Multiplexer Circuit
The schematic and the simulation results for the horizontal multiplexer circuit, Mux2, are 
illustrated in Figure A.12. 
(a) (b) 
Figure A.12 Horizontal Multiplexer (a) Schematic (b) Simulation 
A.1.3.3 1-Bit Multiplier with Periphery Multiplexers Per Bit
The schematic and the simulation results of the 1-bit multiplier with periphery multiplexers 
per bit, are illustrated in Figure A.13. 
(a) (b) 
Figure A.13 1- Bit Multiplier with Periphery Multiplexers Per Bit (a) Schematic (b) 
Simulation 
A.1.3.4 2-Bit Multiplier with Periphery Multiplexers Per Bit
The schematic and the simulation results for this circuit are illustrated in Figure A.14. 
(a) (b) 
Figure A.14 2-Bit Multiplier with Periphery Multiplexers Per Bit (a) Schematic (b) 
Simulation 
A.1.3.5 4-Bit Multiplier with Periphery Multiplexers Per Bit
The schematic and a sample of the simulation results for this design are illustrated in Figure 
A.15.
(a) (b) 
Figure A.15 4-Bit Multiplier with Periphery Multiplexers Per Bit (a) Schematic (b) 
Simulation 
A.1.3.6 8-Bit Multiplier with Periphery Multiplexers Per Bit
The schematic and a sample of the simulation results for the 8-bit multiplier with periphery 
multiplexers per bit are shown in Figure A.16. 
(a) (b) 
Figure A.16 8-Bit Multiplier with Periphery Multiplexers Per Bit (a) Schematic (b) 
Simulation 
A.1.4 Unsigned Multiplexer Based Multiplier Design with Periphery Multiplexers Per Array
A.1.4.1 2-Bit Multiplier Circuit with Peripheral Multiplexers Per Array
The schematic and the simulation results for the 2-bit multiplier with peripheral 
multiplexers per array, are illustrated in Figure A.17. 
(a) (b) 
Figure A.17 2-Bit Multiplier with Peripheral Multiplexers Per Array (a) Schematic 
(b) Simulation
A.1.4.2 4-Bit Multiplier Circuit with Peripheral Multiplexers Per Array
The schematic and a sample of the simulation results for this circuit are illustrated in Figure 
A.18.
(a) (b) 
Figure A.18 4-Bit Multiplier Circuit with Peripheral Multiplexers Per Array 
(a) Schematic (b) Simulation
A.1.4.3 8-Bit Multiplier Circuit with Peripheral Multiplexers Per 4-Bit Array
The schematic of the circuit is illustrated in Figure A.19. 
Figure A.19 8-Bit Multiplier Circuit with Peripheral Multiplexers Per 4-Bit Array 
Schematic 
A.1.5 Unsigned Transmission-Gate (TG) Based Multiplier Circuit
All TG based multiplier arrays in the following sections implement the peripheral 
multiplexer per array structure. 
A.1.5.1 1-Bit TG Based Multiplier Circuit
The schematic and the simulation results for the 1-bit TG based multiplier circuit are 
illustrated in Figure A.20. 
(a) (b) 
Figure A.20 1-Bit TG Based Multiplier (a) Schematic (b) Simulation 
A.1.5.2 2-Bit TG based Multiplier Circuit
The schematic and the simulation results for the 2-bit TG based multiplier are illustrated in 
Figure A.21. 
(a) (b) 
Figure A.21 2-Bit TG Based Multiplier (a) Schematic (b) Simulation 
A.1.5.3 4-Bit TG based Multiplier Circuit
The schematic and the simulation results for this circuit are illustrated in Figure A.22. 
(a) (b) 
Figure A.22 4-Bit TG Based Multiplier (a) Schematic (b) Simulation 
A.1.5.4 8-Bit TG Based Multiplier Circuit
The schematic and a sample of the simulation results of the 8-bit TG based multiplier 
circuit are illustrated in Figure A.23. 
(a) (b) 
Figure A.23 8-Bit TG Based Multiplier (a) Schematic (b) Simulation 
A.2 Signed Multiplication
Signed multiplication algorithms assume that the inputs are positive or negative binary 
numbers. For the purpose of comparison, the signed multiplication configurations will be 
implemented using multiplexer based and TG based multipliers. 
A.2.1 Signed Multiplexer Based Multipliers
A.2.1.1 2-Bit 2's Complement Multiplier Circuit
The algorithm for computing 2's complement multiplication involves adding internal 
multiplexers to compute the signed representation. All multiplier cells are multiplexer 
based. 
The schematic and the simulation results for the internal multiplexer circuit are illustrated 
in Figure A.24. 
(a) (b) 
Figure A.24 Internal Multiplexer (a) Schematic (b) Simulation 
The schematic and the simulation results for the 2-bit 2's complement multiplier circuit are 
illustrated in Figure A.25. 
(a) (b) 
Figure A.25 2-Bit 2's Complement Multiplier (a) Schematic (b) Simulation 
A.2.1.2 4-Bit 2's Complement Multiplier Circuit
The schematic and a sample of the simulation results for this circuit are illustrated in Figure 
A.26.
(a) (b) 
Figure A.26 4-Bit 2's Complement Multiplier (a) Schematic (b) Simulation 
A.2.1.3 8-Bit 2's Complement Multiplier Circuit
The schematic and a sample of the simulation results of the circuit are illustrated in Figure 
A.27.
(a) (b) 
Figure A.27 8-Bit 2's Complement Multiplier (a) Schematic (b) Simulation 
A.2.2 Signed TG Based Multipliers
A.2.2.1 2-Bit TG based 2's Complement Multiplier Circuit
The schematic and the simulation results of the 2-bit 2's complement TG based circuit are 
illustrated in Figure A.28. 
(a) (b) 
Figure A.28 2-Bit 2's Complement TG Based Multiplier (a) Schematic (b) 
Simulation 
A.2.2.2 TG based 4-Bit 2's Complement Multiplier Circuit
The schematic and a sample of the simulation results of the circuit are illustrated in Figure 
A.29.
(a) (b) 
Figure A.29 4-Bit 2's Complement TG Based Multiplier (a) Schematic (b) 
Simulation 
A.2.2.3 8-Bit 2's Complement TG based Multiplier Circuit
The schematic and a sample of the simulation results of the circuit are illustrated in Figure 
A.30.
(a) (b) 
Figure A.30 8-Bit 2's Complement TG Based Multiplier (a) Schematic (b) 
Simulation 
A.2.3 An Alternative Improved Algorithm for Implementing Signed Multiplication
The additional components necessary to implement the improved signed multiplication 
algorithm are an inverter cell, a multiplexed half adder cell, and a multiplier cell with 
NAND capability. 
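The components listed above resemble the cell inventory of the modified Baugh-Wooley
scheme for 2's complement multiplication, in which the partial products involving exactly
one sign bit are formed with NAND rather than AND, and two constant bits are added. The
sketch below illustrates that scheme as an assumption. This appendix does not name the
algorithm, so the correspondence with the improved design is not confirmed, and the
function name is hypothetical.

```python
def baugh_wooley_multiply(a: int, b: int, n: int) -> int:
    """Signed n x n multiply using the modified Baugh-Wooley scheme (assumed scheme)."""
    mask = (1 << n) - 1
    a_bits = [((a & mask) >> i) & 1 for i in range(n)]  # 2's complement bit pattern
    b_bits = [((b & mask) >> j) & 1 for j in range(n)]

    total = 0
    for i in range(n):
        for j in range(n):
            pp = a_bits[i] & b_bits[j]
            # Exactly one operand bit is a sign bit: the cell acts as a NAND.
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            total += pp << (i + j)

    # Constant correction bits at positions n and 2n-1.
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1  # keep the 2n-bit product

    # Reinterpret the 2n-bit pattern as a signed value.
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total


if __name__ == "__main__":
    print(baugh_wooley_multiply(-5, 7, 4) == -35)
```

In this formulation every partial product is added with positive weight, so the array can
reuse the unsigned cell structure with only the NAND option and the constant bits added.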
TG based multipliers are much more area efficient compared to the symmetric and 
multiplexer based multipliers, hence they will be used to implement all multiplier circuits in 
the following sections. The schematic and the simulation results for the inverter circuit are 
illustrated in Figure A.31. 
(a) (b) 
Figure A.31 Inverter (a) Schematic (b) Simulation 
The schematic and the simulation results for the multiplexed half adder circuit are 
illustrated in Figure A.32. 
(a) (b) 
Figure A.32 Multiplexed Half-Adder (a) Schematic (b) Simulation 
The schematic and the simulation results for the 1-bit 2's complement TG based 
multiplier circuit with NAND capability are illustrated in Figure A.33. 
(a) (b) 
Figure A.33 1-Bit 2's Complement TG Based Multiplier (a) Schematic (b) 
Simulation 
A.2.3.1 Improved 4-Bit 2's Complement TG based Multiplier Circuit
The schematic and a sample of the simulation results of the improved 4-bit 2's complement 
TG based multiplier circuit are illustrated in Figure A.34. 
(a) (b) 
Figure A.34 Improved 4-Bit 2's Complement Multiplier (a) Schematic (b) 
Simulation 
A.2.3.2 Improved 8-Bit 2's Complement Multiplier Circuit
The schematic and a sample of the simulation results of the improved 8-bit 2's complement 
multiplier circuit are illustrated in Figure A.35. 
(a) (b) 
Figure A.35 Improved 8-Bit 2's Complement Multiplier (a) Schematic (b) 
Simulation 
