Computer Architectures Using Nanotechnology by Sun, Yichun
Lehigh University
Lehigh Preserve
Theses and Dissertations
2012
Computer Architectures Using Nanotechnology
Yichun Sun
Lehigh University
Recommended Citation
Sun, Yichun, "Computer Architectures Using Nanotechnology" (2012). Theses and Dissertations. Paper 1155.
COMPUTER ARCHITECTURES
USING NANOTECHNOLOGY
by
Yichun Sun
A Dissertation
Presented to the Graduate Committee
of Lehigh University
in Candidacy for the Degree of
Doctor of Philosophy
in
Electrical Engineering
Lehigh University
January 2012
© Copyright 2011 by Yichun Sun
All Rights Reserved
This dissertation is accepted in partial fulfillment of the requirements for the degree
of Doctor of Philosophy.
(Date)
Meghanad D. Wagh
Miltiadis K. Hatalis
Zhiyuan Yan
Viswanath Annampedu
Acknowledgments
I want to take this opportunity to place on record my deep sense of gratitude to my
Ph.D. advisor, Dr. Meghanad D. Wagh. This work would have been impossible without
his motivation, untiring guidance and help with my research.
I am extremely thankful to Dr. Zhiyuan Yan, Dr. Miltiadis Hatalis and Dr.
Viswanath Annampedu for serving on my dissertation committee and providing
valuable feedback on my research. I especially thank Dr. Viswanath Annampedu,
whose research inspired my work.
Thanks to my parents for being so cooperative and encouraging with all my work
and for giving me a solid educational foundation, which inspired me to accomplish
Ph.D.-level work.
This work was supported in part by NSF under grant ECCS-0925890. I am
grateful for the support. I also thank the Electrical Engineering department of
Lehigh University for the teaching assistantship.
Contents
Acknowledgments
Abstract
1 Introduction
1.1 Motivation
1.2 Strategy and Contribution
1.3 Organization of the Dissertation
2 Background
2.1 Introduction
2.2 Resonant Tunneling Diodes
2.3 Quantum-Dot Cellular Automata
2.4 Threshold Function Fundamentals
2.4.1 Definition and Examples
2.4.2 Implementation in Nanotechnology
2.4.3 Properties of Threshold Functions
2.5 Differences in Design between CMOS and Nano
2.5.1 Design Methodology
2.5.2 Clocking Schemes
2.6 Design restriction of Nanotechnology
3 Adder Architectures Using Nanotechnology
3.1 Introduction
3.2 Conventional Adders using Threshold Logic
3.3 General Scheme for Building New Adders
3.4 A Low Depth Threshold Adder (LDTA)
3.5 A Low Complexity Threshold Adder (LCTA)
3.6 An Enhanced Low Delay Threshold Adder (ELDTA)
3.7 Discussion and Conclusion of Threshold Adders
4 Tree Implementation of Combinational Functions
4.1 Introduction
4.1.1 Background
4.2 k-ary tree decomposition structure
4.3 Comparison Function Decomposition
4.4 Conclusion
5 Systolic Implementation of Threshold Function Decomposition
5.1 Introduction
5.1.1 Systolic Architecture
5.1.2 Decomposition of Threshold Functions
5.2 Systolic System for Nanotechnology
5.2.1 Clocking Scheme
5.2.2 General Scheme of Decomposing a Threshold Function with Systolic System
5.2.3 The Whole System
5.3 Applications and Examples
5.3.1 Majority Gate
5.3.2 Pattern Matching Machine
5.4 Conclusion and Discussion
6 Digital Circuits for Quantum-Dot Cellular Automata
6.1 Introduction
6.2 Quantum-Dot Cellular Automata
6.3 Comparison Function
6.4 A Majority Gate Adder
6.5 Conclusion
7 Conclusions
7.1 Future Work
Bibliography
Vita
List of Tables
2.1 Determining weights and threshold of carry $c_2$ of a 3-bit addition.
2.2 Determining weights and threshold of function $G_{2:0}$.
2.3 Determining weights and threshold of function $T_{2:0}$.
2.4 Complexity comparison of CMOS and nanotechnology implementations.
3.1 Comparison of Carry Lookahead Adder, CPA, GCLA and our LDTA.
3.2 Comparison of the delay of threshold implementations of CPA, GCLA, LDTA and ELDTA using gates with fan-in bound of 5.
3.3 Comparison of the complexity of threshold implementations of CPA, GCLA, LDTA and ELDTA using gates with fan-in bound of 5.
4.1 Comparison of binary tree comparator and quaternary tree comparator with fan-in bound of 4.
4.2 Comparison of binary tree comparator and quaternary tree comparator with fan-in bound of 8.
5.1 Data of gates at cycle $t$ and $t + 1$.
5.2 Size of majority gate implemented with bounded fan-in. Results of [1, 2] are shown in parentheses for comparison.
5.3 Depth of majority gate implemented with bounded fan-in. Results of [1, 2] are shown in parentheses for comparison.
List of Figures
2.1 I–V characteristics of an RTD.
2.2 Schematic representation of a MOBILE.
2.3 Basic four-dot QCA cell showing the two possible polarizations.
2.4 A QCA wire.
2.5 RTD implementation of a threshold function.
2.6 QCA implementation of a 3-input majority function.
2.7 Four-phase clocking scheme for nanotechnology.
3.1 Carry computation followed by the sum calculation in a Carry Propagation Adder (CPA).
3.2 Computing all the carries of an 8-bit LDTA with a fan-in bound of 5.
3.3 Carry computation in a 16-bit LCTA with fan-in bound of 5.
3.4 The architecture of a 30-bit ELDTA with fan-in $2M + 1 = 5$.
4.1 A threshold function of $n$ variables and threshold $T$ partitioned into four fragments and recreated by a tree of recombiners. $R$ equals the sum of all positive weights minus $T$.
4.2 A general structure of a k-ary tree decomposition.
4.3 A ternary tree decomposition of a comparator of size 18 with fan-in bounded by 4.
5.1 A six-phase clocking scheme with the evaluate (E), hold (H), reset (R) and wait (W) states.
5.2 General scheme of decomposing a threshold function by serially combining the fragment outputs.
5.3 Parallel to serial converter.
5.4 Systolic implementation of the recombiners in Fig. 5.2.
5.5 The first recombining stage.
5.6 The first stage of the complete system.
5.7 A 16-bit majority gate with fan-in bound 4.
6.1 (a) The basic QCA cell, (b) a polarized QCA cell representing logic 1 and (c) a polarized QCA cell representing logic 0.
6.2 An inverter implementation in QCA.
6.3 A 3-input majority gate implementation in QCA and its symbol.
6.4 QCA realization of function $f$ which is 1 only when its argument is between $X_1$ and $Y_1$ or between $X_2$ and $Y_2$. Note that $C(\cdot)$ is a comparison function.
6.5 A serial realization of comparison $C(B)$ which outputs a 1 when $s = x_{n-1}x_{n-2} \cdots x_0 \ge b_{n-1}b_{n-2} \cdots b_0$.
6.6 A 4-bit comparator architecture which produces a 1 when $x_3x_2x_1x_0 \ge b_3b_2b_1b_0$. (The majority gates in gray need not be implemented.)
6.7 Layout of a 2-bit comparator in QCA technology.
6.8 The same comparator as in Fig. 6.6 after elimination of wire crossings.
6.9 Calculating carries in an 8-bit LDTA using only 3-input majority gates.
Abstract
According to the International Technology Roadmap for Semiconductors, emerging
research devices such as Resonant Tunneling Diodes/Transistors (RTD/RTT),
Single Electron Transistors (SET) and Quantum Cellular Automata (QCA) are
expected to start replacing CMOS devices in many applications by the end of the
next decade. Unfortunately, these new technologies cannot implement traditional
Boolean logic efficiently. On the other hand, they are well suited for threshold
logic. Clearly, along with the development of the new devices, one should also
explore new design techniques that are compatible with these devices.
This work focuses on the development of strategies for design and implementation
of computer architectures with nanotechnology. Recent studies have demonstrated
that the reliability of a nanoelectronic logic gate is dependent upon its fan-in. All
the designs presented in this work therefore employ bounded fan-in threshold logic.
We develop general schemes to decompose any threshold function into a network
of threshold gates with bounded fan-in using a k-ary tree structure. For some
applications, e.g., comparison function, the decomposition scheme leads to a circuit
that has a lower complexity and higher speed.
A new strategy to design adders which form the fundamental logic block of
any computational unit will be discussed. This strategy allows one to design adders
with required speed and complexity using threshold gates with a given fan-in bound.
We show that by exploiting the properties of threshold functions, one can get new
adder architectures that are much faster and far less complex than the Group Carry
Look-ahead Adders (GCLA) presently employed in most modern processors. Our
strategies also allow us a three-way trade-off between the hardware complexity, the
reliability and the speed.
Different implementation styles with nanoelectronic threshold gates will also be
presented. We show that a systolic implementation of threshold logic can allow one
to implement circuits with extremely low hardware complexity but with some loss
of speed. We also develop designs using only majority gates which are well suited
for Quantum-Dot Cellular Automata (QCA) technology.
Chapter 1
Introduction
1.1 Motivation
For the last several decades, complementary metal-oxide semiconductor (CMOS)
technology has dominated the implementation of very large scale integrated (VLSI)
systems. CMOS delivers high speed and consumes low power, two advantages
that are important in today's demanding mobile applications.
The speed of a circuit can be increased and its power dissipation decreased by
reducing the physical size of the CMOS device. However, this brings in additional
problems such as leakage currents. It is therefore becoming increasingly difficult
to shrink CMOS devices to further increase speed and reduce power. According
to the International Technology Roadmap for Semiconductors [3], further
shrinking of CMOS will become uneconomical by 2020 and new devices based upon
nanotechnology will start replacing CMOS devices.
The physics and material technology required for nanotechnology are already
well understood and some simple nanoscale devices have been
created. Digital systems based on nanotechnology are expected to have smaller area
(and therefore higher density), higher speed and lower power dissipation. For example,
devices based upon Resonant Tunneling Diodes/Transistors (RTD/RTT)
are expected to reach switching speeds up to 16 THz, those based upon (molecular)
Quantum Cellular Automata (QCA) are expected to have packing densities of $10^{12}$
devices/cm$^2$ and those based upon Single Electron Technology (SET) are expected
to have switching energies as low as $1 \times 10^{-18}$ J [3]. All three of these technologies,
and simple devices based on them, have been demonstrated in various laboratories
around the world.
Though these technologies are quite different from each other, the basic logic
block implemented by all the three is a threshold gate [4–7]. In the CMOS technology
also, a threshold gate can be efficiently realized within the Field Programmable Gate
Array (FPGA). This is because the FPGA implements logic functions using Look
Up Tables (LUT). A threshold function can be directly realized by an LUT from its
truth table. As a result, both in the current CMOS technology and in the future
technologies such as RTD/RTT, SET and QCA, threshold logic plays an important
role.
Basic Boolean gates such as AND/OR/NOT/NAND/NOR are variations of
the threshold gate. Thus one can always express the conventional gates in a digital
design as threshold gates, thereby providing its nanoelectronic (or FPGA)
implementation. However, this design translation fails to harness the true power of the
threshold gate, which allows one to realize very complex Boolean expressions, often
through single gates. (See Examples 1-6 in Subsec. 2.4.3.) Clearly there is a need to
develop strategies to design digital systems directly for threshold implementation.
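As a small illustration of the first point, the sketch below models a threshold gate as "output 1 when the weighted sum of the inputs reaches the threshold" (the definition given in Sec. 2.4) and checks that the basic gates fall out of it. The helper names `th`, `truth_table` and `th_not`, and the particular weights and thresholds, are illustrative assumptions rather than values taken from the text; other choices realize the same gates.

```python
from itertools import product


def th(inputs, weights, threshold):
    """Threshold gate: 1 when the weighted input sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


# Illustrative (weights, threshold) pairs; these are assumptions of this sketch.
GATES = {
    "AND": ((1, 1), 2),
    "OR": ((1, 1), 1),
    "NAND": ((-1, -1), -1),
    "NOR": ((-1, -1), 0),
}


def truth_table(name):
    """Outputs of the named gate for inputs 00, 01, 10, 11 (in that order)."""
    weights, threshold = GATES[name]
    return [th(x, weights, threshold) for x in product((0, 1), repeat=2)]


def th_not(x):
    """Inverter as a one-input threshold gate with weight -1 and threshold 0."""
    return th((x,), (-1,), 0)


for gate in GATES:
    print(gate, truth_table(gate))
print("NOT", [th_not(0), th_not(1)])
```

Each conventional gate maps to one threshold gate here, which is the direct translation the paragraph describes; it does not exploit the richer functions a single threshold gate can realize.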
A Boolean function that can be realized with a single threshold gate is known
as a threshold function. Finding a threshold implementation of a given Boolean
function implies decomposing that function into one or more threshold functions.
Unfortunately, the number of threshold functions is highly restricted. For example,
while there are $2^{2^4}$ = 65,536 Boolean functions of four variables, only 1,882
(≈ 2.87%) of these are threshold functions. Similarly, out of more than four billion
Boolean functions of five variables, less than 0.0023% (exactly 94,572) are threshold
functions [8]. There is also another factor which complicates these designs. It has been
shown that the reliability of nanoelectronic gates is greatly dependent on their
fan-in¹ [9–11]. This implies that the intended designs can only use threshold gates
with a specified fan-in bound. However, unlike AND/OR gates, which can be easily
decomposed into similar gates with smaller fan-in, threshold gates with large
fan-in cannot be readily decomposed into threshold gates with a smaller fan-in. Thus
the design strategies should consider implementation with bounded fan-in threshold
gates from the onset.
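The counts quoted above can be reproduced for small numbers of variables by brute force. The sketch below enumerates integer weight vectors within a small range, tries every threshold that can change the output, and counts the distinct truth tables obtained. The function name and the weight bounds passed to it (1, 2 and 3 for two, three and four variables) are assumptions of this sketch, not something stated in the text; a bound that is too small would undercount.

```python
from itertools import product


def count_threshold_functions(n, max_weight):
    """Count distinct n-variable truth tables realizable by one threshold gate
    whose integer weights lie in [-max_weight, max_weight] (assumed bound)."""
    inputs = list(product((0, 1), repeat=n))
    tables = set()
    for w in product(range(-max_weight, max_weight + 1), repeat=n):
        sums = [sum(wi * xi for wi, xi in zip(w, x)) for x in inputs]
        # Only thresholds equal to an attained sum, or just above the largest
        # sum (constant 0), can produce different functions for this w.
        for t in set(sums) | {max(sums) + 1}:
            tables.add(tuple(int(s >= t) for s in sums))
    return len(tables)


for n, bound in ((2, 1), (3, 2), (4, 3)):
    total = 2 ** (2 ** n)
    found = count_threshold_functions(n, bound)
    print(f"n={n}: {found} threshold functions out of {total} Boolean functions")
```

The same enumeration for five variables is possible in principle but needs a larger weight range and considerably more time, so it is left out here.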
Naturally, nanoelectronic device technology research has attracted a great
deal of attention recently. Unfortunately, research on developing new
architectures using these devices is lagging far behind. This dissertation focuses on the
application aspect in the hope that, as the technology matures, so will the techniques
to use these new devices meaningfully.
1.2 Strategy and Contribution
This dissertation is focused on the development of threshold logic architectures for
key arithmetic and logic units such as comparators, adders and majority gates. All
of the circuits that we design use bounded fan-in gates. We develop different design
schemes to optimize properties such as speed and complexity. Some of
our design strategies even allow trade-offs between the two. Since our procedures
are designed from the onset to be implemented in nanotechnology, they use fewer
devices and have smaller size and lower delay as compared to the direct translation
of traditional CMOS designs to nanotechnology.
We develop a general scheme for decomposing arbitrary threshold logic into
threshold gates with bounded fan-in for higher reliability. An example application,
a comparator, is given to illustrate the general decomposition scheme. The
resultant circuit uses fewer gates and interconnects while having a lower delay.
¹ The number of inputs to a gate is called its fan-in.
A large part of this research is devoted to developing faster and lower complexity
designs of adders, which play a central role in arithmetic computations. For this, we
develop an entirely new framework that can be used to develop threshold adders
with any desirable features. The most common CMOS high performance adder,
the Group Carry Look-ahead Adder is based on the carry-generation and carry-
propagation logic primitives. However, the carry-propagation primitive uses ExOR
gates which are complex to implement in threshold logic. Our new framework
replaces this primitive with a new logic primitive that is well suited for threshold
implementation. Using this new framework, we obtain designs for a Low Delay
Threshold Adder (LDTA) and a Low Complexity Threshold Adder (LCTA) [12].
We then provide a design procedure for an Enhanced Low Delay Threshold Adder
(ELDTA) which has an improved architecture to give an even lower delay and lower
hardware complexity. A comparison with threshold implementations of traditional
adders such as the Carry Propagation Adder (CPA) and the Group Carry Look-
ahead Adder (GCLA) clearly shows the benefits of this new scheme. Note that
for the purpose of this comparison, these traditional adders were also modified to
optimize their performance when implemented with bounded fan-in threshold gates.
It needs to be stressed that the strategy presented in this work is not limited to
LDTA, LCTA or ELDTA, and can indeed be used to obtain a variety of adders with
different trade-offs between the speed and the hardware complexity.
We also investigated threshold logic with feedback loops. As a result, we have
been able to give a novel design strategy to implement systolic architectures with
bounded fan-in threshold gates. This design features extremely low hardware
complexity with a serial output stream. Since systolic circuits have feedback loops
instead of one-directional data flow, a new clocking scheme with six phases is
introduced. Two example applications, the majority logic implementation and
the pattern matching machine, are provided to substantiate our claims of very low
complexity.
Lastly, we provide digital circuit designs for Quantum-Dot Cellular Automata
(QCA). Since QCA can efficiently implement only a very limited set of digital
circuits, including the 3-input majority gate, we develop a comparator [13] and an
adder using majority gates only, so that they are suitable for QCA implementation.
1.3 Organization of the Dissertation
This dissertation is organized as follows. In Chapter 2 we introduce the concept of
threshold functions and their nanotechnology implementation using Resonant
Tunneling Diodes (RTD) and Quantum-Dot Cellular Automata (QCA). We also discuss
the differences in digital design between traditional CMOS technology and the new
nanotechnology. In Chapter 3, we discuss the nanotechnology implementation of
adder architectures, which are key combinational logic blocks in most arithmetic
circuits. We give the implementation of the traditional group carry look-ahead
adder (GCLA) and present new adder architectures, which show attractive features
of low hardware complexity and high speed. Chapter 4 then shows a general scheme
for implementing combinational functions using a tree architecture.
A systolic implementation of some combinational and sequential logic is discussed
in Chapter 5. In Chapter 6 we discuss digital circuits designed specifically for
Quantum-Dot Cellular Automata (QCA) implementation. Conclusions and future
work are discussed in Chapter 7.
Chapter 2
Background
2.1 Introduction
According to the International Technology Roadmap for Semiconductors, emerging
research devices such as Resonant Tunneling Diodes/Transistors (RTD/RTT),
Single Electron Transistors (SET) and Quantum Cellular Automata (QCA) are
expected to start replacing CMOS devices in many applications by the end of the
next decade. This chapter introduces digital design using nanoscale devices.
Focusing on RTD and QCA, it discusses the fundamentals of these devices.
These devices are well suited for threshold functions, which are more powerful than
the Boolean logic primitives. The fundamentals of threshold logic are presented in
this chapter. It also discusses threshold implementations of some Boolean functions
which were traditionally built with CMOS gates.
This chapter is organized as follows. Sec. 2.2 and Sec. 2.3 introduce the
fundamentals of RTD and QCA respectively. In Sec. 2.4 we then provide the definition
and properties of the basic logic block implemented by these devices, namely the
threshold function. The implementation of threshold functions using RTD and
QCA is discussed in this section as well. Examples of building complex Boolean
functions using nanoscale devices are presented to illustrate the properties as well
as the power of digital design based on nanotechnology. Finally, Sec. 2.5
discusses the differences between designs using CMOS and nanotechnology.
2.2 Resonant Tunneling Diodes
Integrated circuit designers are constantly striving to reduce the size of
transistors so as to reduce the complexity and power, and increase the speed and yield.
However, there exists a fundamental size limit beyond which one cannot shrink the
transistor. If one goes from the µm scale to the nanometer scale, current sub-µm Si-CMOS
transistors cease to operate. In particular, at base region widths comparable to the
electron wavelength, the potential barrier at the base leaks severely, allowing electrons
to tunnel from the emitter to the collector. At these sizes, conventional
transistors lose their switching ability. On the other hand, this same tunneling effect can
be exploited in nanoscale devices such as resonant tunneling diodes (RTDs) and
resonant tunneling transistors (RTTs).
RTDs and RTTs use tunneling of electrons at discrete energy levels in a double
barrier quantum well structure. The tunneling phenomena create the effect of
negative resistance, giving the current-voltage (I–V) characteristics shown in Fig. 2.1.
The voltage and the current at the peak of the characteristic curve in Fig. 2.1 are
referred to as the peak voltage ($V_P$) and peak current ($I_P$) respectively.
Figure 2.1: I–V characteristics of an RTD (current I, axis scaled by $10^{-3}$, versus voltage V).
RTDs and RTTs are attractive because they are the most mature of all the
nanoelectronic devices [14, 15] and have already been demonstrated and studied by
many researchers. Many large signal models for RTDs and RTTs have also been
proposed [16–19].
The basic RTD structure for implementing digital logic is the monostable-bistable
transition logic element (MOBILE) [20–22]. A MOBILE consists of two serially
connected RTDs with a trapezoidal oscillating bias voltage $V_{SWP}$ as shown in Fig. 2.2.
Figure 2.2: Schematic representation of a MOBILE, showing RTDs A and B, the bias $V_{SWP}$ and the output $V_{out}$.
The MOBILE computes when $V_{SWP}$ is raised from zero to a voltage slightly
above $2V_P$. At the end of the compute cycle, the output voltage $V_{out}$ is high (a
logic 1) if the peak current of RTD A is larger than that of RTD B. Voltage $V_{out}$
is low (logic 0) if the peak current of RTD B is larger. One may add additional
RTDs in parallel to RTDs A and B to change the effective peak current in the top
or bottom section of the MOBILE. Further, these additional RTDs can be brought
in and out of the circuit by series HFETs which are driven by logical inputs.
Thus the output state of a MOBILE is decided by the input variables. Using this
and other similar configurations, researchers have already implemented many logic
functions [4, 23–25].
2.3 Quantum-Dot Cellular Automata
Quantum-dot Cellular Automata (QCA), first described in [26], provides a very
different computation platform from traditional CMOS. In QCA, the polarization,
rather than the current, contains the digital information. Further, the digital gates
and the interconnections are all made up of the same cells in QCA.
A basic QCA cell consists of four quantum dots in a square array coupled by
tunnel barriers. The cell is loaded with two extra electrons which tend to occupy
diagonally opposite positions in the cell because of Coulomb repulsion. Therefore
two stable polarization states exist in this cell, which can be used to represent logic
values 1 and 0 as shown in Fig. 2.3.
Figure 2.3: Basic four-dot QCA cell showing the two possible polarizations (binary 1 and binary 0).
Because the electrons are quantum mechanical particles, they are able to tunnel
between the dots in a cell. The electrons in cells placed adjacent to each other
interact. As a result, Coulomb interactions between the electrons force neighboring
cells to synchronize their polarization. Therefore an array of QCA cells acts as a wire,
shown in Fig. 2.4, and is able to transmit information from one end to the other [26];
i.e., all the cells in the wire switch their polarizations to follow that of the input
or driver cell.
Figure 2.4: A QCA wire.
QCA is ideal for implementing inverters and three-input majority gates. By
fixing one input at 0 or 1, a three-input majority gate can be converted to a two-input
AND or OR gate respectively. Thus any Boolean logic can be implemented
using QCA, but with poor efficiency. In the past, QCA architectures of the XOR
gate, crossing wires, minority gate, full adder, and a comparison function have also
been investigated [13, 27–30].
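The conversion of a majority gate into AND and OR gates can be checked with a short sketch. The function names below are hypothetical and the model is purely logical; it says nothing about QCA cell layout or clocking.

```python
from itertools import product


def maj3(a, b, c):
    """Three-input majority: 1 when at least two inputs are 1."""
    return int(a + b + c >= 2)


def and2(a, b):
    """Two-input AND obtained by fixing the third majority input at 0."""
    return maj3(a, b, 0)


def or2(a, b):
    """Two-input OR obtained by fixing the third majority input at 1."""
    return maj3(a, b, 1)


for a, b in product((0, 1), repeat=2):
    print(a, b, "AND:", and2(a, b), "OR:", or2(a, b))
```

Together with the inverter, these two derived gates are enough to build any Boolean function, which is the sense in which QCA is logically complete even though the resulting circuits may be inefficient.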
2.4 Threshold Function Fundamentals
2.4.1 Definition and Examples
As explained earlier, nanotechnologies such as RTD, RTT and QCA allow direct
implementation only of threshold functions. Threshold functions are attractive
because they can express complex Boolean expressions in a much
more hardware-efficient manner. A Boolean function $f(x_1, x_2, \ldots, x_n)$ is called a
threshold function if there exist real numbers $w_1, w_2, \ldots, w_n$ and $T$ such that
$$f(x_1, x_2, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i \ge T \\ 0 & \text{otherwise.} \end{cases} \tag{2.1}$$
The constants $w_1, w_2, \ldots, w_n$ are called the weights of the inputs and $T$ the
threshold. We denote a threshold function as $TH(x_1, x_2, \ldots, x_n; w_1, w_2, \ldots, w_n; T)$. Since
scaling the weights and the threshold by the same positive amount has no effect on the inequality
in (2.1), we use integer values for the weights and the threshold. A Generalized Majority
function is a threshold function with all unit weights. A Majority function is a
generalized majority function with a threshold equal to $\lfloor n/2 \rfloor + 1$.
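The definition in (2.1) translates almost literally into code. The following is a minimal sketch with hypothetical function names; `same_function` is an added helper that compares two weight/threshold settings over all inputs, used here to illustrate that scaling by a positive constant leaves the function unchanged.

```python
from itertools import product


def th(x, w, t):
    """Evaluate TH(x; w; T) as in (2.1): 1 if sum(w_i * x_i) >= T, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= t)


def generalized_majority(x, t):
    """Threshold function with all unit weights and threshold t."""
    return th(x, [1] * len(x), t)


def majority(x):
    """Generalized majority with threshold floor(n/2) + 1."""
    return generalized_majority(x, len(x) // 2 + 1)


def same_function(n, w1, t1, w2, t2):
    """True if both settings give the same output on all 2**n inputs."""
    return all(th(x, w1, t1) == th(x, w2, t2) for x in product((0, 1), repeat=n))


print(majority((1, 1, 0)))            # two of three inputs are 1
print(majority((1, 1, 0, 0)))         # two of four inputs is not a majority
print(same_function(3, (2, -1, 1), 1, (4, -2, 2), 2))  # weights and T doubled
```

Note that for an even number of inputs the threshold $\lfloor n/2 \rfloor + 1$ requires a strict majority, so a tie evaluates to 0.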
2.4.2 Implementation in Nanotechnology
Realizations of threshold functions are called threshold gates. A variety of physical
effects can be employed to create a weighted sum of the inputs and then compare
it with a preset threshold. These include voltage or current scaling and summation
using operational amplifiers [31], and charge deposition and summation using a parallel
capacitance network [32]. Recently, nanotechnology devices such as QCA, RTD
and RTT, described in the previous sections, have been predominantly used for implementing
threshold functions.
A MOBILE, along with HFETs as switches, is an ideal solution for computing
a weighted sum and comparing it with a threshold, due to the I–V characteristics
of RTDs. As an example, Fig. 2.5 shows the threshold function $x + \bar{y}z =
TH(x, y, z; 2, -1, 1; 1)$ implemented as a MOBILE; the negative weight on $y$
corresponds to $y$ appearing in complemented form.
Figure 2.5: RTD implementation of a threshold function.
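The weights and threshold quoted for this example can be checked exhaustively. The sketch below, with hypothetical helper names, compares the gate with weights (2, −1, 1) and threshold 1 against two candidate Boolean expressions over all eight input combinations. It indicates that the gate matches x OR (NOT y AND z), that is, with y complemented, and does not match x OR (y AND z).

```python
from itertools import product


def th(x, w, t):
    """Threshold gate: 1 when the weighted sum of inputs is at least t."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= t)


WEIGHTS, THRESHOLD = (2, -1, 1), 1  # values quoted in the text


def agrees_with(candidate):
    """True if the gate equals `candidate` on all eight input combinations."""
    return all(
        th((x, y, z), WEIGHTS, THRESHOLD) == candidate(x, y, z)
        for x, y, z in product((0, 1), repeat=3)
    )


y_complemented = agrees_with(lambda x, y, z: x | ((1 - y) & z))
y_uncomplemented = agrees_with(lambda x, y, z: x | (y & z))
print("matches x + (not y) z:", y_complemented)
print("matches x + y z:      ", y_uncomplemented)
```

The input x = 0, y = 0, z = 1 is the distinguishing case: the weighted sum is 1, which meets the threshold, while y AND z is 0.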
As mentioned in Sec. 2.3, QCA is able to realize a three-input majority function.
This is an important threshold function and is widely used in this work. Fig. 2.6 shows
a QCA implementation of a 3-input majority function.
Figure 2.6: QCA implementation of a 3-input majority function.
2.4.3 Properties of Threshold Functions
The following characteristics of threshold functions are important in the rest of the
dissertation. Properties (P4)–(P6) are not available in the literature but are crucial to
this work.
Theorem 1 (Properties of threshold functions.)
(P1): $x_1 + x_2 = TH(x_1, x_2; 1, 1; 1)$.
(P2): $x_1 x_2 = TH(x_1, x_2; 1, 1; 2)$.
(P3): $x_1 x_2 + x_3(x_1 + x_2) = TH(x_1, x_2, x_3; 1, 1, 1; 2) = MAJ(x_1, x_2, x_3)$.
(P4): Let $f = TH(x_1, x_2, \ldots, x_n; w_1, w_2, \ldots, w_n; T)$ where each $w_i \ge 0$. Then
a function $g$ of $n + 2$ variables, including two new variables $x_{n+1}$ and $x_{n+2}$,
defined as $g = x_{n+1}x_{n+2} + (x_{n+1} + x_{n+2})f$ is also a threshold function. In
particular, $g = TH(x_1, x_2, \ldots, x_n, x_{n+1}, x_{n+2}; w_1, w_2, \ldots, w_n, w_{n+1}, w_{n+1}; T')$,
where the threshold $T'$ of $g$ is chosen to satisfy $T' \ge 2T$ and $T' > \sum_{i=1}^{n} w_i$, and
$w_{n+1} = T' - T$.
(P5): Define a series of functions recursively as $f_1 = x_2 + x_1 x_0$ and $f_i =
x_{2i} + x_{2i-1} f_{i-1}$, $i = 2, 3, 4, \ldots$. Then each function $f_i$ is a threshold function.
In particular, $f_i = TH(x_0, x_1, x_2, \ldots, x_{2i}; F_0, F_1, F_2, \ldots, F_{2i}; F_{2i})$, where $F_j$ is the
$j$th number in the Fibonacci sequence, defined by the difference equation
$F_j = F_{j-1} + F_{j-2}$ with the initial conditions $F_0 = F_1 = 1$.
(P6): Define a series of functions recursively as $f_1 = x_1 + x_0$ and $f_i = x_{2i-1} +
x_{2i-2} f_{i-1}$, $i = 2, 3, 4, \ldots$. Then each function $f_i$ is a threshold function. In
particular, $f_i = TH(x_0, x_1, \ldots, x_{2i-1}; F_0, F_1, \ldots, F_{2i-1}; F_{2i-1})$, where $F_j$ is the $j$th
number in the Fibonacci sequence.
(P7): The function $f_i$ of $2i + 1$ variables in (P5) above can be realized using a
chain of $\lceil i/M \rceil$ threshold gates with fan-in bound of $2M + 1$. The total number
of inputs to these gates equals $2i + \lceil i/M \rceil$.
(P8): A function $h_i = x_0 x_1 x_2 \cdots x_i$ of $i + 1$ variables can be realized using a
chain of $\lceil i/(2M) \rceil$ threshold gates with fan-in bound of $2M + 1$. The total
number of inputs to these gates equals $i + \lceil i/(2M) \rceil$.
Proof (Sketch). Properties (P1)-(P3) can be verified using the function truth tables.
Property (P4) can be proved by separately considering the cases in which the triple
$(x_{n+1}, x_{n+2}, f)$ takes distinct values and verifying that in each case, the given weights
and threshold provide the correct value of the function $g$.
Proofs of properties (P5) and (P6) are similar. We therefore focus only on (P5),
which is proved by mathematical induction. In property (P5), $f_1$ can be easily
verified to be a threshold function with the weights of $x_0$, $x_1$ and $x_2$ being 1, 1 and 2 and the
threshold being 2. Assume that $f_{i-1}$ is a threshold function with the weights and
threshold as stated in the theorem. Function $f_i$ should be 1 if either $x_{2i} = 1$ or if
$x_{2i-1} = 1$ and $f_{i-1} = 1$. In the first of these cases, the weight of $x_{2i}$ is the same as the
threshold $F_{2i}$. Thus (2.1) implies that $f_i = 1$. In the second case, since $f_{i-1} = 1$, the
weighted sum of the variables $x_0$ through $x_{2i-2}$ is at least equal to the threshold of
$f_{i-1}$, namely $F_{2i-2}$. Since $x_{2i-1} = 1$ as well, the weighted sum of all the variables
$x_0$ through $x_{2i}$ of $f_i$ is at least equal to $F_{2i-2} + F_{2i-1} = F_{2i}$, the threshold of $f_i$.
Therefore, again from (2.1), one gets that $f_i = 1$. Similarly, using the identity
$F_0 + F_1 + \cdots + F_{2i-2} = F_{2i} - 1$, one can show that when $x_{2i} = 0$ and at least one
of $x_{2i-1}$ or $f_{i-1}$ equals 0, the weighted sum of the variables in $f_i$ is less than $F_{2i}$, the
threshold of $f_i$. Therefore $f_i = 0$, as expected from the form of the function.
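The induction argument can also be checked by brute force for small $i$. The following is a minimal sketch, assuming the threshold definition (2.1) means "output 1 when the weighted sum is at least the threshold"; the helper names `th`, `fib`, `f_p5`, `f_p6` and `check` are illustrative and not part of the original text.

```python
from itertools import product

def th(xs, ws, t):
    """Threshold gate: 1 when the weighted sum reaches the threshold t."""
    return int(sum(x * w for x, w in zip(xs, ws)) >= t)

def fib(n):
    """First n Fibonacci numbers with F0 = F1 = 1 (n >= 2)."""
    f = [1, 1]
    while len(f) < n:
        f.append(f[-1] + f[-2])
    return f[:n]

def f_p5(x):
    """f_i of (P5) for x = (x0, ..., x_{2i}), evaluated by its recursion."""
    val = x[2] | (x[1] & x[0])                      # f_1
    for j in range(2, (len(x) - 1) // 2 + 1):
        val = x[2 * j] | (x[2 * j - 1] & val)       # f_j
    return val

def f_p6(x):
    """f_i of (P6) for x = (x0, ..., x_{2i-1}), evaluated by its recursion."""
    val = x[1] | x[0]                               # f_1
    for j in range(2, len(x) // 2 + 1):
        val = x[2 * j - 1] | (x[2 * j - 2] & val)   # f_j
    return val

def check(i):
    """Compare both recursions with their Fibonacci-weighted threshold forms."""
    w5 = fib(2 * i + 1)                             # F0 .. F_{2i}
    ok5 = all(f_p5(x) == th(x, w5, w5[-1])
              for x in product((0, 1), repeat=2 * i + 1))
    w6 = fib(2 * i)                                 # F0 .. F_{2i-1}
    ok6 = all(f_p6(x) == th(x, w6, w6[-1])
              for x in product((0, 1), repeat=2 * i))
    return ok5, ok6
```

Calling `check(i)` for a few small values of `i` enumerates every input assignment, so it is only practical while $2^{2i+1}$ stays small.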
Property (P7) is proved by construction. One can partition the original function
into a series of functions: the first function with variables $x_0$ through $x_{2M}$, the second
with the output of the previous function along with variables $x_{2M+1}$ through $x_{4M}$,
and so on until all variables are exhausted. There are $\lceil i/M \rceil$ functions in this series.
Each of these functions can be implemented as a single threshold gate because of
(P5) if the fan-in is odd and because of (P5) and (P6) if the fan-in is even.
Property (P8) is proved similarly. The original function can be partitioned into
a series of functions: the first with variables $x_0$ through $x_{2M}$, the second with the
output of the first function and $2M$ more variables, and so on until all $i + 1$ variables
are exhausted. There are $\lceil i/(2M) \rceil$ such functions, each of which is a $(2M + 1)$-input
AND gate, which is obviously a threshold function.
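The construction used for (P7) can be sketched in code. This is an illustrative model only: it assumes an odd fan-in bound of $2M + 1$, reuses Fibonacci weights for every gate in the chain, and the names `chain_p7`, `f_p5`, `fib` and `th` are hypothetical helpers introduced here.

```python
from itertools import product

def th(xs, ws, t):
    """Threshold gate: 1 when the weighted sum reaches the threshold t."""
    return int(sum(x * w for x, w in zip(xs, ws)) >= t)

def fib(n):
    """First n Fibonacci numbers with F0 = F1 = 1 (n >= 2)."""
    f = [1, 1]
    while len(f) < n:
        f.append(f[-1] + f[-2])
    return f[:n]

def f_p5(x):
    """Reference value of f_i from (P5) for x = (x0, ..., x_{2i})."""
    val = x[2] | (x[1] & x[0])
    for j in range(2, (len(x) - 1) // 2 + 1):
        val = x[2 * j] | (x[2 * j - 1] & val)
    return val

def chain_p7(x, M):
    """Evaluate f_i with a chain of gates whose fan-in is at most 2M + 1.

    Returns (output, number of gates, total number of gate inputs).
    """
    out, pos, gates, inputs = x[0], 1, 0, 0
    while pos < len(x):
        block = tuple(x[pos:pos + 2 * M])   # up to 2M fresh variables
        gate_in = (out,) + block            # previous result feeds the next gate
        w = fib(len(gate_in))               # weights F0 .. F_{len-1}
        out = th(gate_in, w, w[-1])
        pos += len(block)
        gates += 1
        inputs += len(gate_in)
    return out, gates, inputs
```

For $i = 4$ and $M = 2$ (fan-in bound 5) the chain has two gates and ten gate inputs, which agrees with the counts $\lceil i/M \rceil$ and $2i + \lceil i/M \rceil$ stated in (P7).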
Application of Theorem 1 is illustrated by the following examples.
Example 1 By using (P2) followed by (P4), one gets
a1b1 + (a1 + b1)a0b0 = TH(a0, b0, a1, b1; 1, 1, 2, 2; 4).
Example 2 By using (P2) followed by (P4) twice, one gets
a2b2 + (a2 + b2)(a1b1 + (a1 + b1)a0b0) = TH(a0, b0, a1, b1, a2, b2; 1, 1, 2, 2, 4, 4; 8).
Example 3 By using (P1) followed by (P4) twice, one gets
a2b2 + (a2 + b2)(a1b1 + (a1 + b1)(a0 + b0))
= TH(a0, b0, a1, b1, a2, b2; 1, 1, 2, 2, 4, 4; 7).
Example 4 By using (P3) followed by (P4) twice, one gets
a2b2 + (a2 + b2)(a1b1 + (a1 + b1)(a0b0 + c−1(a0 + b0)))
= TH(c−1, a0, b0, a1, b1, a2, b2; 1, 1, 1, 2, 2, 4, 4; 8).
Example 5 From (P5) one gets
g3 + p3(g2 + p2(g1 + p1(g0 + p0c−1)))
= TH(c−1, p0, g0, p1, g1, p2, g2, p3, g3; 1, 1, 2, 3, 5, 8, 13, 21, 34; 34).
Example 6 Using (P7), the function in Example 5 can be implemented by threshold
gates with fan-in bound of 5 as:
TH(TH(c−1, p0, g0, p1, g1; 1, 1, 2, 3, 5; 5), p2, g2, p3, g3; 1, 1, 2, 3, 5; 5).
Table 2.1: Determining weights and threshold of carry c2 of a 3-bit addition.
function                     weights                        threshold  property
                             a2  b2  a1  b1  a0  b0  c−1               used
c−1                                                  1      1          P1
c0 = a0b0 + (a0 + b0)c−1                     1   1   1      2          P4
c1 = a1b1 + (a1 + b1)c0              2   2   1   1   1      4          P4
c2 = a2b2 + (a2 + b2)c1      4   4   2   2   1   1   1      8          P4
Table 2.2: Determining weights and threshold of function G2:0.
function                       weights                  threshold  property
                               a2  b2  a1  b1  a0  b0              used
G0:0 = a0b0                                    1   1    2          P2
G1:0 = a1b1 + (a1 + b1)G0:0            2   2   1   1    4          P4
G2:0 = a2b2 + (a2 + b2)G1:0    4   4   2   2   1   1    8          P4
Example 7 Let c2 denote the output carry of a 3-bit addition (a2a1a0) + (b2b1b0) +
c−1. Then from Table 2.1,
c2 = a2b2 + (a2 + b2)(a1b1 + (a1 + b1)(a0b0 + (a0 + b0)c−1))
= TH(a2, b2, a1, b1, a0, b0, c−1; 4, 4, 2, 2, 1, 1, 1; 8).
Example 8 Let G2:0 denote the function specifying if a carry is generated within
bit positions 0 through 2 in an addition. We show later that G2:0 = a2b2 + (a2 +
b2)(a1b1 + (a1 + b1)(a0b0)). From Table 2.2 one gets
G2:0 = TH(a2, b2, a1, b1, a0, b0; 4, 4, 2, 2, 1, 1; 8).
Example 9 Let T2:0 denote the function specifying if a carry is generated within
or propagated through bit positions 0 through 2 in an addition. We show later that
T2:0 = a2b2 + (a2 + b2)(a1b1 + (a1 + b1)(a0 + b0)). From Table 2.3 one can see that
T2:0 = TH(a2, b2, a1, b1, a0, b0; 4, 4, 2, 2, 1, 1; 7).
Table 2.3: Determining weights and threshold of function T2:0.
function                       weights                  threshold  property
                               a2  b2  a1  b1  a0  b0              used
T0:0 = a0 + b0                                 1   1    1          P1
T1:0 = a1b1 + (a1 + b1)T0:0            2   2   1   1    3          P4
T2:0 = a2b2 + (a2 + b2)T1:0    4   4   2   2   1   1    7          P4
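The row-by-row derivation in Tables 2.1 to 2.3 can be reproduced mechanically. The sketch below applies (P4) repeatedly, always picking the smallest admissible threshold $T'$; the function names are illustrative, and the weight lists are ordered from the least significant bit upward (the reverse of the table columns).

```python
from itertools import product

def th(xs, ws, t):
    """Threshold gate: 1 when the weighted sum reaches the threshold t."""
    return int(sum(x * w for x, w in zip(xs, ws)) >= t)

def extend_p4(weights, threshold):
    """Apply (P4) once using the smallest T' with T' >= 2T and T' > sum(w)."""
    new_t = max(2 * threshold, sum(weights) + 1)
    new_w = new_t - threshold
    return weights + [new_w, new_w], new_t

def build(base_weights, base_threshold, extra_bits):
    """Start from a base threshold function and add `extra_bits` bit positions."""
    w, t = list(base_weights), base_threshold
    for _ in range(extra_bits):
        w, t = extend_p4(w, t)
    return w, t

G_W, G_T = build([1, 1], 2, 2)       # G2:0, inputs a0, b0, a1, b1, a2, b2
T_W, T_T = build([1, 1], 1, 2)       # T2:0, same input order
C_W, C_T = build([1, 1, 1], 2, 2)    # c2, inputs c-1, a0, b0, a1, b1, a2, b2

def carry_out(a, b, c_in, bits=3):
    """Carry out of a `bits`-wide addition a + b + c_in."""
    return ((a + b + c_in) >> bits) & 1

def verify():
    """Compare the three threshold functions with ordinary 3-bit addition."""
    for a, b in product(range(8), repeat=2):
        xs = [v for i in range(3) for v in ((a >> i) & 1, (b >> i) & 1)]
        if th(xs, G_W, G_T) != carry_out(a, b, 0):
            return False                 # G2:0: carry generated, no carry-in
        if th(xs, T_W, T_T) != carry_out(a, b, 1):
            return False                 # T2:0: carry-out when carry-in is 1
        for c_in in (0, 1):
            if th([c_in] + xs, C_W, C_T) != carry_out(a, b, c_in):
                return False
    return True
```

Choosing the smallest $T'$ is a design choice of this sketch; (P4) allows any larger value that satisfies both inequalities.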
2.5 Differences in Design between CMOS and Nano
Although much research has been done on the development and characterization of
nanoscale devices, little attention has been paid to the impact of these devices on circuit
design. The differences between traditional CMOS design and the novel
nanotechnology design lie in two main aspects: the design methodology (Boolean versus
threshold) and the clocking (static versus dynamic).
2.5.1 Design Methodology
The difference between the two design methodologies is essentially rooted in the fact
that CMOS technology is ideal for implementing Boolean logic while nanotechnology
is better suited to threshold logic. Note that threshold functions allow one to
determine the Boolean value of the function by computing an arithmetic sum and
comparing it with the threshold T. Because this requires real-valued weights and an
arithmetic sum, CMOS logic is poorly suited to implementing the operation directly. However,
the nanotechnology MOBILE is ideally suited for such computation. Further, the
complexity of an n-variable threshold function is only O(n), and is independent of
the algebraic complexity or the number of minterms of the function.
It is easy to translate the implementation of some basic gates from CMOS technology
to nanotechnology. Table 2.4 shows the threshold representation of some
basic gates and compares their implementation complexity in the CMOS and Nano
technologies. Note that the complexity of a CMOS design refers to the number of
transistors, while that of an RTD design refers to the area of the circuit, which is governed by the sum of weights
and threshold. QCA complexity is measured by the number of QCA cells used in
the implementation. However, we focus on the relative costs of different gates, which
force different design strategies in these cases.
Table 2.4: Complexity comparison of CMOS and Nanotechnology implementations.
Logic      TH representation      Complexity   Complexity   Complexity
function   (weights; threshold)   in CMOS      with RTD     with QCA
NOT        −1; 0                  2            1            13
AND        1, 1; 2                8            4            5
NAND       −1, −1; −1             4            3
OR         1, 1; 1                8            3            5
NOR        −1, −1; 0              4            2
One can easily draw some conclusions from the table. Firstly, in CMOS technology,
NAND is one of the least expensive gates and most design tools convert circuits
to NAND gates. However, in RTD technology, OR is just as inexpensive as NAND,
and NOR is the least complex gate. One should therefore implement circuits using
NORs. In QCA technology, ANDs and ORs are the most economical. Secondly,
in CMOS technology, any function and its dual have the same complexity, while
in nanotechnology their costs differ. Thirdly, Boolean functions may have to be
expressed differently to be cost efficient in the two technologies.
This shows that due to the technology differences, substantially different design
strategies may have to be adopted to minimize costs. The design methods must
take into account the differences in the economics and scaling of gates. Further,
while CMOS can only use basic gates, nanotechnology can often directly implement
complex functions through single threshold gates.
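The relative RTD costs discussed above can be illustrated numerically. The sketch below assumes that "sum of weights and threshold" means the sum of the absolute values of the weights plus the absolute value of the threshold; this reading, the gate table `GATES` and the function names are assumptions made for illustration.

```python
from itertools import product

def th(xs, ws, t):
    """Threshold gate: 1 when the weighted sum reaches the threshold t."""
    return int(sum(x * w for x, w in zip(xs, ws)) >= t)

# (weights, threshold, reference Boolean function) for some basic gates.
GATES = {
    "NOT":  ((-1,), 0, lambda x: 1 - x[0]),
    "AND":  ((1, 1), 2, lambda x: x[0] & x[1]),
    "NAND": ((-1, -1), -1, lambda x: 1 - (x[0] & x[1])),
    "OR":   ((1, 1), 1, lambda x: x[0] | x[1]),
    "NOR":  ((-1, -1), 0, lambda x: 1 - (x[0] | x[1])),
}

def rtd_cost(weights, threshold):
    """Assumed area measure: sum of |weights| plus |threshold|."""
    return sum(abs(w) for w in weights) + abs(threshold)

def representation_is_correct(name):
    """Check the threshold form against the gate's truth table."""
    weights, threshold, ref = GATES[name]
    return all(th(x, weights, threshold) == ref(x)
               for x in product((0, 1), repeat=len(weights)))

COSTS = {name: rtd_cost(w, t) for name, (w, t, _) in GATES.items()}
```

Under this assumed measure NOR is cheaper than both OR and NAND, which matches the ordering described in the text.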
2.5.2 Clocking Schemes
Another critical difference between CMOS and the nanotechnologies is the nature
of the computation. While CMOS combinational logic is static (does not require
a clock), computation in nanotechnology is possible only with a clock. The delay
of a nanotechnology combinational circuit depends on its clocking scheme and
its depth. Even in a sequential circuit implementation, different clocking schemes
lead to quite different realizations in the two technologies.
One of the most distinctive properties of a MOBILE or QCA cell is that
its output is valid when the clock is high. Thus it is self-latching. This property
can often be exploited to achieve a nanopipeline by constructing a cascaded network
of MOBILEs or a network of QCA cells. In such architectures, clocking should be
designed such that the computation in any stage starts only after the previous stage
finishes. One possible solution is the four-phase overlapping clocking scheme
shown in Fig. 2.7 [33].
[Figure: waveforms of the clocks clk1 to clk4 over time, with period T; each clock cycles through the evaluate, hold, reset and wait phases.]
Figure 2.7: Four-phase clocking scheme for nanotechnology.
Each clock has four equal phases of period T/4. Clocks to successive stages are
delayed by one phase. During the evaluate phase, the output of a gate is computed.
The result is held valid through the hold phase while the subsequent stage is computing.
In the reset phase, the gate returns to the monostable mode of operation
and the data is erased. During the wait phase, the inputs of the gate are loaded
from the outputs of the gates in the previous pipeline stage [34]. As a result of
the four-phase clocking scheme of nanotechnology, the depth of a nanopipeline stage
is four MOBILEs or four QCA cells, as a single clock cycle activates four gates in
succession through its four clock phases.
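The phase relationship between neighbouring stages can be made concrete with a small discrete-time model. This is a simplified sketch, not a device simulation: time advances in steps of T/4, stage 0 is assumed to start in its evaluate phase at step 0, and the names `PHASES`, `phase` and `timeline` are illustrative.

```python
PHASES = ("evaluate", "hold", "reset", "wait")

def phase(stage, step):
    """Phase of pipeline stage `stage` at time step `step` (one step = T/4).

    Each successive stage is clocked one phase later than the previous one.
    """
    return PHASES[(step - stage) % 4]

def timeline(n_stages, n_steps):
    """Table of phases: one row per stage, one column per time step."""
    return [[phase(s, t) for t in range(n_steps)] for s in range(n_stages)]

def handoff_is_safe(n_stages, n_steps):
    """Whenever a stage evaluates, its predecessor must be holding its output."""
    return all(phase(s - 1, t) == "hold"
               for s in range(1, n_stages)
               for t in range(n_steps)
               if phase(s, t) == "evaluate")
```

In this model stage `s` first evaluates at step `s`, so four consecutive stages evaluate within one clock period T.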
This four-phase clocking scheme causes the design focus in nanotechnology to be
different from CMOS. In CMOS circuits, the depth of a circuit determines its critical
path and thereby the delay of the whole system. Therefore, in CMOS technology,
depth is one of the most critical factors determining the performance of a circuit.
On the other hand, in nanotechnology, as the whole circuit is automatically pipelined
due to the self-latching nanotechnology devices, the depth of the system no longer
controls its speed and thus is not an important design parameter. This is not to say
that the depth of a nanotechnology circuit is irrelevant to its performance. But if one
can trade-off hardware and time complexities, then designing for a lower complexity
at the cost of a small increase in depth is often preferred.
2.6 Design Restriction of Nanotechnology
It has been shown that the reliability of nanoelectronic gates is greatly dependent
on their fan-in [9–11]. Similarly, since the LUTs in an FPGA have a finite number
of inputs, they can only implement threshold gates with a limited fan-in. These
constraints imply that the intended designs can only use threshold gates with a
specified fan-in bound. However, unlike AND/OR gates which can be easily decom-
posed into similar gates with smaller fan-in, threshold gates with large fan-in cannot
be readily decomposed into threshold gates with a smaller fan-in. Thus the design
strategies for FPGA and nanoelectronic systems should consider implementation
with bounded fan-in threshold gates from the onset.
Threshold functions are often classified based on their implementations as follows.
The class of Boolean functions that can be implemented by single threshold
gates with unbounded fan-in is known as $LT_1$. In general, Boolean functions
computable by depth-$d$ networks of unbounded fan-in threshold gates belong to the class
$LT_d$. The size (number of threshold gates) of an $LT_d$ implementation is restricted
to a polynomial function of the number of inputs. Members of the class $LT_d$ which
have weights of polynomial order form a subclass denoted by $\widehat{LT}_d$. Clearly, functions
in $\widehat{LT}_d$ are more realistic from the implementation standpoint. The class of Boolean
functions which can be realized by constant-depth, polynomial-size networks of
threshold functions with ±1 weights is denoted by $TC^0$. Since any function in $\widehat{LT}_1$
can be converted to a threshold function with ±1 weights by duplicating inputs, it
follows that $\widehat{LT}_d \subseteq TC^0$ for any constant $d$ [35]. The only class that uses
gates with bounded fan-in is the class $NC^k$. This class consists of Boolean functions
that can be implemented as a polynomial-size, depth $O((\log n)^k)$ network of
bounded fan-in AND, OR and NOT gates. It is known that $TC^0 \subseteq NC^1$ [36, 37].
Thus any function in $\widehat{LT}_1$ can be implemented using a $(\log n)$-depth network of
three specific types of bounded fan-in threshold gates: AND, OR and NOT. However,
threshold functions are very powerful and a single threshold gate can often
replace a complex network of AND, OR and NOT gates. To exploit the power of
threshold functions, we provide an explicit decomposition of any member of $LT_d$ into
a network of arbitrary threshold gates with bounded fan-in in Chapter 4. Beyond
that, all our designs are composed of threshold gates that are fan-in bounded.
Chapter 3
Adder Architectures Using Nanotechnology
3.1 Introduction
Arithmetic logic such as addition, and addition-related operations like multiplication,
plays an important role in any computational architecture, including the omnipresent
electronic computer. With the development of nanotechnology implementations of
threshold logic, considerable effort has been devoted to building these arithmetic units using
threshold networks. Siu et al. have proved that the addition of two numbers belongs
to $\widehat{LT}_2$ [38]. Maciel has further improved this result by stating that the addition
function can be implemented within a majority depth of 0 and a total depth of
2 [37]. A depth-2, polynomial-size adder using only non-monotone
majority gates is presented in [39]. Unfortunately, all the adders mentioned above are built with
unbounded fan-in threshold gates. A signed digit adder with a restricted fan-in
is given by Cotofana [40]. This adder, which assumes a radix-2 signed digit
representation, uses O(n) threshold gates with O(1) weights and fan-in complexity
[40]. The adders we propose in this chapter use threshold gates with considerably
smaller weights and fan-in.
3.2 Conventional Adders using Threshold Logic
This section provides implementations of the Carry Propagation Adder (CPA) and
the Group Carry Lookahead Adder (GCLA) using threshold functions. All the adders
presented in this work compute the carry bits first and then obtain the sum bits
from them. Consider the addition of two $N$-bit numbers $a_{N-1}, a_{N-2}, \ldots, a_0$ and
$b_{N-1}, b_{N-2}, \ldots, b_0$. Let $c_i$ and $s_i$ denote the carry and sum at the $i$th bit position.
Clearly, $s_i = c_{i-1} \oplus a_i \oplus b_i$. Though implementing an ExOR ($\oplus$) in threshold gates
is complex, we can show that the sum bits can be obtained from the carries through
only one threshold gate stage as follows.
$s_i = c_{i-1} \oplus a_i \oplus b_i = \bar{c}_i(a_i + b_i + c_{i-1}) + a_i b_i c_{i-1}$. (3.1)
Expression (3.1) may be proved by considering the cases $c_{i-1} = 0$ and $c_{i-1} = 1$.
In the first case, $c_i = a_i b_i$ and $s_i = a_i \oplus b_i = \overline{a_i b_i}(a_i + b_i)$, which matches (3.1).
When $c_{i-1} = 1$, $c_i = a_i + b_i$ and $s_i = \overline{a_i \oplus b_i} = \bar{a}_i \bar{b}_i + a_i b_i$, thus validating (3.1) in
this case also. Further, because of Theorem 1 (P7), $s_i$ in (3.1) is a single threshold
function. In particular,
$s_i = TH(a_i, b_i, c_{i-1}, c_i; 1, 1, 1, -2; 1)$. (3.2)
Note that the sum bit $s_i$ can also be expressed as a majority function in which
the complemented carry $\bar{c}_i$ is applied to two of the inputs, as follows:
$s_i = MAJ(a_i, b_i, c_{i-1}, \bar{c}_i, \bar{c}_i)$. (3.3)
Since sum bits can be computed from the carry bits by one stage of threshold
gates, rest of this work will focus only on computing the carry bits. The delay and
complexity estimates given in the rest of the work represent those only to compute
the carries.
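Expressions (3.1) and (3.2) involve only three free input bits, so they can be checked exhaustively. The following minimal sketch does so; the function names are illustrative, and the threshold gate is assumed to output 1 when the weighted sum is at least the threshold.

```python
from itertools import product

def th(xs, ws, t):
    """Threshold gate: 1 when the weighted sum reaches the threshold t."""
    return int(sum(x * w for x, w in zip(xs, ws)) >= t)

def check_sum_bit():
    """Compare the true sum bit with (3.1) and with the threshold form (3.2)."""
    for a, b, c_in in product((0, 1), repeat=3):
        total = a + b + c_in
        s, c_out = total & 1, total >> 1          # reference sum and carry
        eq_3_1 = ((1 - c_out) & (a | b | c_in)) | (a & b & c_in)
        eq_3_2 = th((a, b, c_in, c_out), (1, 1, 1, -2), 1)
        if not (s == eq_3_1 == eq_3_2):
            return False
    return True
```

The check also shows why the weight −2 works: $a_i + b_i + c_{i-1} - 2c_i$ is exactly the value of the sum bit.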
The simplest adder, the Carry Propagation Adder (CPA), generates a carry $c_i$
from $c_{i-1}$ using the expression
$c_i = a_i b_i + (a_i + b_i)c_{i-1} = MAJ(a_i, b_i, c_{i-1})$.
Thus the $N$-bit CPA uses $N$ 3-input majority gates to compute all the carries. Its
delay is $N$ and its hardware complexity is $N$ gates and $3N$ inputs. Fig. 3.1 shows
the threshold implementation of a CPA.
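A bit-level model of the CPA makes the gate usage explicit. The sketch below is illustrative only: it uses one majority gate per carry and the four-input threshold gate of (3.2) per sum bit, and the function names and the default width of 4 bits are assumptions.

```python
def maj(x, y, z):
    """3-input majority gate, i.e. TH(x, y, z; 1, 1, 1; 2)."""
    return int(x + y + z >= 2)

def cpa_add(a, b, c_in=0, n=4):
    """Add two n-bit numbers with a chain of majority gates for the carries."""
    result, carry = 0, c_in
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        next_carry = maj(ai, bi, carry)                   # c_i from c_{i-1}
        # Sum bit as the threshold gate TH(a, b, c_{i-1}, c_i; 1, 1, 1, -2; 1).
        s = int(ai + bi + carry - 2 * next_carry >= 1)
        result |= s << i
        carry = next_carry
    return result | (carry << n)                          # carry-out on top
```

The loop runs once per bit position, which mirrors the linear delay of the carry chain.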
[Figure: a chain of MAJ gates computes the carries from the operand bits and c−1; threshold gates with weights 1, 1, 1 and −2 then compute the sum bits s0 through sN−1.]
Figure 3.1: Carry computation followed by the sum calculation in a Carry Propagation
Adder (CPA)
The O(N) delay of the CPA is not acceptable in many situations. In these applications,
one uses a Group Carry Look-Ahead Adder (GCLA) [41]. GCLA has an
O(log N) delay at the cost of a modest increase in hardware complexity. GCLA is
based upon two logic primitives defined over groups of operand bits: a carry generator
and a carry propagator. These primitives are first computed over single bit
positions. They can then be combined repeatedly to provide the primitives for
increasing group sizes. As we show below, they can be used to compute all the carries
in O(log N) time.
The carry generator and carry propagator primitives $g^1_i$ and $p^1_i$ are obtained from
the operand bits at the $i$-th position, $a_i$ and $b_i$, as
$g^1_i = a_i b_i$ and $p^1_i = a_i \oplus b_i$, $0 \le i < N$.
If $N$ is small, each carry $c_i$ can be directly obtained as:
$c_i = g^1_i + p^1_i g^1_{i-1} + p^1_i p^1_{i-1} g^1_{i-2} + \cdots + p^1_i p^1_{i-1} \cdots p^1_0 c_{-1}$. (3.4)
Each of these carry expressions can be computed concurrently and in constant time.
However, if the operand size N is larger, then one needs to compute the carry gen-
eration/propagation primitives over groups of consecutive bit positions. Generally
group size is chosen to be 4. One thus computes the second level of primitives g2
and p2 from g1 and p1 as:
$g^2_i = g^1_{4i+3} + p^1_{4i+3} g^1_{4i+2} + p^1_{4i+3} p^1_{4i+2} g^1_{4i+1} + p^1_{4i+3} p^1_{4i+2} p^1_{4i+1} g^1_{4i}$, $0 \le i < N/4$, (3.5)
and
$p^2_i = p^1_{4i+3} p^1_{4i+2} p^1_{4i+1} p^1_{4i}$, $0 \le i < N/4$. (3.6)
All the group carries $c_{4i+3}$, $0 \le i < N/4$, can then be computed concurrently from these $g^2$
and $p^2$ values as:
$c_{4i+3} = g^2_i + p^2_i g^2_{i-1} + p^2_i p^2_{i-1} g^2_{i-2} + \cdots + p^2_i p^2_{i-1} \cdots p^2_0 c_{-1}$. (3.7)
The intermediate carries for the $i$th group of bits are obtained concurrently from
the $g^1$ and $p^1$ for the bit positions in that group and $c_{4i-1}$ using an equation similar to
(3.4). If $N > 16$, the process is repeated by computing $g^3$ and $p^3$ for groups of 16
bit positions, computing carries that are 16 bits apart, computing carries that are
4 bits apart and computing the rest of the carries. Thus every time the adder size
$N$ increases 4 times, the time to compute carries increases by a constant amount,
giving the time complexity of GCLA as $O(\log N)$.
In this work, we also treat the group size, $k$, of GCLA as a design parameter that affects the
adder's performance. Now we give a nanotechnology implementation of GCLA.
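The two-level grouping described above can be sketched for a 16-bit adder with groups of 4. This is a software model only: the helper `lookahead` evaluates each carry expression in its nested form one position at a time, whereas the hardware evaluates the expanded expressions concurrently; all names and the restriction to $N \le k^2$ are assumptions of the sketch.

```python
def lookahead(g, p, c_in):
    """Carries out of each position, given per-position generate/propagate.

    Evaluates c = g + p * c_prev position by position, which has the same
    value as the expanded sum-of-products form of (3.4).
    """
    carries, c = [], c_in
    for gi, pi in zip(g, p):
        c = gi | (pi & c)
        carries.append(c)
    return carries

def gcla_carries(a, b, c_in=0, n=16, k=4):
    """All n carries of a + b + c_in using one level of grouping (n <= k*k)."""
    g1 = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p1 = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    # Group primitives: one generate/propagate pair per group of k bits.
    g2 = [lookahead(g1[s:s + k], p1[s:s + k], 0)[-1] for s in range(0, n, k)]
    p2 = [int(all(p1[s:s + k])) for s in range(0, n, k)]
    group_out = lookahead(g2, p2, c_in)            # carry out of each group
    carries = []
    for grp in range(n // k):
        into_group = c_in if grp == 0 else group_out[grp - 1]
        lo, hi = grp * k, (grp + 1) * k
        carries += lookahead(g1[lo:hi], p1[lo:hi], into_group)
    return carries

def reference_carries(a, b, c_in=0, n=16):
    """Carries obtained directly from integer addition, for comparison."""
    out = []
    for i in range(n):
        mask = (1 << (i + 1)) - 1
        out.append(((a & mask) + (b & mask) + c_in) >> (i + 1))
    return out
```

Larger adders would repeat the grouping step once more per factor of `k`, which is where the logarithmic delay comes from.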
Theorem 2 GCLA can be implemented using threshold gates with a depth of $O(\frac{\log N}{M})$
and a number of inputs of $O(NM)$, where $N = 2^{2n}$ is the length of the addition and
$M + 1 = 2^m + 1$ is the fan-in of the threshold gates.
Proof.
The depth of GCLA is determined by the largest expression among the propagators,
generators and carries in each grouping stage.
For the first stage, $p^1_i$ and $g^1_i$ are needed for every bit. As the XOR gate is not
a threshold function, it takes 2 levels to implement $p^1_i$, that is
$p^1_i = TH(TH(a_i, b_i; 1, -1; 1), a_i, b_i; 2, -1, 1; 1)$. (3.8)
In contrast,
$g^1_i = TH(a_i, b_i; 1, 1; 2)$ (3.9)
needs only 1 level. Thus it takes 2 levels to implement $p^1_i$ and $g^1_i$.
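The two-level threshold realization of the propagate signal in (3.8), and the single gate of (3.9), can be confirmed on the four possible input pairs. The sketch below is a direct transcription with illustrative function names.

```python
from itertools import product

def th(xs, ws, t):
    """Threshold gate: 1 when the weighted sum reaches the threshold t."""
    return int(sum(x * w for x, w in zip(xs, ws)) >= t)

def propagate(a, b):
    """Two-level threshold network of (3.8) for a XOR b."""
    inner = th((a, b), (1, -1), 1)                 # 1 only when a = 1, b = 0
    return th((inner, a, b), (2, -1, 1), 1)

def generate(a, b):
    """Single threshold gate of (3.9) for a AND b."""
    return th((a, b), (1, 1), 2)

TRUTH = {(a, b): (propagate(a, b), generate(a, b))
         for a, b in product((0, 1), repeat=2)}
```

The inner gate detects the case $a_i = 1$, $b_i = 0$; its weight of 2 in the outer gate compensates for the −1 weight on $a_i$.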
For $p^2_i$ and $g^2_i$, as the group size is now $k$ instead of 4, Eq. 3.5 and Eq. 3.6 can be
modified to
$p^2_i = p^1_{ki} p^1_{ki+1} \cdots p^1_{ki+k-1}$, (3.10)
$g^2_i = g^1_{ki+k-1} + p^1_{ki+k-1}(g^1_{ki+k-2} + p^1_{ki+k-2}(g^1_{ki+k-3} + \cdots p^1_{ki+1} g^1_{ki}))$. (3.11)
As both expressions are serial, both are threshold functions. As $g^2_i$ has $2k - 1$
literals, which is more than the $k$ literals of $p^2_i$, it determines that the number of levels needed for this grouping
stage is $\lceil \frac{2k-2}{M} \rceil$.
Similarly, $p^3_i$ and $g^3_i$ can be obtained from $p^2_i$ and $g^2_i$ with the same hardware.
As there are $\log_k N - 1$ such grouping stages in total, until the last stage has only 1
group of $k$ members, the number of levels needed to implement all the propagators and
generators is $2 + \lceil \frac{2k-2}{M} \rceil (\log_k N - 1)$.
For the carries, we can first obtain $c_{\frac{N}{k}i-1}$, where $i = 1, 2, \cdots, k$. From Eq. 3.7, we can
see that the carry with the most literals is $c_{N-1}$ (the case $i = k$), which has $2k + 1$ literals. Thus
this stage takes $\lceil \frac{2k}{M} \rceil$ levels.
With the above carries, we can then calculate the intermediate carries. Again
from Eq. 3.7, the most complex expression now has $2k - 1$ literals, indicating $\lceil \frac{2k-2}{M} \rceil$
levels, as there are only $k - 1$ carries to be calculated in each group. In total there are
$\log_k N - 1$ stages of carry calculation, so the number of levels for these carries
is $\lceil \frac{2k-2}{M} \rceil (\log_k N - 1)$.
Summing all of the above together with 1 level for the sum bits, the total depth of GCLA
is $2\lceil \frac{2k-2}{M} \rceil (\log_k N - 1) + \lceil \frac{2k}{M} \rceil + 3$, which is $O(\frac{\log_k N}{M})$.
We now discuss the number of inputs in GCLA.
Recalling Eq. 3.8 and 3.9 for the implementation of propagation and generation in
the first stage, it is clear that it takes 5 threshold gate inputs to build each propagator
and 2 to build each generator. As they are required for every bit position, this takes
$7N$ inputs in total.
For $p^2_i$ and $g^2_i$, from Eq. 3.10, we know that each $p^2_i$ has $k$ literals, meaning it
requires $\lceil \frac{k-1}{M} \rceil$ threshold gates to implement it. That results in $\lceil \frac{k-1}{M} \rceil - 1$ inputs in
addition to the original $k$ inputs. So each $p^2_i$ takes $k + \lceil \frac{k-1}{M} \rceil - 1$ inputs.
Similarly, from Eq. 3.11, each $g^2_i$ has $2k - 1$ literals, resulting in a number of
inputs of $2k + \lceil \frac{2k-2}{M} \rceil - 2$.
There are $\frac{N}{k}$ pairs $p^2_i$ and $g^2_i$, as each $p^2_i$ and $g^2_i$ carries the property
of $k$ bits from the previous stage.
Similarly, each $p^3_i$ and $g^3_i$ can be obtained from $p^2_i$ and $g^2_i$ with the same complexity,
and there are $\frac{N}{k^2}$ pairs of propagators and generators in this stage. As there
are $\log_k N - 1$ such grouping stages in total, until the last stage has only 1 group of $k$
members, the number of inputs in the hardware implementing all the propagators and
generators is
$(3k - 3 + \lceil \frac{k-1}{M} \rceil + \lceil \frac{2k-2}{M} \rceil)(\frac{N}{k} + \frac{N}{k^2} + \cdots + k^2 + k)$
$= \frac{N-k}{k-1}(3k - 3 + \lceil \frac{k-1}{M} \rceil + \lceil \frac{2k-2}{M} \rceil)$. (3.12)
Then we calculate the number of inputs in the circuit that computes the carries.
We discuss this in two cases.
• Case 1: $2k < M$.
We again start with $c_{\frac{N}{k}i-1}$, where $i = 1, 2, \cdots, k$. In this case, from Eq. 3.7, the
carry with the most literals, namely $c_{N-1}$, has $2k + 1$ variables, which is within the
fan-in bound, so it can be implemented with a single gate. The total number of inputs
is then the sum of the literals in the expressions of all $c_{\frac{N}{k}i-1}$, which is simply
$3 + 5 + 7 + \cdots + (2k + 1) = k(k + 2)$.
Then we can calculate the intermediate carries with gaps of $\frac{N}{k}$. Again from
Eq. 3.7, as the most complex expression now has $2k - 1$ literals, which is still
within the fan-in bound, each carry at this stage takes only one gate.
Then, similar to the previous stage, it takes $3 + 5 + 7 + \cdots + (2k - 1) = (k + 1)(k - 1)$
inputs for each group of $k - 1$ members, and there are $k$ such groups
in the stage. So the total number of inputs in this stage is $k(k + 1)(k - 1)$.
Similarly, the next stage calculates all the carries with gaps of $\frac{N}{k^2}$. It takes
the same number of inputs for each group of $k - 1$ members, but the number
of groups is $k^2$, so it takes $k^2(k + 1)(k - 1)$ inputs for this stage. In this way, as
there are $\log_k N - 1$ stages of carry calculation, the total number of inputs for
these carries is $(k + 1)(k - 1)(k + k^2 + \cdots + \frac{N}{k}) = (N - k)(k + 1)$.
Summing up the numbers of inputs discussed above, from generators and propagators
to carries, the total number of inputs in the case $2k < M$ is
$11N - 2k + Nk + \frac{N-k}{k-1}(\lceil \frac{k-1}{M} \rceil + \lceil \frac{2k-2}{M} \rceil)$.
• Case 2: $2k \ge M$.
In this case, we assume that both $k$ and $M$ are powers of 2.
In the first stage of carry calculation, the carries with more than $M + 1$
literals cannot be implemented with a single gate, so we use the results of the carries
with fewer literals to calculate the more complex ones. For example, if we have
a 64-bit adder with $k = 4$ and $M = 4$, then in this stage $c_{15}$, $c_{31}$, $c_{47}$ and $c_{63}$ are
to be calculated. From Eq. 3.7, we can write these carries as
$c_{15} = g^3_0 + p^3_0 c_{-1}$, (3.13)
$c_{31} = g^3_1 + p^3_1(g^3_0 + p^3_0 c_{-1})$, (3.14)
$c_{47} = g^3_2 + p^3_2(g^3_1 + p^3_1(g^3_0 + p^3_0 c_{-1}))$, (3.15)
$c_{63} = g^3_3 + p^3_3(g^3_2 + p^3_2(g^3_1 + p^3_1(g^3_0 + p^3_0 c_{-1})))$. (3.16)
Here $c_{47}$ and $c_{63}$ have 7 and 9 literals, respectively, which exceed the fan-in bound
of 5. We therefore rewrite $c_{47}$ and $c_{63}$ as
$c_{47} = g^3_2 + p^3_2 c_{31}$, (3.17)
$c_{63} = g^3_3 + p^3_3(g^3_2 + p^3_2 c_{31})$. (3.18)
Now they take the result $c_{31}$ as an input, which reduces the number of inputs
from 7 to 3 and from 9 to 5, respectively. Note that the depth of this stage does
not change with this modification. With this, the number of inputs for the carries $c_{\frac{64}{4}i-1}$
is $2(3 + 5)$.
We now generalize from the example. The fan-in bound limits the number of
carries implemented with one gate to $\frac{M}{2}$, and these carries have a total of
$3 + 5 + 7 + \cdots + (M + 1) = \frac{M(M+4)}{4}$ inputs. The next $\frac{M}{2}$ carries take
$c_{\frac{N}{k}\frac{M}{2}i-1}$ as an input and have the same number of inputs, $\frac{M(M+4)}{4}$. As
there are $\frac{2k}{M}$ such units in total, this stage of carries has $\frac{2k}{M} \cdot \frac{M(M+4)}{4}$
inputs altogether.
The intermediate carries are handled similarly, except that there are $k - 1$ members in
each group of each stage, so there are $M + 1$ fewer inputs in each unit described
above. The number of inputs for these intermediate carries
sums up to $\frac{N-k}{k-1}(\frac{k(M+4)}{2} - M - 1)$.
The total number of inputs therefore sums up to
$10N - 3k + \frac{k(M+4)}{2} + \frac{N-k}{k-1}(\lceil \frac{k-1}{M} \rceil + \lceil \frac{2k-2}{M} \rceil + \frac{k(M+4)}{2} - M - 1)$.
The discussion above gives a complexity of $O(NM)$ for the number of inputs.
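The two closed-form input counts can be tabulated for given parameters. The following sketch codes the totals for the cases $2k < M$ and $2k \ge M$ exactly as stated in the proof; it assumes $N$ is a power of $k$ (so that $(N-k)/(k-1)$ is an integer), and the function names are illustrative.

```python
def ceil_div(x, y):
    """Ceiling of x / y for positive integers."""
    return -(-x // y)

def gcla_inputs(N, k, M):
    """Total number of threshold-gate inputs of the GCLA carry network."""
    a = ceil_div(k - 1, M)            # extra gates for each group propagate
    b = ceil_div(2 * k - 2, M)        # extra gates for each group generate
    groups = (N - k) // (k - 1)       # k + k^2 + ... + N/k
    if 2 * k < M:
        # Every carry expression fits in a single gate.
        return 11 * N - 2 * k + N * k + groups * (a + b)
    # Carries are chained through earlier carries (k and M powers of 2).
    unit = k * (M + 4) // 2
    return 10 * N - 3 * k + unit + groups * (a + b + unit - M - 1)
```

For instance, `gcla_inputs(64, 4, 4)` falls in the second case and `gcla_inputs(16, 2, 8)` in the first.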
3.3 General Scheme for Building New Adders
As was seen in Section 3.2, $p_i$ is important for computing $c_i$. However, computing $p_i$
requires the use of ExOR gates, which are not threshold functions. We show in this
section that $c_i$ can be computed by a majority gate if one uses an alternate property
of the input bits. We call the adder based on this property the Low Depth Threshold
Adder (LDTA).
Let $g_i$ and $p_i$ denote the carry generation and propagation properties of the $i$-th
bit position as in the GCLA. Recall that $g_i = a_i b_i$ and $p_i = a_i \oplus b_i$, where $a_i$ and $b_i$
are the $i$-th bits of the two operands.
Define $t_i = g_i + p_i = a_i + b_i$. One can show that the carry $c_i$ of the adder can be
expressed as
$c_i = g_i + p_i c_{i-1} = g_i + t_i c_{i-1}$. (3.19)
However, since $g_i = g_i t_i$ and $t_i = g_i + t_i$, one can also express (3.19) using Theorem
1 (P3) as
$c_i = g_i t_i + g_i c_{i-1} + t_i c_{i-1} = TH(g_i, t_i, c_{i-1}; 1, 1, 1; 2) = MAJ(g_i, t_i, c_{i-1})$. (3.20)
Thus the carry at position $i$ can be obtained from $c_{i-1}$ using a three-input majority
function. In order to relate carry $c_i$ to some other previous carry $c_j$, we generalize $g_i$ and
$t_i$ to multiple bits.
Let $G_{i:j}$, $i \ge j$, denote the proposition that carry $c_i$ is generated in the bit
positions $j$ through $i$. Similarly, let $P_{i:j}$ denote the proposition that this group of
bits propagates a carry. As before, define $T_{i:j} = G_{i:j} + P_{i:j}$. Clearly, $G_{i:j} = G_{i:j}T_{i:j}$
and $T_{i:j} = G_{i:j} + T_{i:j}$. Thus, similar to (3.19) and (3.20), we have
$c_i = G_{i:j} + P_{i:j}c_{j-1} = G_{i:j} + T_{i:j}c_{j-1} = MAJ(G_{i:j}, T_{i:j}, c_{j-1})$. (3.21)
We now show that $G_{i:j}$ and $T_{i:j}$, which characterize bit positions $j$ through $i$, can
be expressed in terms of characterizations of smaller groups of bits. In particular,
for any $j < k \le i$, from the definition of $G_{i:j}$,
$G_{i:j} = G_{i:k} + G_{k-1:j}P_{i:k}$
$= G_{i:k} + G_{k-1:j}(P_{i:k} + G_{i:k})$
$= G_{i:k} + T_{i:k}G_{k-1:j}$. (3.22)
Further, because of the relationship between $G_{i:k}$ and $T_{i:k}$, this can be expressed as:
$G_{i:j} = MAJ(G_{i:k}, T_{i:k}, G_{k-1:j})$. (3.23)
Similarly,
$T_{i:j} = G_{i:j} + P_{i:j}$
$= G_{i:k} + G_{k-1:j}P_{i:k} + P_{i:k}P_{k-1:j}$
$= G_{i:k} + (G_{i:k} + P_{i:k})(G_{k-1:j} + P_{k-1:j})$
$= G_{i:k} + T_{i:k}T_{k-1:j}$. (3.24)
Again, using the relationship between $G_{i:k}$ and $T_{i:k}$, one can express this as:
$T_{i:j} = MAJ(G_{i:k}, T_{i:k}, T_{k-1:j})$. (3.25)
Equations (3.23) and (3.25) show that the G and T properties of a group of
bits can be obtained from the properties of its smaller partitions. Further, the
computation employs only 3-input majority gates.
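The combination rules (3.21), (3.23) and (3.25) can be tested exhaustively on short operands. The sketch below models $G_{i:j}$ as the carry out of bit positions $j$ through $i$ with no incoming carry and $T_{i:j}$ as the carry out of the same positions with an incoming carry of 1, which follows from $T_{i:j} = G_{i:j} + P_{i:j}$; this modeling choice and all names are assumptions of the sketch.

```python
def maj(x, y, z):
    """3-input majority gate."""
    return int(x + y + z >= 2)

def carry_out(a, b, i, j, c_in):
    """Carry out of bit positions j..i of a + b with carry c_in into bit j."""
    width = i - j + 1
    mask = (1 << width) - 1
    return (((a >> j) & mask) + ((b >> j) & mask) + c_in) >> width

def G(a, b, i, j):
    """Carry generated within positions j..i."""
    return carry_out(a, b, i, j, 0)

def T(a, b, i, j):
    """Carry generated within, or propagated through, positions j..i."""
    return carry_out(a, b, i, j, 1)

def check(n=4):
    """Verify (3.21), (3.23) and (3.25) for all n-bit operand pairs."""
    for a in range(1 << n):
        for b in range(1 << n):
            for j in range(n):
                for i in range(j, n):
                    for k in range(j + 1, i + 1):
                        g_hi, t_hi = G(a, b, i, k), T(a, b, i, k)
                        if G(a, b, i, j) != maj(g_hi, t_hi, G(a, b, k - 1, j)):
                            return False
                        if T(a, b, i, j) != maj(g_hi, t_hi, T(a, b, k - 1, j)):
                            return False
                    for c_in in (0, 1):
                        prev = c_in if j == 0 else carry_out(a, b, j - 1, 0, c_in)
                        c_i = carry_out(a, b, i, 0, c_in)
                        if c_i != maj(G(a, b, i, j), T(a, b, i, j), prev):
                            return False
    return True
```

The majority form works because $G_{i:k}$ can only be 1 when $T_{i:k}$ is also 1, so the three inputs never disagree in a way that would break the OR-AND structure of (3.22) and (3.24).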
Quantities $G_{i:j}$ and $T_{i:j}$ for smaller sets of bits can be determined directly
from the input bits as follows. A carry is generated in bit positions $j$ through $i$ only if
it is generated in one of these bits and is propagated through the higher bits. Thus,
$G_{i:j} = g_i + p_i(g_{i-1} + p_{i-1}(g_{i-2} + \cdots p_{j+1}(g_j) \cdots))$. (3.26)
Using the fact that $g_k + p_kX = g_k + t_kX$ for any $k$ and $X$, one can also write (3.26)
as:
$G_{i:j} = g_i + t_i(g_{i-1} + t_{i-1}(\cdots t_{j+1}(g_j) \cdots))$
$= a_ib_i + (a_i + b_i)(a_{i-1}b_{i-1} + (a_{i-1} + b_{i-1})(\cdots (a_{j+1} + b_{j+1})(a_jb_j) \cdots))$ (3.27)
$= TH(a_j, b_j, a_{j+1}, b_{j+1}, \ldots, a_i, b_i; 1, 1, 2, 2, \ldots, 2^{i-j}, 2^{i-j}; 2^{i-j+1})$. (3.28)
The last step in (3.28) is obtained as in the examples following Theorem 1.
The expression for $T_{i:j}$ can be obtained similarly using (3.26) as follows.
$T_{i:j} = G_{i:j} + P_{i:j} = G_{i:j} + p_ip_{i-1} \cdots p_j$
$= g_i + p_i(g_{i-1} + p_{i-1}(g_{i-2} + \cdots p_{j+1}(g_j + p_j) \cdots))$.
This expression can be simplified as before to:
$T_{i:j} = g_i + t_i(g_{i-1} + t_{i-1}(\cdots t_{j+1}(g_j + t_j) \cdots))$
$= a_ib_i + (a_i + b_i)(a_{i-1}b_{i-1} + (a_{i-1} + b_{i-1})$ (3.29)
$(\cdots (a_{j+1}b_{j+1} + (a_{j+1} + b_{j+1})(a_j + b_j)) \cdots))$ (3.30)
$= TH(a_j, b_j, a_{j+1}, b_{j+1}, \ldots, a_i, b_i; 1, 1, 2, 2, \ldots, 2^{i-j}, 2^{i-j}; 2^{i-j+1} - 1)$. (3.31)
Note that, as before, the last step of (3.31) is similar to the examples following
Theorem 1.
Similar to the computations (3.28) and (3.31) of $G_{i:j}$ and $T_{i:j}$ for small input
partitions, one can also compute the initial few carries directly from the input bits
using threshold functions. Recall from (3.21) that the carry $c_i$ can be obtained from
$c_{-1}$ as
$c_i = G_{i:0} + c_{-1}T_{i:0}$
$= a_ib_i + (a_i + b_i)(a_{i-1}b_{i-1} + (a_{i-1} + b_{i-1})(\cdots a_1b_1 + (a_1 + b_1)$ (3.32)
$(a_0b_0 + c_{-1}(a_0 + b_0)) \cdots))$ (3.33)
$= TH(c_{-1}, a_0, b_0, a_1, b_1, \ldots, a_i, b_i; 1, 1, 1, 2, 2, \ldots, 2^i, 2^i; 2^{i+1})$. (3.34)
The derivation of (3.34) is based on (3.28) and (3.31) and uses Theorem 1 (P3) and (P4) to
convert the computation into a threshold function, as in Example 4.
One can also obtain $c_i$ from $c_{j-1}$, $j \le i$, in a similar fashion.
$c_i = G_{i:j} + c_{j-1}T_{i:j}$ (3.35)
$= MAJ(G_{i:j}, T_{i:j}, c_{j-1})$ (3.36)
$= TH(c_{j-1}, a_j, b_j, a_{j+1}, b_{j+1}, \ldots, a_i, b_i; 1, 1, 1, 2, 2, \ldots, 2^{i-j}, 2^{i-j}; 2^{i-j+1})$. (3.37)
For convenience, we will often refer to $G_{i:j}$ and $T_{i:j}$ as the G and T functions over
the (bit index) range $[i : j]$.
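The threshold forms (3.28), (3.31) and (3.34) can be checked against ordinary addition. In the sketch below, $G$ and $T$ over a range are again modeled as the carry out of that range with an incoming carry of 0 and 1 respectively; this modeling choice and the helper names are assumptions made for illustration.

```python
def th(xs, ws, t):
    """Threshold gate: 1 when the weighted sum reaches the threshold t."""
    return int(sum(x * w for x, w in zip(xs, ws)) >= t)

def carry_out(a, b, i, j, c_in):
    """Carry out of bit positions j..i of a + b with carry c_in into bit j."""
    width = i - j + 1
    mask = (1 << width) - 1
    return (((a >> j) & mask) + ((b >> j) & mask) + c_in) >> width

def interleave(a, b, i, j):
    """Inputs ordered as a_j, b_j, a_{j+1}, b_{j+1}, ..., a_i, b_i."""
    return [v for pos in range(j, i + 1)
            for v in ((a >> pos) & 1, (b >> pos) & 1)]

def pair_weights(i, j):
    """Weights 1, 1, 2, 2, ..., 2^(i-j), 2^(i-j)."""
    return [2 ** k for k in range(i - j + 1) for _ in range(2)]

def check(n=4):
    for a in range(1 << n):
        for b in range(1 << n):
            for j in range(n):
                for i in range(j, n):
                    xs, w = interleave(a, b, i, j), pair_weights(i, j)
                    top = 2 ** (i - j + 1)
                    if th(xs, w, top) != carry_out(a, b, i, j, 0):      # (3.28)
                        return False
                    if th(xs, w, top - 1) != carry_out(a, b, i, j, 1):  # (3.31)
                        return False
            for c_in in (0, 1):
                for i in range(n):
                    xs = [c_in] + interleave(a, b, i, 0)
                    w = [1] + pair_weights(i, 0)
                    if th(xs, w, 2 ** (i + 1)) != carry_out(a, b, i, 0, c_in):
                        return False                                   # (3.34)
    return True
```

With these weights the weighted sum equals the arithmetic sum of the two operand fields (plus the incoming carry), so each threshold simply tests whether that sum overflows the field.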
3.4 A Low Depth Threshold Adder (LDTA)
One can use the general scheme described above to generate numerous
types of adders by exploiting the properties of $G_{i:j}$ and $T_{i:j}$. We present here
one strategy for designing an $N$-bit Low Depth Threshold Adder using threshold functions with
a fan-in bound of $M + 1$. For convenience, assume $M = 2^m$
and $N = 2^n$.
Design of the Low Delay Threshold Adder (LDTA).
1. Obtain $c_i$, $0 \le i < M/2$, using (3.34).
2. Obtain G and T over $[jM/2 + k : jM/2]$, $0 \le k < M/2$, $1 \le j < 2N/M$, using
(3.28) and (3.31).
3. For each level $i$, $0 \le i < n - m$, and for $1 \le j < 2^{n-m-i}$ and
$2^{i+m-1} \le k < 2^{i+m}$, obtain G and T over $[j2^{i+m} + k : j2^{i+m}]$ from G and T
over the ranges $[j2^{i+m} + k : j2^{i+m} + 2^{i+m-1}]$ and
$[j2^{i+m} + 2^{i+m-1} - 1 : j2^{i+m}]$ as in (3.23) and (3.25).
4. For each level $i$, $0 \le i \le n - m$, obtain carries $c_{(M/2)2^i + k}$,
$0 \le k < (M/2)2^i$, from carry $c_{(M/2)2^i - 1}$ and the appropriate G and T using (3.36).
5. Obtain the sum bits from the carries using $s_i = TH(a_i, b_i, c_{i-1}, c_i;\ 1, 1, 1, -2;\ 1)$.
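The five steps above can be sketched as a bit-level simulation. This is an illustrative
sketch under stated assumptions rather than the author's implementation: the names
`th`, `maj`, `direct_gt` and `ldta_add` are hypothetical, the parameter `fan` stands for
M, and the range merging assumes the forms $G_{i:j} = MAJ(G_{i:k}, T_{i:k}, G_{k-1:j})$ and
$T_{i:j} = MAJ(G_{i:k}, T_{i:k}, T_{k-1:j})$ for (3.23) and (3.25).

```python
import random

def th(inputs, weights, threshold):
    """Threshold gate: 1 iff the weighted input sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def maj(x, y, z):
    """3-input majority gate, written as a threshold gate."""
    return th((x, y, z), (1, 1, 1), 2)

def direct_gt(a, b, hi, lo):
    """G and T over [hi:lo] straight from operand bits, as in (3.28) and (3.31)."""
    inputs, weights = [], []
    for p in range(lo, hi + 1):
        inputs += [a[p], b[p]]
        weights += [1 << (p - lo)] * 2
    top = 1 << (hi - lo + 1)
    return th(inputs, weights, top), th(inputs, weights, top - 1)

def ldta_add(x, y, c_in, n_bits, fan):
    """Add two n_bits-wide integers; fan is M, so gates have fan-in at most M + 1."""
    a = [(x >> p) & 1 for p in range(n_bits)]
    b = [(y >> p) & 1 for p in range(n_bits)]
    half = fan // 2
    levels = (n_bits // fan).bit_length() - 1          # n - m
    carry, g, t = {-1: c_in}, {}, {}
    # Step 1: the first M/2 carries directly from the inputs, as in (3.34).
    for i in range(half):
        inputs = [c_in] + [bit for p in range(i + 1) for bit in (a[p], b[p])]
        weights = [1] + [1 << p for p in range(i + 1) for _ in range(2)]
        carry[i] = th(inputs, weights, 1 << (i + 1))
    # Step 2: G and T over [j*M/2 + k : j*M/2] directly from the inputs.
    for j in range(1, 2 * n_bits // fan):
        lo = j * half
        for k in range(half):
            g[lo + k, lo], t[lo + k, lo] = direct_gt(a, b, lo + k, lo)
    # Step 3: extend ranges level by level with majority gates.
    for i in range(levels):
        span = fan << i                                # 2^(i+m)
        for j in range(1, n_bits // span):
            lo, mid = j * span, j * span + span // 2
            for hi in range(mid, lo + span):
                g[hi, lo] = maj(g[hi, mid], t[hi, mid], g[mid - 1, lo])
                t[hi, lo] = maj(g[hi, mid], t[hi, mid], t[mid - 1, lo])
    # Step 4: carries from an earlier carry and the matching G and T, as in (3.36).
    for i in range(levels + 1):
        lo = half << i
        for hi in range(lo, 2 * lo):
            carry[hi] = maj(g[hi, lo], t[hi, lo], carry[lo - 1])
    # Step 5: sum bits from the carries.
    total = 0
    for p in range(n_bits):
        total |= th((a[p], b[p], carry[p - 1], carry[p]), (1, 1, 1, -2), 1) << p
    return total, carry[n_bits - 1]
```

With `n_bits = 8` and `fan = 4` the simulated circuit corresponds to the 8-bit,
fan-in-5 configuration of Fig. 3.2. The dictionaries `g` and `t` are keyed by the range
`(hi, lo)`, so a missing key would indicate a range that the procedure fails to provide.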
Fig. 3.2 shows an 8-bit LDTA using threshold gates with a fan-in bounded by 5.
The threshold gates in this architecture are defined as:
$A_0 : TH(a_{2i}, b_{2i};\ 1, 1;\ 2)$,
$B_0 : TH(a_{2i}, b_{2i};\ 1, 1;\ 1)$,
$A_1 : TH(a_{2i}, b_{2i}, a_{2i+1}, b_{2i+1};\ 1, 1, 2, 2;\ 4)$,
$B_1 : TH(a_{2i}, b_{2i}, a_{2i+1}, b_{2i+1};\ 1, 1, 2, 2;\ 3)$,
$C_0 : TH(a_0, b_0, c_{-1};\ 1, 1, 1;\ 2)$,
$C_1 : TH(a_0, b_0, a_1, b_1, c_{-1};\ 1, 1, 2, 2, 1;\ 4)$,
where $i = 1, 2, 3$.
Note that functions $C_0$ and $C_1$ in Fig. 3.2 correspond to step 1 of the
procedure explained above. Functions $A_0$ and $A_1$ compute the G over the ranges
[2i : 2i] and [2i + 1 : 2i] respectively as in step 2 of the procedure. Functions $B_0$
and $B_1$ compute the T over the same ranges. The four majority gates on the left in
the second row of Fig. 3.2 compute the G and T functions by combining G and T
functions in the first row as described in step 3 of the procedure. (In this example,
the only value level i assumes is 0.) Finally, the remaining majority gates in the
figure are used to compute the carries as in step 4 of the procedure.

Figure 3.2: Computing all the carries of an 8-bit LDTA with a fan-in bound of 5.

The LDTA procedure described above gives the following Theorem.
Theorem 3 All the carries of an N-bit Low Depth Threshold Adder (LDTA) can be
obtained with a circuit of depth $\log_2(4N/M)$ using threshold gates with fan-in bounded
by $M + 1$. In all levels, except the first, this circuit uses 3-input majority gates.
The hardware complexity of this design is $N(M + 3\log_2(N/M) + 2) + 3M/2 - M^2/4$
inputs.
Proof. Let $N = 2^n$ and $M = 2^m$. Steps 1 and 2 of the LDTA design procedure
follow directly from (3.28), (3.31) and (3.34). The threshold functions used in these
steps clearly satisfy the fan-in bound.
We now show that the G and T required in step 3 for any particular calculation
for a given i, j and k are available either from step 2 or from a smaller i in step
3. For convenience, we denote the range $[j2^{i+m} + k : j2^{i+m}]$ in step 3 by $R(i, j, k)$.
Similarly, the range $[jM/2 + k : jM/2]$ used in step 2 is denoted by $R'(j, k)$. To
compute G and T over $R(i, j, k)$, one needs T and G over two smaller ranges, namely,
$[j2^{i+m} + k : j2^{i+m} + 2^{i+m-1}]$ and $[j2^{i+m} + 2^{i+m-1} - 1 : j2^{i+m}]$. One can verify
that the second of these ranges is $R(i-1, 2j, 2^{i+m-1} - 1)$. (For $i = 0$, this range is
$R'(2j, 2^{i+m-1} - 1)$ in step 2.) Since Gs and Ts for $i-1$ are computed before those
for $i$, Gs and Ts over $R(i-1, 2j, 2^{i+m-1} - 1)$ are available for the computation of Gs
and Ts over $R(i, j, k)$.

Adder   Delay           Max fan-in   Number of gates
CPA     N               4            2N
GCLA    log N + 2       9            (4/3)(N − 1) + 3N
CLA     log N           (N/4) + 1    (5/2)N
LDTA    2 + log(N/M)    2M + 1       O(N log(N/M))
Table 3.1: Comparison of Carry Lookahead Adder, CPA, GCLA and our LDTA.

To show that the Gs and Ts over the other smaller range,
$[j2^{i+m} + k : j2^{i+m} + 2^{i+m-1}]$, are also available for computing Gs and Ts over
$R(i, j, k)$, we consider cases based upon the value of k. When
$2^i M/2 \le k < (2^i + 1)M/2$, Gs and Ts over the required range are available from step 2,
where they are computed over the range $R'((2j+1)2^i, k - 2^{i+m-1})$. When $i = 0$, this
covers the entire k range. For other i values, when
$(2^i + 2^t)M/2 \le k < (2^i + 2^{t+1})M/2$, $0 \le t < i$, the required range
is $R(t, (2j+1)2^{i-t-1}, k - 2^{i+m-1})$. Thus the required G and T over that range are
available for computing Gs and Ts over $R(i, j, k)$.
Finally, carry computation in step 4 requires Gs and Ts over the range
$[(M/2)2^i + k : (M/2)2^i]$. It is easy to show that when $i = 0$, these are computed in
step 2 over the range $R'(1, k)$. When $i > 0$, they are computed in step 2 over the range
$R'(2^i, k)$ if $k < M/2$, and otherwise in step 3 over the range $R(t, 2^{i-1-t}, k)$ for the
level $t < i$ satisfying $2^{t+m-1} \le k < 2^{t+m}$.
The complexity (number of inputs to all gates) of the carry calculation circuit
of LDTA can be obtained easily from the complexity of gates in each step in the
procedure. Gates used in steps 1 through 4 require $M(M+4)/4$, $(2N-M)(M+2)/2$,
$3(N(n-m-1)+M)$ and $3(N-M/2)$ inputs respectively. Adding these, one
gets the complexity of the N-bit LDTA carry computation as stated in the theorem.
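The per-step input counts can be cross-checked against the closed form of Theorem 3 by
direct enumeration. The sketch below is illustrative only: the function names are
assumptions, `fan` stands for M, and each gate is counted by the number of signals it
reads (a direct G or T gate over k + 1 bit positions reads 2(k + 1) inputs, a majority
gate reads 3).

```python
def ldta_gate_inputs(n_bits, fan):
    """Count gate inputs used by steps 1 to 4 of the LDTA carry circuit."""
    half = fan // 2
    levels = (n_bits // fan).bit_length() - 1          # n - m
    # Step 1: carry c_i reads c_{-1} and 2(i + 1) operand bits.
    step1 = sum(2 * (i + 1) + 1 for i in range(half))
    # Step 2: one G gate and one T gate per range, each reading 2(k + 1) bits.
    step2 = sum(2 * 2 * (k + 1)
                for j in range(1, 2 * n_bits // fan) for k in range(half))
    # Step 3: two 3-input majority gates (G and T) per merged range.
    step3 = sum(2 * 3
                for i in range(levels)
                for j in range(1, n_bits // (fan << i))
                for k in range((fan << i) // 2))
    # Step 4: one 3-input majority gate per remaining carry.
    step4 = sum(3 for i in range(levels + 1) for k in range(half << i))
    return step1, step2, step3, step4

def theorem3_inputs(n_bits, fan):
    """Closed form N(M + 3 log2(N/M) + 2) + 3M/2 - M^2/4 from Theorem 3."""
    log_ratio = (n_bits // fan).bit_length() - 1
    return n_bits * (fan + 3 * log_ratio + 2) + 3 * fan // 2 - fan * fan // 4
```

For the 8-bit, M = 4 case the four steps contribute 8, 36, 12 and 18 inputs, which add
up to the 74 inputs given by the closed form.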
The new adder is compared with others available in the literature in Table 3.1.
One can see that the new LDTA is the only adder that allows a tradeoff among
the fan-in bound, delay and the complexity. The new adder can achieve a delay
better than log N. Moreover, it uses only majority gates except for the first level.
Another adder design, which uses only majority gates including the first level, can
be easily derived from the design procedure of LDTA. Also, as mentioned before,
the sum bits can be implemented with 5-input majority gates. As a result, the whole
adder is composed of majority gates only. This makes the design well suited to
Quantum-dot Cellular Automata (QCA), which implements only majority gates
and inverters. The design procedure and analysis of the adder is provided in Chapter
6 along with other applications that are developed for QCA.
3.5 A Low Complexity Threshold Adder (LCTA)
One drawback of the LDTA presented in Section 3.4 is its O(N log(N/M))
hardware complexity. The principal reason for this can be traced to the fact that
in steps 1 and 2 of its design procedure, one applies many of the same inputs to
multiple threshold gates. This clearly increases the input complexity of the circuit.
One can combine the inputs in various threshold gates without overlap to reduce this
complexity. We refer to the resultant adder as the Low Complexity Threshold Adder
(LCTA). Design of an N-bit LCTA using threshold functions with fan-in bound of
$M + 1$ can be described as follows. For convenience, assume $M = 2^m$ and $N = 2^n$.
Design of the Low Complexity Threshold Adder (LCTA).
1. Obtain G and T over the range $[(j+1)M/2 - 1 : jM/2]$, $1 \le j < 2N/M$, using
(3.28) and (3.31).
2. For each level i, $0 \le i < n - m$, obtain G and T over the range
$[(j+1)2^{i+m} - 1 : j2^{i+m}]$, $1 \le j < 2^{n-m-i}$, from G and T over the ranges
$[(j+1)2^{i+m} - 1 : (2j+1)2^{i+m-1}]$ and $[(2j+1)2^{i+m-1} - 1 : j2^{i+m}]$ using majority
gates as in (3.23) and (3.25).
3. Compute carries $c_{2^i - 1}$, $m \le i \le n$, from $c_{2^{i-1} - 1}$ using appropriate T and G
as in (3.36).
4. For each level i, $0 \le i < n - m - 1$, obtain carries $c_{(4j+2)2^{i+m-1} - 1}$,
$1 \le j < 2^{n-m-i-1}$, from $c_{(4j+1)2^{i+m-1} - 1}$ using appropriate T and G as in (3.36).
5. Compute carries $c_{Mi + M/2 - 1}$, $0 \le i < N/M$, from $c_{Mi - 1}$ using appropriate input
bits as in (3.37).
6. For each i, $0 \le i < 2N/M$, compute carries $c_{(M/2)i - 1 + j}$, $1 \le j < M/2$, from
$c_{(M/2)i - 1}$ using appropriate input bits as in (3.37). When M = 4, these carries
can be computed through three input threshold gates.
7. Obtain all the sum bits from the carries using (3.2) or (3.3).
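The carry part of this procedure (steps 1 through 6) can be sketched as a simulation.
This is a hedged illustration, not the author's code: all function names are
hypothetical, `fan` stands for M, and steps 3 to 5 are handled together by writing each
block-boundary carry as $c_{v(M/2) - 1}$ with $v = q2^p$, q odd, and taking the predecessor
$v - 2^{p-1}$ (or $v - 1$ when $p = 0$), which is an assumed restatement of the three steps.

```python
import random

def th(inputs, weights, threshold):
    """Threshold gate: 1 iff the weighted input sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def maj(x, y, z):
    return th((x, y, z), (1, 1, 1), 2)

def direct_gt(a, b, hi, lo):
    """G and T over [hi:lo] straight from operand bits, as in (3.28) and (3.31)."""
    inputs, weights = [], []
    for p in range(lo, hi + 1):
        inputs += [a[p], b[p]]
        weights += [1 << (p - lo)] * 2
    top = 1 << (hi - lo + 1)
    return th(inputs, weights, top), th(inputs, weights, top - 1)

def carry_from_bits(a, b, c_prev, lo, hi):
    """Carry c_hi from c_{lo-1} and operand bits lo..hi in one gate, as in (3.37)."""
    inputs, weights = [c_prev], [1]
    for p in range(lo, hi + 1):
        inputs += [a[p], b[p]]
        weights += [1 << (p - lo)] * 2
    return th(inputs, weights, 1 << (hi - lo + 1))

def lcta_carries(x, y, c_in, n_bits, fan):
    """Return the list [c_0, ..., c_{N-1}] computed by steps 1 to 6."""
    a = [(x >> p) & 1 for p in range(n_bits)]
    b = [(y >> p) & 1 for p in range(n_bits)]
    half = fan // 2
    levels = (n_bits // fan).bit_length() - 1          # n - m
    g, t = {}, {}
    # Step 1: G and T over every M/2-bit block except the lowest one.
    for j in range(1, 2 * n_bits // fan):
        lo = j * half
        g[lo + half - 1, lo], t[lo + half - 1, lo] = direct_gt(a, b, lo + half - 1, lo)
    # Step 2: merge two adjacent equal-size blocks per level with majority gates.
    for i in range(levels):
        span = fan << i
        for j in range(1, n_bits // span):
            lo, mid, hi = j * span, j * span + span // 2, (j + 1) * span - 1
            g[hi, lo] = maj(g[hi, mid], t[hi, mid], g[mid - 1, lo])
            t[hi, lo] = maj(g[hi, mid], t[hi, mid], t[mid - 1, lo])
    carry = {-1: c_in}
    # Steps 3 to 5: block-boundary carries c_{v*M/2 - 1}, v = 1 .. 2N/M.
    for v in range(1, 2 * n_bits // fan + 1):
        hi = v * half - 1
        if v % 2:                                      # step 5: from operand bits
            carry[hi] = carry_from_bits(a, b, carry[hi - half], hi - half + 1, hi)
        else:                                          # steps 3 and 4: from G and T
            lo = (v - (v & -v) // 2) * half
            carry[hi] = maj(g[hi, lo], t[hi, lo], carry[lo - 1])
    # Step 6: the remaining carries inside each M/2-bit block.
    for blk in range(2 * n_bits // fan):
        base = blk * half
        for j in range(1, half):
            carry[base - 1 + j] = carry_from_bits(a, b, carry[base - 1], base, base - 1 + j)
    return [carry[p] for p in range(n_bits)]
```

With `n_bits = 16` and `fan = 4`, the loop over `v` produces $c_1, c_3, c_5, \ldots, c_{15}$
and the final loop produces the even-indexed carries, which is the split described for
Fig. 3.3.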
Fig. 3.3 shows a carry computation circuit of a 16-bit LCTA with a fan-in bound
of 5, i.e., M = 4.

Figure 3.3: Carry computation in a 16-bit LCTA with fan-in bound of 5.

Threshold functions A and B in the figure compute G and T directly from the
inputs as stated in step 1 of the procedure. The three pairs of majority functions
in the following level and one pair in the next level correspond to the G and T
computation described in step 2 of the procedure. Computation of carries $c_3$, $c_7$ and
$c_{15}$ corresponds to step 3 of the design. Carry $c_{11}$ is obtained as per step 4 of the
design. Threshold functions C in the figure compute carries $c_1$, $c_5$, $c_9$ and $c_{13}$ as in
step 5. Finally, step 6 provides the remaining carries, namely, $c_2$, $c_4$, $c_6$, $c_8$, $c_{10}$,
$c_{12}$ and $c_{14}$.
The procedure described above gives the following Theorem.
Theorem 4 All the carries of an N-bit Low Complexity Threshold Adder (LCTA)
can be obtained with a circuit of depth $1 + \log_2(2N/M)\log_2(4N/M)/2$ using threshold
gates with fan-in bounded by $M + 1$. In all levels, except the first, this circuit uses
3-input majority gates. The hardware complexity of this design is
$N(5 + (M/2) + 14/M) - 2M - 6\log_2(4N/M)$ inputs.
Proof. Let $N = 2^n$ and $M = 2^m$. It is easy to see that steps 1 and 2 of the procedure
provide all the Ts and Gs over the ranges $[(s+1)2^t - 1 : s2^t]$, $m - 1 \le t \le n - 1$,
$1 \le s < 2^{n-t}$. Therefore the Ts and Gs required for carry computation in steps 3
and 4 are available. Steps 5 and 6 compute carries from smaller carries and input
operand bits. The threshold gates used satisfy the fan-in bound of $M + 1$ because
at most M operand bits are used in any of these computations. Further, because
each carry depends on a lower carry, they are all computable.
To prove that the carries computed in steps 3 through 6 cover all the carries
$c_0$ through $c_{N-1}$, we first show that steps 3, 4 and 5 provide all the carries with a
separation of M/2. In particular, we show that steps 3, 4 and 5 compute carries $c_i$
such that $(i+1)/(M/2)$ takes all integer values from 1 to $N/(M/2)$. Firstly, note that
the index i of every carry $c_i$ computed in these steps is such that $(i+1)$ is divisible
by $(M/2)$. Thus $(i+1)/(M/2)$ is an integer and lies in the specified range. Clearly,
any positive integer can be uniquely expressed as $(i+1)/(M/2) = q2^p$ for some
odd integer q and a non-negative integer p. It is easy to check that when q = 1 and
p > 0, the corresponding values of i are the indices of carries computed in step 3.
Similarly, when p = 0, the corresponding values of i are the indices of carries computed
in step 5. Finally, when $q \ge 3$ and $p > 0$, the corresponding values of i are the indices
of carries computed in step 4. Thus steps 3, 4 and 5 obtain distinct carries and cover
all carries separated by $(M/2)$. Since step 6 obtains all the intermediate carries, the
procedure described here provides all the carries $c_0$ through $c_{N-1}$.
To compute the depth of the LCTA, we first calculate the delay of each carry
$c_j$ for which $(j+1)/(M/2)$ is an integer. We obtain the depth of such a $c_j$ by
tracing its dependence on the previous carries. Let a carry $c_{j_1}$ be obtained directly
from a carry $c_{j_2}$. The discussion above shows that we can obtain odd integers $q_1$
and $q_2$ and non-negative integers $i_1$ and $i_2$ such that $(j_1+1)/(M/2) = q_1 2^{i_1}$ and
$(j_2+1)/(M/2) = q_2 2^{i_2}$. Steps 3, 4 and 5 of the design procedure then show
that these quantities are related as follows.
$q_2 2^{i_2} = \begin{cases} q_1 2^{i_1-1} & \text{if } q_1 = 1,\ i_1 \ne 0 \text{ (step 3)}, \\ (q_1 - 1)2^{i_1} & \text{if } i_1 = 0 \text{ (step 5)}, \\ (2q_1 - 1)2^{i_1-1} & \text{if } q_1 > 1,\ i_1 \ne 0 \text{ (step 4)}. \end{cases}$ (3.38)
Carry $c_{j_1}$ can be obtained from $c_{j_2}$ if integer $q_1 2^{i_1}$ can be obtained from $q_2 2^{i_2}$
using the rules specified in (3.38). We will denote this relation between the two integers
as $q_1 2^{i_1} \leftarrow q_2 2^{i_2}$. We now consider two cases based on the value of $q_1$ to obtain
the chains of integers that indicate the carries one has to go through to obtain $c_{j_1}$.
Case 1. $q_1 = 1$.
In this case one can see from (3.38) that the carry dependence is given by the chain
$2^{i_1} \leftarrow 2^{i_1-1} \leftarrow 2^{i_1-2} \leftarrow \cdots \leftarrow 2^0 \leftarrow 0.$
Note that the last integer of this chain, 0, corresponds to carry $c_{-1}$. Since the
length of the chain is $i_1 + 1$, we conclude that the delay of computing such a $c_{j_1}$ is
$i_1 + 1$.
Case 2. $q_1 > 1$.
The carry dependence chain in this case is given by
$q_1 2^{i_1} \leftarrow (2q_1 - 1)2^{i_1-1} \leftarrow (4q_1 - 3)2^{i_1-2} \leftarrow (8q_1 - 7)2^{i_1-3} \leftarrow \cdots \leftarrow (2^{i_1}q_1 - 2^{i_1} + 1)2^0 \leftarrow (2^{i_1}q_1 - 2^{i_1}).$ (3.39)
The last integer in this length $i_1 + 1$ chain, $(2^{i_1}q_1 - 2^{i_1})$, is closely related to the
starting integer $q_1 2^{i_1}$. In particular, the binary representation of this integer is
obtained from the binary representation of $q_1 2^{i_1}$ by removing the 1 with the minimum
weight (at bit position $i_1$). Thus each 1 at the kth bit position in the binary
representation of $q_1 2^{i_1}$ can be removed with a carry chain of length $k + 1$. Notice that
every time one such 1 is removed, the weight of the representation decreases by 1.
Eventually, when only one 1 at position t is left in the representation, it can be removed
with a carry chain of length $t + 1$.
From these two cases, one sees that to evaluate the delay of carry $c_j$ in a $2^n$-bit
LCTA, one should first express $(j+1)/(M/2)$ in its binary representation, i.e.,
$(j+1)/(M/2) = \sum_{k=0}^{n-m} x_k 2^k.$
The delay $d(j)$ of carry $c_j$ is then given by
$d(j) = \sum_{k=0}^{n-m} x_k (k+1).$ (3.40)
The worst case delay is therefore given by the carry $c_j$ for which every $x_k$,
$0 \le k \le n - m$, equals 1. The carry with the worst delay (within the set of carries with
integer values $(j+1)/(M/2)$) has index $j = (M/2)(2^{n-m+1} - 1) - 1$ and, from (3.40),
has a delay of $(n-m+1)(n-m+2)/2$.
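The agreement between the dependence rules (3.38) and the delay formula (3.40) can be
checked by walking the chains explicitly. The sketch below is illustrative: the function
names are assumptions, and `v` stands for the integer $(j+1)/(M/2)$ of a carry $c_j$.

```python
def predecessor(v):
    """Apply (3.38) to v = q * 2**p (q odd): the integer of the carry that v is derived from."""
    low_bit = v & -v                         # 2**p
    q = v // low_bit
    if low_bit == 1:                         # p = 0: step 5
        return q - 1
    if q == 1:                               # q = 1, p > 0: step 3
        return low_bit // 2
    return (2 * q - 1) * (low_bit // 2)      # q > 1, p > 0: step 4

def chain_length(v):
    """Number of carry computations needed to reach v starting from 0 (carry c_{-1})."""
    steps = 0
    while v:
        v = predecessor(v)
        steps += 1
    return steps

def delay_formula(v):
    """Delay from (3.40): each 1 at bit position k of v contributes k + 1."""
    return sum(k + 1 for k in range(v.bit_length()) if (v >> k) & 1)
```

For instance, `v = 6` (the carry $c_{11}$ when M = 4) follows the chain 6, 5, 4, 2, 1, 0,
which has five steps, and its binary form 110 gives (1 + 1) + (2 + 1) = 5 as well.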
Note that the calculation of the delays here assumes that when a carry is
computed, the required T and G are already available. We verify this assumption now.
Note that steps 1 and 2 compute T and G over the range $[(s+1)2^t - 1 : s2^t]$ with a
delay of $t - m + 2$. In step 3, computation of $c_{2^i - 1}$ requires $c_{2^{i-1} - 1}$ and T and G
over the range $[2^i - 1 : 2^{i-1}]$. Clearly the delay of these T and G and of $c_{2^{i-1} - 1}$ is
$i - m + 1$. Similarly in step 4, the carry computation requires $c_{(4j+1)2^i(M/2) - 1}$ and T
and G over the range $[(4j+2)2^i(M/2) - 1 : (4j+1)2^i(M/2)]$. These T and G require a
delay of $i + 1$ while the delay of $c_{(4j+1)2^i(M/2) - 1}$ is at least $i + m$ since there is a 1
in the bit position $i + m - 1$ of $(4j+1)2^i(M/2)$. Thus the delays can be calculated only
from the delays of carry propagation.
Finally, step 6 of the design procedure shows that the delay of carries dependent
on this $c_j$ is exactly one more than that of $c_j$. Therefore the delay of the carry
computation in LCTA is $1 + (n-m+1)(n-m+2)/2$.
The complexity (number of inputs to all gates) of the carry calculation circuit
of LCTA can be obtained easily from the complexity of gates in each step in the
procedure. Gates used in steps 1 through 6 require $4N - 2M$, $12(N/M) - 6(n-m+2)$,
$3(n-m+1)$, $3(N/M) - 3(n-m+1)$, $N + (N/M)$ and $(NM/2) - (2N/M)$
inputs respectively. Adding these, one gets the complexity of the N-bit LCTA carry
computation as $N(5 + (M/2) + 14/M) - 2M - 6(n-m+2)$.
3.6 An Enhanced Low Delay Threshold Adder (ELDTA)
The Low Delay Threshold Adder (LDTA) presented in Section 3.4 does not fully
utilize the power of the threshold functions to implement complex Boolean functions.
In particular, steps 3 and 4 of the LDTA design use 3-input majority gates irrespective
of the fan-in bound of the technology. If one instead uses more complex threshold
gates for these computations, one can significantly reduce the hardware complexity
(both gates and interconnects) of the adder without significantly impacting the
delay.
To achieve this, we first explore computing G and T over a larger range from
G and T over multiple smaller ranges using single threshold functions. This would
generalize (3.23) and (3.25). Let $i \ge l > k > j$ and assume that the G and T over
sub-ranges [i : l], [l − 1 : k] and [k − 1 : j] are available. Then one can obtain G and
T over the range [i : j] as:
$G_{i:j} = G_{i:k} + T_{i:k} G_{k-1:j}$
$= G_{i:l} + T_{i:l} G_{l-1:k} + (G_{i:l} + T_{i:l} T_{l-1:k}) G_{k-1:j}$
$= G_{i:l} + T_{i:l}(G_{l-1:k} + T_{l-1:k} G_{k-1:j})$
$= TH(G_{k-1:j}, T_{l-1:k}, G_{l-1:k}, T_{i:l}, G_{i:l};\ 1, 1, 1, 2, 2;\ 4)$, (3.41)
where the last step follows from Theorem 1(P5) and $F_0$ through $F_5$ are the Fibonacci
sequence elements equal to 1, 1, 2, 3, 5 and 8 respectively. (See also examples 5 and
6.)
Similarly, the T function may be computed from the three contiguous sub-ranges
as
$T_{i:j} = G_{i:l} + T_{i:l}(G_{l-1:k} + T_{l-1:k} T_{k-1:j})$
$= TH(T_{k-1:j}, T_{l-1:k}, G_{l-1:k}, T_{i:l}, G_{i:l};\ 1, 1, 1, 2, 2;\ 4)$. (3.42)
Note that both (3.41) and (3.42) use identical threshold functions (same weights
and threshold) to compute the pair of G and T functions over the range [i : j]. All
but the first inputs of these functions are identical as well.
Similar to (3.41) and (3.42), G and T can also be obtained from G and T defined
over 4 or more intervals. For example, when $i \ge l > k > m > j$, G and T over
the range [i : j] can be obtained from those over [i : l], [l − 1 : k], [k − 1 : m] and
[m − 1 : j] using
$G_{i:j} = TH(G_{m-1:j}, T_{k-1:m}, G_{k-1:m}, T_{l-1:k}, G_{l-1:k}, T_{i:l}, G_{i:l};\ 1, 1, 1, 2, 2, 4, 4;\ 8)$,
$T_{i:j} = TH(T_{m-1:j}, T_{k-1:m}, G_{k-1:m}, T_{l-1:k}, G_{l-1:k}, T_{i:l}, G_{i:l};\ 1, 1, 1, 2, 2, 4, 4;\ 8)$.
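The weight pattern 1 | 1, 1 | 2, 2 | 4, 4 | ... with threshold $2^r$ for r upper sub-ranges
can be checked exhaustively against the nested expression. The sketch below is an
illustration under stated assumptions: the helper names are hypothetical, the extension
beyond four intervals is an extrapolation of the pattern shown above, and only input
combinations with $G \le T$ are enumerated, since a range that generates a carry also
transmits one.

```python
from itertools import product

def th(inputs, weights, threshold):
    """Threshold gate: 1 iff the weighted input sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def merge_gate(lowest, pairs):
    """One threshold gate merging sub-ranges.

    lowest: G (or T) of the lowest sub-range.
    pairs:  (G, T) of the remaining sub-ranges, ordered from low to high.
    """
    inputs, weights = [lowest], [1]
    for r, (g, t) in enumerate(pairs):
        inputs += [t, g]
        weights += [1 << r, 1 << r]          # weights 1,1 then 2,2 then 4,4 ...
    return th(inputs, weights, 1 << len(pairs))

def merge_nested(lowest, pairs):
    """Reference: fold G_hi + T_hi * (...) from the lowest sub-range upward."""
    value = lowest
    for g, t in pairs:
        value = g | (t & value)
    return value

def gates_match(num_pairs):
    """Compare both forms for every valid assignment of (G, T) pairs."""
    valid = [(0, 0), (0, 1), (1, 1)]         # G = 1 implies T = 1
    for lowest in (0, 1):
        for pairs in product(valid, repeat=num_pairs):
            if merge_gate(lowest, pairs) != merge_nested(lowest, pairs):
                return False
    return True
```

With one pair the gate reduces to the 3-input majority of (3.23) and (3.25), with two
pairs it is the gate of (3.41) and (3.42), and with three pairs it is the 7-input gate
shown above.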
Note that previously, G and T were computed either from other G and T
functions (see (3.23) and (3.25)) or from direct inputs (see (3.28) and (3.31)). To reduce
the complexity of the adder, we also allow computation of G and T from other G
and T as well as inputs to the adder. Consider the computation of $G_{i:j}$ using $G_{k:j}$
and inputs at bit positions $k + 1$ through i. From (3.27) one gets
$G_{i:j} = a_i b_i + (a_i + b_i)(a_{i-1} b_{i-1} + (a_{i-1} + b_{i-1})(\cdots (a_{k+1} + b_{k+1}) G_{k:j} \cdots))$
$= TH(G_{k:j}, a_{k+1}, b_{k+1}, a_{k+2}, b_{k+2}, \ldots, a_i, b_i;\ 1, 1, 1, 2, 2, \ldots, 2^{i-k-1}, 2^{i-k-1};\ 2^{i-k})$. (3.43)
Note that this last expression is obtained by applying Theorem 1(P4) repeatedly
$i - k$ times.
Similarly, from (3.30) one gets
$T_{i:j} = a_i b_i + (a_i + b_i)(a_{i-1} b_{i-1} + (a_{i-1} + b_{i-1})(\cdots (a_{k+1} + b_{k+1}) T_{k:j} \cdots))$
$= TH(T_{k:j}, a_{k+1}, b_{k+1}, a_{k+2}, b_{k+2}, \ldots, a_i, b_i;\ 1, 1, 1, 2, 2, \ldots, 2^{i-k-1}, 2^{i-k-1};\ 2^{i-k})$. (3.44)
Note that the threshold gates used to obtain $G_{i:j}$ and $T_{i:j}$ in (3.43) and (3.44) are
identical, i.e., they use the same weights and the same threshold.
Finally, in Section 3.3, carries were calculated either from G and T (see (3.21))
or from inputs (see (3.37)). To reduce the complexity, we now compute the carry $c_i$
from $G_{k:j}$, $T_{k:j}$, a previous carry $c_{j-1}$ and some inputs. From (3.35) one can express
$c_k$ as
$c_k = G_{k:j} + c_{j-1} T_{k:j} = TH(c_{j-1}, G_{k:j}, T_{k:j};\ 1, 1, 1;\ 2)$. (3.45)
However, using a carry expression as in (3.33), one gets
$c_i = a_i b_i + (a_i + b_i)(a_{i-1} b_{i-1} + (a_{i-1} + b_{i-1})(\cdots a_{k+2} b_{k+2} + (a_{k+2} + b_{k+2})(a_{k+1} b_{k+1} + c_k(a_{k+1} + b_{k+1})) \cdots))$. (3.46)
Since $c_k$ is a threshold function as in (3.45), one can use Theorem 1(P4) to show that
the $c_i$ expression (3.46) is also a threshold function. Applying this theorem repeatedly
$i - k$ times, one gets
$c_i = TH(c_{j-1}, G_{k:j}, T_{k:j}, a_{k+1}, b_{k+1}, a_{k+2}, b_{k+2}, \ldots, a_i, b_i;\ 1, 1, 1, 2, 2, 4, 4, \ldots, 2^{i-k}, 2^{i-k};\ 2^{i-k+1})$. (3.47)
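The mixed gate of (3.47), which reads a previous carry, one (G, T) pair and some operand
bits, can be compared against a plain ripple computation. This is a minimal sketch with
hypothetical helper names. The G and T values are produced here by the bitwise
recurrences rather than by threshold gates, which is an assumption made only to keep the
example short.

```python
from itertools import product

def th(inputs, weights, threshold):
    """Threshold gate: 1 iff the weighted input sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def gt_reference(a, b, hi, lo):
    """G and T over [hi:lo] via G = ab + (a + b)G' and T = ab + (a + b)T'."""
    g, t = a[lo] & b[lo], a[lo] | b[lo]
    for p in range(lo + 1, hi + 1):
        g = (a[p] & b[p]) | ((a[p] | b[p]) & g)
        t = (a[p] & b[p]) | ((a[p] | b[p]) & t)
    return g, t

def carry_mixed(a, b, c_prev, j, k, i):
    """c_i from c_{j-1}, G and T over [k:j], and operand bits k+1..i, as in (3.47)."""
    g, t = gt_reference(a, b, k, j)
    inputs, weights = [c_prev, g, t], [1, 1, 1]
    for p in range(k + 1, i + 1):
        inputs += [a[p], b[p]]
        weights += [1 << (p - k)] * 2        # bit k+1 gets weight 2, bit i gets 2^(i-k)
    return th(inputs, weights, 1 << (i - k + 1))

def carry_ripple(a, b, c_prev, j, i):
    """Reference: ripple the carry c_{j-1} through bit positions j..i."""
    c = c_prev
    for p in range(j, i + 1):
        c = (a[p] & b[p]) | ((a[p] | b[p]) & c)
    return c

def mixed_carry_matches(width):
    """Exhaustive comparison over all operands and all index triples j <= k <= i."""
    for bits in product((0, 1), repeat=2 * width):
        a, b = bits[:width], bits[width:]
        for j in range(width):
            for k in range(j, width):
                for i in range(k, width):
                    for c_prev in (0, 1):
                        if carry_mixed(a, b, c_prev, j, k, i) != carry_ripple(a, b, c_prev, j, i):
                            return False
    return True
```

When `i == k` the gate has only the three unit-weight inputs and threshold 2, which is
the majority form (3.45).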
The strategy to design the new N-bit adder using threshold functions with
fan-in bound of $2M + 1$ can now be described as follows. Here, for computational
convenience, we assume $N = (M+1)^{d-M} + M^2 - 1$. As will be seen in the algorithm, d
represents the depth (number of levels) of the algorithm. We also need the following
definitions. Let $R_i = M(M+1)^{i-M}$. Define $s^i_k$, $M + 1 \le i < d$, $0 < s^i_k < N$,
recursively as: $s^{M+1}_0 = R_{M+1}$, $s^{i+1}_0 = s^i_0 + R_i$ and $s^i_k = s^i_0 + kR_i$.
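A short numerical sketch of these definitions follows. The function name and the
dictionary layout are illustrative assumptions. The values are computed for levels up
to d so that the relation $s^d_0 = N$, which follows from summing the geometric series
of the $R_i$, can be observed directly.

```python
def eldta_parameters(m_par, depth):
    """Return N, the block sizes R_i and the offsets s^i_0 for M = m_par and d = depth.

    Assumes depth >= m_par + 1 so that level M + 1 exists.
    """
    n_bits = (m_par + 1) ** (depth - m_par) + m_par ** 2 - 1
    # R_i = M (M + 1)^(i - M) for levels M + 1 .. d
    r = {i: m_par * (m_par + 1) ** (i - m_par) for i in range(m_par + 1, depth + 1)}
    # s^{M+1}_0 = R_{M+1}, then s^{i+1}_0 = s^i_0 + R_i
    s0 = {m_par + 1: r[m_par + 1]}
    for i in range(m_par + 1, depth):
        s0[i + 1] = s0[i] + r[i]
    return n_bits, r, s0
```

For M = 2 and d = 5, this gives N = 30 with $R_3 = 6$, $R_4 = 18$ and offsets
$s^3_0 = 6$, $s^4_0 = 12$, $s^5_0 = 30$, which matches the 30-bit, fan-in-5 adder size
mentioned for Figure 3.4.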
Design of the Enhanced Low Delay Threshold Adder (ELDTA).
1. For each level i, $1 \le i \le M + 1$, obtain G and T over the range
$[s^{M+1}_k + iM - 1 : s^{M+1}_k]$, for all k such that $0 < s^{M+1}_k < N$, from the adder
inputs using (3.28) and (3.31) when $i = 1$, and using G and T over the range
$[s^{M+1}_k + (i-1)M - 1 : s^{M+1}_k]$ and the proper adder inputs as in (3.43) and
(3.44) when $2 \le i \le M + 1$.
2. For each level i, $M + 2 \le i < d$, obtain G and T over the range
$[s^i_k + jM - 1 : s^i_k]$, for values of j satisfying $R_{i-1} < jM \le R_i$, from G and T in
level $(i - 1)$ with at most one pair of G and T from levels lower than $(i - 1)$ using
(3.41) and (3.42).
3. For each level i, $1 \le i \le M + 1$, obtain carries $c_{(i-1)M + k}$, $0 \le k < M$, from
$c_{(i-1)M - 1}$ using (3.37).
4. For each level i, $M + 2 \le i \le d$, let $t_i = s^{i-1}_0$. Obtain carries $c_j$,
$t_i \le j < t_{i+1}$, from $c_{t_i - 1}$ and
• adder inputs, when $t_i \le j \le t_i + M - 2$, using (3.37);
• G and T over $[t_i + Y(j)M - 1 : t_i]$, where $Y(j) = \lceil (j - t_i - M + 2)/M \rceil$,
and adder inputs, when $t_i + M - 1 \le j < t_{i+1} - 1$, using (3.47);
• G and T over $[j : t_i]$, when $j = t_{i+1} - 1$, using (3.36).
5. Obtain the sum bits from the carries using (3.2).
Step 1. For level $i = 1$, each G and T is computed from 2M inputs with one
threshold gate with fan-in bound of $2M + 1$. For $2 \le i \le M + 1$, one combines G
or T from level $i - 1$ and 2M inputs in a threshold gate with fan-in bound $2M + 1$.
Step 2. From the definition of $s^i_k$, one has
$s^i_k = s^{i-1}_{k(M+1)+1}$ and $s^i_{k+1} = s^i_k + R_i$. (3.48)
With (3.48), one can partition the range $[s^i_k + jM - 1 : s^i_k]$, $R_{i-1} < jM \le R_i$, into
as many sub-ranges, each with $R_{i-1}$ elements, as possible:
$[s^i_k + jM - 1 : s^i_k] = [s^i_k + jM - 1 : s^{i-1}_{k(M+1)+t} + R_{i-1}] \cup \bigcup_{1 \le r \le t} [s^{i-1}_{k(M+1)+r} + R_{i-1} - 1 : s^{i-1}_{k(M+1)+r}]$. (3.49)
Note that since the number of elements in the original range, jM, may not be a
multiple of $R_{i-1}$, the first subrange may have fewer than $R_{i-1}$ elements. The value of
t is given by
$t = \lceil jM/R_{i-1} \rceil - 1$. (3.50)
The expression for the first of these subranges can be rewritten as
$[s^i_k + jM - 1 : s^{i-1}_{k(M+1)+t} + R_{i-1}] = [s^{i-1}_{k'} + jM - tR_{i-1} - 1 : s^{i-1}_{k'}]$, (3.51)
where $k' = k(M+1) + t + 1$. When $i > M + 2$, this range is identical to one at
level $i - 1$ since $(jM - tR_{i-1}) < R_{i-1}$ from (3.50). Therefore the required G and T
over this range are already available. When $i = M + 2$, using the fact that $R_{i-1}$ is a
multiple of M, this subrange can be further simplified to
$[s^i_k + jM - 1 : s^{i-1}_{k(M+1)+t} + R_{i-1}] = [s^{M+1}_{k'} + i'M - 1 : s^{M+1}_{k'}]$, (3.52)
where $i' \le M + 1$ because the number of elements in the range is at most
$R_{M+1} = M(M+1)$. This range is identical to the one used in level $(M + 1)$ of step 1,
therefore the required G and T over it are available at level $(M + 2)$.
Similarly, for $i = M + 2$, the expression for each of the last t sub-ranges in (3.49),
$[s^{i-1}_{k(M+1)+r} + R_{i-1} - 1 : s^{i-1}_{k(M+1)+r}]$, can be rewritten as
$[s^{M+1}_{k'} + M(M+1) - 1 : s^{M+1}_{k'}]$, where $k' = k(M+1) + r$. This range is thus identical
to the range for level $M + 1$ in step 1 of the algorithm. Therefore the Gs and Ts over
these ranges are available from level $M + 1$. Similarly, when $i > M + 2$, this range,
$[s^{i-1}_{k(M+1)+r} + R_{i-1} - 1 : s^{i-1}_{k(M+1)+r}]$, is identical to the level $(i - 1)$ range since
$R_{i-2} < R_{i-1} < R_i$. The required G and T over this range are therefore available from
level $i - 1$.
Finally, note that each G or T in step 2 is computed from one G or T and t pairs
of G and T. Thus it is a threshold function of $2t + 1 = 2\lceil jM/R_{i-1} \rceil - 1$ variables
from (3.50). But since $jM \le R_i = (M+1)R_{i-1}$, this threshold function uses at
most $2M + 1$ variables and can therefore be implemented by a threshold gate with
fan-in bound of $2M + 1$.
Step 3. Follows directly from (3.37). In each carry calculation, the $2(k + 1)$ operand
bits at positions $(i-1)M$ through $(i-1)M + k$ and a previous carry are combined in a
threshold gate with $2k + 3$ inputs. Since $k < M$, the fan-in bound of $2M + 1$ is satisfied.
Step 4. We first show that the required carry $c_{t_i - 1}$ is available for use in level i.
When $i = M + 2$, this carry, $c_{M(M+1) - 1}$, is the carry of level $M + 1$ with $k = M - 1$
obtained in step 3. If $i > M + 2$, $c_{t_i - 1}$ is the carry obtained in level $i - 1$ of step 4
with the maximum value of j for that level.
When $t_i \le j \le t_i + M - 2$, carry $c_{t_i - 1}$ is combined with adder inputs at bit
positions $t_i$ to j. Therefore the number of inputs to the threshold gate is
$2(j - t_i + 1) + 1 \le 2M + 1$, which clearly satisfies the fan-in bound.
Consider now values of j satisfying $t_i + M - 1 \le j \le t_{i+1} - 2$. In this case, one
needs G and T over the range $[t_i + Y(j)M - 1 : t_i]$, where
$Y(j) = \lceil (j - t_i - M + 2)/M \rceil$. However, for the j values used in this case,
$1 \le Y(j) \le R_{i-1}/M$. When $1 \le Y(j) \le M + 1$, the range over which G and T are needed
can be expressed as $[s^{M+1}_0 + i'M - 1 : s^{M+1}_0]$, where $1 \le i' \le M + 1$. These Gs and
Ts are available from step 1. On the other hand, when
$R_{M+1}/M = M + 1 < Y(j) \le R_{i-1}/M$, we can show that the Gs and Ts needed are
available from step 2. In particular, when $R_{i'-1}/M < Y(j) \le R_{i'}/M$,
$M + 2 \le i' \le i - 1$, one can see that the Gs and Ts needed are over the range
$[s^{i'}_0 + j'M - 1 : s^{i'}_0]$, where $j' = Y(j)$ satisfies the conditions required in step 2.
Thus these Gs and Ts are available from level $i' \le i - 1$ of step 2. To compute carry
$c_j$ in this case, one needs a threshold gate whose inputs are the carry $c_{t_i - 1}$, a pair
of G and T over $[t_i + Y(j)M - 1 : t_i]$ and adder inputs at bit positions $t_i + Y(j)M$ to j.
Thus its fan-in is $3 + 2(j - t_i - Y(j)M + 1)$. However, from the definition of $Y(j)$, one
has $j - t_i - Y(j)M \le M - 2$. Therefore the threshold gate fan-in is less than or equal
to the fan-in bound of $2M + 1$.
Finally, when $j = t_{i+1} - 1$, the G and T over the range $[t_{i+1} - 1 : t_i]$ are needed.
However, this range can be rewritten as $[s^{i-1}_0 + R_{i-1} - 1 : s^{i-1}_0]$. When level
$i = M + 2$, $R_{i-1} = R_{M+1} = M(M+1)$ and therefore one has these G and T from step 1,
level $M + 1$. If $i > M + 2$, one has the required G and T from step 2, level $i - 1$.
The N-bit ELDTA, $N = (M+1)^{d-M} + M^2 - 1$, $d \ge M$, has a delay of $d + 1$.
It uses $2d(N - M^2 + 1)/(M+1) + 2M^2 - 2$ threshold gates with a fan-in bound
of $2M + 1 \ge 5$ and has an interconnect complexity of
$MN + 6N - 6M - 4M^2 + (4MN + 6N)/(M+1) - (2M-2)(d-M-1) + (M+2)((2NM - 2M^3 + 2M)(d-M-2) - 2N + 4M^2 + 4M)/(M(M+1))$.
Figure 3.4 shows a 30 bit ELDTA realized with threshold gates with fan-in bound
of 5. The description of the gates in the figure is provided in the legend.

      
  
 
    
 
 


   
   


  
  
  



[Figure 3.4 diagram; panel labels: gates calculating Gs, gates calculating Ts, gates calculating carries.]
Figure 3.4: The architecture of a 30-bit ELDTA with fan-in 2M + 1 = 5.
Table 3.2: Comparison of the delay of threshold implementations of CPA, GCLA,
LDTA and ELDTA using gates with fan-in bound of 5.

    N    CPA   GCLA   LDTA   ELDTA
    8      9      8      4       5
   16     17      9      5       6
   32     33     12      6       7
   64     65     13      7       7
  128    129     16      8       8
  256    257     17      9       9
3.7 Discussion and Conclusion of Threshold Adders
This work has developed a general strategy to design adder architectures directly in
terms of threshold functions that are compatible with nanoelectronics. To increase
the implementation reliability, we have assumed that the threshold gates used have a
bounded fan-in. Our strategy is based on defining new logic primitives, the Gs and Ts of
operand bits (Section 3.3), such that their evaluation as well as their combination requires
only threshold gates. These primitives may be computed over appropriate ranges
of operand bits, either directly (see (3.28) and (3.31)) or by combining those over
smaller ranges (see (3.22)-(3.25)). Once the appropriate Gs and Ts are available,
the carries are computed from these and some previous carry (see (3.21)), or directly
from the input operand bits and some previous carry (see (3.34) and (3.37)). Sum
bits may be computed from the carries in one level of threshold gates. The chosen
interdependence between carries (the carry chain) determines the delay and the
hardware complexity of the adder. By experimenting with the carry chains, one can
choose a suitable compromise between the complexity and the delay.

Table 3.3: Comparison of the complexity of threshold implementations of CPA,
GCLA, LDTA and ELDTA using gates with fan-in bound of 5.

            CPA             GCLA             LDTA            ELDTA
    N   gates inputs    gates inputs    gates inputs    gates inputs
    8      16     56       46    142       32    106       18     70
   16      32    112       92    284       80    258       42    170
   32      64    224      190    590      192    610      102    418
   64     128    448      380   1180      448   1410      236    958
  128     256    896      766   2382     1024   3202      514   2116
  256     512   1792     1532   4764     2304   7170     1174   4848
This work has applied the strategy to obtain an adder architecture, LDTA, with
a very low delay. This adder has 3-input majority gates on all but the first and the
last levels. Tables 3.2 and 3.3 compare the new LDTA architecture with CPA and
GCLA.
One can see from Tables 3.2 and 3.3 that the complexity of LDTA is comparable
to that of GCLA up to N = 64. However, the delay of GCLA is about twice
that of LDTA. The delays of GCLA and LDTA are O(log N) and O(log(N/M))
respectively. Thus, devices with larger fan-in can further decrease the delay of
LDTA. The CPA complexity is lower than that of the LDTA by a logarithmic factor (O(N)
versus O(N log(N/M))). However, as far as the delay is concerned, LDTA is far
superior (O(log(N/M)) as against O(N)).
The strategy presented in this work provides a new direction in adder architecture
exploration. It is applicable to multiple nanotechnologies because it exploits the
identical logic primitives in all these technologies. We believe that one can use the
tools developed in this work to design other new adders with the right balance of the
delay and the complexity, all the while remaining within the realistic fan-in bounds.
Chapter 4
Tree Implementation of
Combinational Functions
4.1 Introduction
As mentioned in Section 2.6, fan-in is an important factor in reliable digital design
with nanotechnology. In [42], a general scheme is provided that
can be used to decompose any threshold function into a network of threshold gates
with bounded fan-in. This decomposition employs the novel concept of the error of the
threshold function. It is shown that the value of the error is always non-negative
and can be obtained by adding the (non-negative) errors due to independent groups of
inputs. When the total error of a threshold function exceeds the critical error, the
output of the function is 0; otherwise it is 1. The critical error depends on the
weights and the threshold of the function, and plays a central role in the design of
our network. The network can be visualized as made up of fragments, which compute
the errors of different groups of inputs, and a binary tree of recombiners, which add
these errors so that their total can eventually be compared to the critical error.
Our work uses the same concepts of error and critical error for decomposing
a threshold function, though with a different approach to collecting and calculating
the amount of error. To be more specific, in [42], a binary tree structure is used
for this task. Our work extends this structure to a more general k-ary tree
to provide greater flexibility in the decomposition architecture. We show that
this new procedure results in better performance in some applications. We give the
decomposition of comparison logic as an example. The resultant circuit is superior
in both speed and hardware complexity.
4.1.1 Background
We restate the concepts and some preliminaries of the decomposition scheme from
[42] here. Our work is based on the same concepts with extension and generalization.
The classical decomposition of threshold functions exploits the fact that threshold
functions are unate$^1$. A unate function $f(x_1, x_2, \ldots, x_n)$ can be decomposed
as [43]
\[
f(x_1, x_2, \ldots, x_n) =
\begin{cases}
f_1(x_2, x_3, \ldots, x_n) + x_1 f_2(x_2, x_3, \ldots, x_n) & \text{if } x_1 \text{ is positive} \\
f_1(x_2, x_3, \ldots, x_n) + \bar{x}_1 f_2(x_2, x_3, \ldots, x_n) & \text{if } x_1 \text{ is negative}
\end{cases}
\tag{4.1}
\]
If f in (4.1) is threshold, then so are $f_1$ and $f_2$. Equation (4.1) can be employed
recursively to reduce the fan-in of the threshold functions to any desired small value
M (< n). The final decomposition results in a binary tree of $n - M$ levels with
$2^{n-M} - 1$ internal nodes, each representing a 3-input threshold function, and $2^{n-M}$
leaves representing M-input threshold functions (not necessarily distinct). Clearly,
this decomposition has a depth of $(n - M + 1)$, which is $O(n - M)$, and a size of
$O(2^{n-M})$.
In [44] an alternate decomposition of threshold functions into a network of
bounded fan-in threshold functions is proposed. The new decomposition has a
polynomial size (with respect to n). The blocks to which inputs are applied are called
the Fragments and the blocks which combine the outputs of the fragments into the
final output of the function are called the Recombiners. This scheme is shown in
Fig. 4.1. We show that each output of a fragment and a recombiner is a threshold
function.

1 A unate function is a Boolean function in which every variable is either positive or negative.
A variable (say) $x_1$ is positive in $f(x_1, x_2, \ldots, x_n)$ if $f(1, x_2, \ldots, x_n) \ge f(0, x_2, \ldots, x_n)$ for all $x_2, x_3, \ldots, x_n$.
It is negative if $f(1, x_2, \ldots, x_n) \le f(0, x_2, \ldots, x_n)$.

[Figure 4.1 diagram: groups of the inputs $x_1, \ldots, x_n$ are applied to Fragments A, B, C and D; Recombiners E and F combine fragment outputs, and Recombiner G produces the final output $g_R$.]
Figure 4.1: A threshold function of n variables and threshold T partitioned into
four fragments and recreated by a tree of recombiners. R equals sum of all positive
weights minus T.
A global quantity, the error of the threshold function, is defined to avoid large
weighted sums in the fragments and recombiners. The error E of the threshold function f
is defined as
\[
E = K - \sum_{i=1}^{n} w_i x_i, \quad \text{where } K = \sum_{i=1,\, w_i > 0}^{n} w_i. \tag{4.2}
\]
Clearly, $E \ge 0$ because K is the largest value of $\sum w_i x_i$. This non-negative character
of E is ideal for information transfer between the fragments and recombiners.
Now define R as a constant $R = K - T$, where T is the threshold of the function
being decomposed. It is easy to see that $R \ge 0$; otherwise the function would always
be 0. We can determine the output of the function from E and R. To do this, recall
that a threshold function $f(x_1, x_2, \ldots, x_n) = 1$ if and only if $\sum_{i=1}^{n} w_i x_i \ge T$. But
from (4.2), this condition can be rewritten as
\[
f(x_1, x_2, \ldots, x_n) =
\begin{cases}
1 & \text{if } E \le R \\
0 & \text{otherwise.}
\end{cases}
\tag{4.3}
\]
Because of (4.3), R is referred to as the critical error of the threshold function.
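The equivalence between the weighted-sum test and the error test in (4.3) can be checked exhaustively on a small example. The following is a minimal sketch; the weights and threshold are illustrative values chosen here, not taken from the text, and the function names are hypothetical.

```python
from itertools import product

def threshold(weights, T, x):
    """Threshold function: 1 if the weighted sum of the inputs reaches T, else 0."""
    return int(sum(w * xi for w, xi in zip(weights, x)) >= T)

def error(weights, x):
    """Error E = K - sum(w_i * x_i), with K the sum of the positive weights, as in (4.2)."""
    K = sum(w for w in weights if w > 0)
    return K - sum(w * xi for w, xi in zip(weights, x))

def critical_error(weights, T):
    """Critical error R = K - T."""
    return sum(w for w in weights if w > 0) - T

# Illustrative (assumed) weights and threshold.
weights, T = [2, -2, 4, -6, 3], 3
R = critical_error(weights, T)

for x in product((0, 1), repeat=len(weights)):
    E = error(weights, x)
    assert E >= 0                                    # E is never negative
    assert threshold(weights, T, x) == int(E <= R)   # condition (4.3)
```

The loop covers all input combinations, so for this example the two formulations agree everywhere.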
In the proposed decomposition, we compute the error of each fragment j based
on the set of indices $S_j$ that determines its inputs as
\[
E_j = K_j - \sum_{i \in S_j} w_i x_i, \quad \text{where } K_j = \sum_{i \in S_j,\, w_i > 0} w_i. \tag{4.4}
\]
Note that, similar to E, $E_j$ is also non-negative. By adding the $E_j$s of all the fragments
one gets
\[
\sum_j E_j = \sum_j K_j - \sum_j \sum_{i \in S_j} w_i x_i
= K - \sum_{i=1}^{n} w_i x_i = E. \tag{4.5}
\]
Thus the total error E needed to establish the value of the function by (4.3)
can be obtained by summing the $E_j$s of individual fragments. We use a binary tree of
recombiners to add these $E_j$s. Since one is only interested in comparing E to R, a
fragment or a recombiner only needs to transfer the value of error if it is less than
or equal to R. Thus all the values (of error) transmitted within our network will be
bounded by 0 on the lower side and R on the higher side.
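The additivity of fragment errors stated in (4.5) can be illustrated numerically. This is a small sketch under assumed values: the weights and the partition into index sets are hypothetical and serve only to exercise the identity.

```python
from itertools import product

def error(weights, x):
    """Error K - sum(w_i * x_i), with K the sum of the positive weights."""
    K = sum(w for w in weights if w > 0)
    return K - sum(w * xi for w, xi in zip(weights, x))

# Assumed example: six weights split into three fragments (index sets S_j).
weights = [2, -2, 4, -6, 3, -1]
groups = [[0, 1], [2, 3], [4, 5]]

def fragment_errors(x):
    """Errors E_j of the individual fragments, computed as in (4.4)."""
    return [error([weights[i] for i in S], [x[i] for i in S]) for S in groups]

for x in product((0, 1), repeat=len(weights)):
    E_parts = fragment_errors(x)
    assert all(e >= 0 for e in E_parts)
    assert sum(E_parts) == error(weights, x)   # identity (4.5)
```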
We use $(R+1)$ outputs from each fragment to indicate the value of $E_j$. The t-th
output from a fragment, a Boolean variable $p_t$, $0 \le t \le R$, is defined as
\[
p_t =
\begin{cases}
1 & \text{if error } E_j \text{ in that fragment} \le t \\
0 & \text{if error } E_j \text{ in that fragment} > t
\end{cases}
\tag{4.6}
\]
Note that the $p_t$s defined by (4.6) are related. In particular, one has
\[
p_i \le p_j \quad \text{if } i < j. \tag{4.7}
\]
Equation (4.7) has the following implications which are used later.
\[
p_i \cdot p_j = p_i \quad \text{if } i < j, \qquad
p_i + p_j = p_j \quad \text{if } i < j. \tag{4.8}
\]
The implementation of the fragments is stated in the following theorem.
Theorem 5 (Fragment outputs) Each output $p_t$ of fragment j is a threshold
function of inputs $x_i$ with weights $w_i$, $i \in S_j$, and a threshold of $K_j - t$.
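Theorem 5 and the ordering property (4.7) can be checked on a small fragment. The sketch below uses assumed weights and an assumed value of R; the helper names are hypothetical.

```python
from itertools import product

def fragment_outputs(weights, R, x):
    """Outputs p_0..p_R; p_t is a threshold function with threshold K_j - t (Theorem 5)."""
    K_j = sum(w for w in weights if w > 0)
    s = sum(w * xi for w, xi in zip(weights, x))
    return [int(s >= K_j - t) for t in range(R + 1)]

def fragment_error(weights, x):
    """Fragment error E_j as in (4.4)."""
    K_j = sum(w for w in weights if w > 0)
    return K_j - sum(w * xi for w, xi in zip(weights, x))

# Assumed fragment weights and critical error, for illustration only.
frag_weights, R = [3, -1, 2], 4

for x in product((0, 1), repeat=len(frag_weights)):
    p = fragment_outputs(frag_weights, R, x)
    E_j = fragment_error(frag_weights, x)
    assert p == [int(E_j <= t) for t in range(R + 1)]   # definition (4.6)
    assert all(p[i] <= p[i + 1] for i in range(R))      # ordering (4.7)
```

The outputs form a run of 0s followed by a run of 1s, which is what makes the identities in (4.8) hold.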
Note that using $(R + 1)$ output lines to carry $\log_2(R + 1)$ bits of information
from a fragment seems wasteful. If one attempts to pass this same value as a binary
number of $\log_2(R + 1)$ bits, then the output lines are indeed reduced, but unlike
the $p_t$s, they are not outputs of threshold functions. The outputs of all the fragments
as defined here may be combined by a single threshold gate to provide a two level
decomposition of any threshold function as given in the following theorem.
It is also shown that all the outputs of any recombiner are themselves majority
functions. Further, each of these functions can be easily decomposed into threshold
functions with bounded fan-in.
Note that each recombiner outputs the combined error in all the fragments feeding
into it. As before, we are only interested in the error values between 0 and R
and can output these through $(R + 1)$ lines defined in the same manner as in (4.6).
Let $p_i$, $q_i$, $0 \le i \le R$, represent the inputs to a recombiner from its two parents. The
t-th output of the recombiner, Boolean variable $s_t$, is based on the $(2t+2)$ inputs $p_i$,
$q_i$, $0 \le i \le t$, and is given by the Boolean expression
\[
s_t = \sum_{i=0}^{t} p_i q_{t-i}, \quad 0 \le t \le R. \tag{4.9}
\]
Relation (4.9) may be justified by noting that the product $p_i q_{t-i}$ is 1 only if one
parent of the recombiner has at most i errors and the other, at most $t - i$ errors.
Thus, each term of the summation (4.9) accounts for a case when the total errors
are t or less. Each term is 0 if the combined error from the two parents is greater
than t.
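The behavior of the recombiner expression (4.9) can be sketched directly in terms of the error encoding (4.6). The helper names below are hypothetical, and parent errors are simply enumerated rather than produced by actual fragments.

```python
def encode(err, R):
    """Outputs p_0..p_R for a given error value, with p_t = 1 iff err <= t, as in (4.6)."""
    return [int(err <= t) for t in range(R + 1)]

def recombine(p, q):
    """Binary recombiner (4.9): s_t is the OR over i of (p_i AND q_{t-i})."""
    R = len(p) - 1
    return [int(any(p[i] and q[t - i] for i in range(t + 1))) for t in range(R + 1)]

R = 5
# Enumerate parent errors, including values above R (encoded as all zeros).
for e1 in range(R + 3):
    for e2 in range(R + 3):
        assert recombine(encode(e1, R), encode(e2, R)) == encode(e1 + e2, R)
```

The check confirms that the recombiner output is the same encoding applied to the sum of the two parent errors.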
Note that the output $s_t$ of a recombiner is 1 if the total number of errors in all
the fragments of which it is a descendant is less than or equal to t. Because of this
similarity of $s_t$ with the output $p_t$ of a fragment, $s_t$ also satisfies the condition (4.7).
Thus the output of a recombiner whose parents are recombiners behaves similarly to
the output of a recombiner whose parents are fragments.
Theorem 6 describes the threshold nature of each output of a recombiner and
Theorem 7 states that it can be decomposed into a network of majority gates.
Theorem 6 (Recombiner outputs) Function st defined by (4.9) is a majority
function.
Theorem 7 (Recombiner output implementation) Each output st defined by
(4.9) can be implemented using a multilevel network of generalized majority functions
with any given bound on the fan-in.
The hardware complexity of threshold decomposition can be reduced by making
intelligent groupings of the inputs based on their weights. We show that the choice
of the partition often affects the complexity of the fragments as well as that of
the recombiners. In particular, if the greatest common divisor (gcd) of the input
weights in a fragment is greater than 1, then some outputs of that fragment and its
descendant recombiners may be redundant. Similarly, if the weights in a fragment
are small compared with R, the fragment and its descendant recombiners may have
an output redundancy.
Two kinds of redundancies in the fragment and the recombiner outputs are
explored. Let the output of a fragment or a recombiner be denoted by $p_i$, $0 \le i \le R$.
We often refer to this entire sequence of outputs simply by p. When every k
consecutive $p_i$s are the same irrespective of the input, as in (4.10) below,
\[
p_{kt+a} = p_{kt}, \quad 0 \le a < k, \text{ for all } t, \tag{4.10}
\]
we say that p has a block redundancy of k.
The second kind of redundancy arises because of the relationship between the $p_i$s
given in (4.7). This equation suggests that when some $p_i = 1$, all subsequent $p_i$s
are 1 as well. When $p_i = 1$ for all $i \ge B$ irrespective of the input, we say that p is
bounded by B.
In [44], theorems of hardware reduction for a binary tree decomposition are
introduced. We describe them briefly here. The proofs are omitted for brevity. In
a later section we generalize these theorems to the k-ary tree decomposition.
The following two theorems use the block redundancy and bound properties of
fragments to reduce the number of fragment outputs that need to be calculated.
Theorem 8 (Fragment Redundancy-I) If the weights in a fragment have the
greatest common divisor of g, then its output p has a block redundancy of g.
Theorem 9 (Fragment Redundancy-II) Let $w_i$, $i \in S_j$, denote the weights of
inputs to the j-th fragment. Then the output of the fragment is bounded by
$\sum_{i \in S_j} |w_i|$.
The next three theorems allow us to reduce recombiner complexity using input
redundancies.
Theorem 10 (Recombiner Redundancy-I) If the input p of a recombiner has
a block redundancy of g, then the recombiner outputs can be expressed as
\[
s_t = \sum_{i=0}^{\lfloor t/g \rfloor} p_{gi}\, q_{t-gi}, \tag{4.11}
\]
where q is its other input.
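The reduced expression (4.11) can be compared with the behavior expected of a recombiner on a small case. In this sketch the block redundancy of p is modeled by restricting its error to multiples of g, which is an assumption made here for illustration; the helper names are hypothetical.

```python
def encode(err, R):
    """Outputs p_0..p_R with p_t = 1 iff err <= t, as in (4.6)."""
    return [int(err <= t) for t in range(R + 1)]

def recombine_block(p, q, g):
    """Reduced recombiner (4.11): only the outputs p_{g*i} of the redundant input are used."""
    R = len(p) - 1
    return [int(any(p[g * i] and q[t - g * i] for i in range(t // g + 1)))
            for t in range(R + 1)]

R, g = 8, 3
for e1 in range(0, R + 4, g):        # errors that are multiples of g (assumed model)
    for e2 in range(R + 2):
        assert recombine_block(encode(e1, R), encode(e2, R), g) == encode(e1 + e2, R)
```

Each output now needs about t/g product terms instead of t + 1.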
Theorem 11 (Recombiner Redundancy-II) Let inputs p and q of a recombiner
have block redundancies of $g_1$ and $g_2$ respectively. Then its output s has a block
redundancy of $g = \gcd(g_1, g_2)$.
Theorems 10 and 11 play an important role in minimizing the recombiner
architecture when both its inputs have block redundancies. Theorem 11 shows that one
only needs to compute every g-th output of such a recombiner, where g is the gcd of the
two block redundancies. Theorem 10 shows that the architecture for each of these
outputs can be reduced by a factor equal to the larger of the two redundancies.
We next explore the bounds on the output of a recombiner.
Theorem 12 (Recombiner Redundancy-III) Let p and q denote the inputs of
a recombiner and g, the block redundancy of its output. Also, let $p_{gi} = 1$ if $i \ge P$
and $q_{gi} = 1$ if $i \ge Q$. Then the output of the recombiner satisfies
\[
s_{gt} = 1, \quad \text{if } t \ge P + Q. \tag{4.12}
\]
Note that Theorems 9 and 12 together imply that a recombiner output is bounded
by the sum of absolute values of the weights within all the fragments of which it is
a descendant.
Finally we present a theorem that exploits the input block redundancies together
with their bounds.
Theorem 13 (Recombiner Redundancy-IV) Let inputs p and q of a recombiner
have block redundancies $g_1$ and $g_2$ respectively with $g = \gcd(g_1, g_2)$. Further, let p
and q be bounded such that $p_{ug} = 1$ for $u \ge P$ and $q_{vg} = 1$ for $v \ge Q$. Then, the
output of the recombiner is given by
\[
s_{gt} = p_{(t-Q)g} + q_{\lfloor (t-P)g/g_2 \rfloor g_2}
+ \sum_{i=\lfloor (t-P)g/g_2 \rfloor + 1}^{Qg/g_2 - 1} p_{gt - i g_2}\, q_{i g_2} \tag{4.13}
\]
4.2 k-ary tree decomposition structure
In the previous section, the concept of error was introduced, together with how error
is defined in fragments and accumulated through recombiners. In [42], error is
accumulated through recombiners that have a binary tree structure. We extend
the structure into a k-ary structure. We show that the new structure can also be
used as a general decomposition scheme for threshold logic and that the hardware
reduction theorems described in the previous section are also applicable to the k-ary
tree decomposition.
Figure 4.2 shows a general k-ary tree decomposition structure. Instead of
recombining outputs from two parent fragments or recombiners as in the binary structure,
each recombiner recombines the outputs of k parent fragments or recombiners, where
k is a parameter.
[Figure 4.2 diagram: groups of inputs are applied to Fragments A, B, C and D; Recombiners E and F each combine the outputs of k parents, and Recombiner G produces the final output $g_R$.]
Figure 4.2: A general structure of a k-ary tree decomposition.
The implementation and properties of the fragments of the k-ary tree decompo-
sition are the same as those of the binary tree that are shown in Theorem 5 and
Equations 4.6, 4.7, 4.8.
Instead of combining error from two parent fragments or recombiners as in the
binary tree decomposition, now each recombiner in the k-ary tree decomposition
would combine error from k parents. We define the outputs from the j-th parent as $p^j_{i_j}$,
where $0 \le j \le k - 1$ and $0 \le i_j \le R$. The output of the recombiner $s_t$ is 1 if the
total error of all the parenting fragments or recombiners is at most t. Thus $s_t$ can
be given by the Boolean expression
\[
s_t = \sum p^0_{i_0} p^1_{i_1} \cdots p^{k-1}_{i_{k-1}}, \quad \text{where } i_0 + i_1 + \cdots + i_{k-1} = t \tag{4.14}
\]
or
\[
s_t = \sum \prod_{j=0}^{k-1} p^j_{i_j}, \quad \text{where } \sum_{j=0}^{k-1} i_j = t. \tag{4.15}
\]
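The k-ary recombiner expression (4.15) can be evaluated by brute force for small k and R. The following sketch enumerates all index tuples whose sum is t; the function names are hypothetical, and this enumeration is only meant to illustrate the expression, not to suggest a hardware structure.

```python
from itertools import product

def encode(err, R):
    """Outputs p_0..p_R with p_t = 1 iff err <= t, as in (4.6)."""
    return [int(err <= t) for t in range(R + 1)]

def recombine_k(parents):
    """k-ary recombiner (4.15): OR over index tuples summing to t of the AND of p^j_{i_j}."""
    R = len(parents[0]) - 1
    k = len(parents)
    s = []
    for t in range(R + 1):
        hit = any(
            all(parents[j][idx[j]] for j in range(k))
            for idx in product(range(t + 1), repeat=k)
            if sum(idx) == t
        )
        s.append(int(hit))
    return s

R, k = 4, 3
for errs in product(range(R + 2), repeat=k):
    assert recombine_k([encode(e, R) for e in errs]) == encode(sum(errs), R)
```

As in the binary case, the output encodes the sum of the parent errors, clipped to the range of interest.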
Similar to Theorem 6, which states that $s_t$ is a majority gate, the recombiner output $s_t$
of the k-ary tree is a generalized majority gate, as stated in the following theorem.
Theorem 14 (k-ary tree recombiner outputs) Function $s_t$ defined by (4.15) is
a generalized majority function.
Proof. Suppose there exist integers $d_j$, $0 \le j \le k - 1$, such that
\[
p^j_{i_j} =
\begin{cases}
0 & \text{if } 0 \le i_j < d_j \\
1 & \text{if } d_j \le i_j \le t.
\end{cases}
\tag{4.16}
\]
Then $s_t$ can be described as
\[
\begin{aligned}
s_t &= \sum \prod_{j=0}^{k-1} p^j_{i_j}, \quad \sum_{j=0}^{k-1} i_j = t \\
&= \sum \prod_{j=1}^{k-1} p^j_{i_j}, \quad \sum_{j=1}^{k-1} i_j = t - d_0 \\
&= \cdots \\
&= p^{k-1}_{i_{k-1}}, \quad i_{k-1} = t - \sum_{j=0}^{k-2} d_j.
\end{aligned}
\tag{4.17}
\]
So for $s_t$ to be 1, the last term $p^{k-1}_{t - \sum_{j=0}^{k-2} d_j}$ should also be 1, indicating that
$d_{k-1} \le t - \sum_{j=0}^{k-2} d_j \le t$, thus $\sum_{j=0}^{k-1} d_j \le t$. It also gives us that
$\sum_{j=0}^{k-1} (t + 1 - d_j) \ge (k - 1)t + k$.
Note that $t + 1 - d_j$ is the number of $p^j_{i_j}$'s that are 1. So the inequality simply
means that the total number of outputs from the parents that are 1 should be at least
$(k - 1)t + k$, showing that $s_t$ is a generalized majority function.
The decomposition of the recombiner output would be very similar to that of a
binary tree.
Similar to the binary tree, the k-ary tree decomposition also possesses the properties of
hardware reduction, including the block redundancy and the bound properties. We
now show that all the hardware reduction theorems that are developed for binary
tree decomposition are also applicable to the k-ary tree, giving it the same potential
of lowering hardware complexity when applied to specific arithmetic functions.
The fragment block redundancy property (Fragment Redundancy-I) and the
fragment bound property (Fragment Redundancy-II) are the same as shown in the
corresponding theorems in [44].
Note that when the output p has a block redundancy of g, the only outputs
one needs to compute are $p_{gt}$, $0 \le t < (R + 1)/g$. Each output $p_{gt}$ is a threshold
function with threshold $K_j - gt$, where $K_j$ is the sum of all the positive weights
of the fragment. Thus all the weights in the fragment as well as its threshold are
multiples of g and consequently $p_{gt}$ can be implemented as a threshold function with
weights $(w_i/g)$ and a threshold of $(K_j/g - t)$. Hence, the conditions of Theorem 8
not only imply fewer threshold functions, but also threshold functions with smaller
weights. To illustrate this theorem, consider a fragment with inputs $x_1, \ldots, x_4$ with
weights 2, −2, 4 and −6. Since the greatest common divisor of the weights is 2, the
outputs of the fragment can be shown to be
\[
p_{2t+1} = p_{2t} = TH(x_1, x_2, x_3, x_4;\ 1, -1, 2, -3;\ 3 - t).
\]
The second kind of redundancy shows up in the fragment output when the
weights of inputs to a fragment are small relative to the critical error. In this case,
the fragment contributes only small errors leading to the bounds on its output.
Theorem 9 is important in reducing the complexity of a fragment that has small
weights in relation to weights of the other fragments. To illustrate this theorem,
once again consider the fragment with weights 2, −2, 4 and −6. The output of this
fragment is bounded by 14, i.e., output $p_i = 1$ for all $i \ge 14$ irrespective of the
input. Thus, no matter how large R is, one need not compute these $p_i$s.
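Both fragment redundancies can be verified for the example fragment with weights 2, −2, 4 and −6. The sketch below checks the block redundancy of 2 against the reduced threshold function quoted above, and the bound of 14; the function names are hypothetical.

```python
from itertools import product

weights = [2, -2, 4, -6]                    # example fragment from the text
K_j = sum(w for w in weights if w > 0)      # sum of positive weights
bound = sum(abs(w) for w in weights)        # bound from Theorem 9

def p(t, x):
    """Fragment output p_t: threshold function with threshold K_j - t (Theorem 5)."""
    return int(sum(w * xi for w, xi in zip(weights, x)) >= K_j - t)

def p_reduced(t, x):
    """Reduced form TH(x1..x4; 1, -1, 2, -3; 3 - t) for outputs p_{2t} and p_{2t+1}."""
    return int(sum(w * xi for w, xi in zip([1, -1, 2, -3], x)) >= 3 - t)

for x in product((0, 1), repeat=4):
    for t in range(10):
        assert p(2 * t, x) == p(2 * t + 1, x) == p_reduced(t, x)   # block redundancy 2
    for i in range(bound, bound + 5):
        assert p(i, x) == 1                                        # bounded by 14
```

The bound is tight for this fragment: with only the two negatively weighted inputs set, the error is exactly 14.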
We generalize the hardware reduction theorems for recombiners to the k-ary tree
and prove them.
Theorem 15 (k-ary tree recombiner redundancy-I) If the input $p^0$ has a block
redundancy of g, then the recombiner outputs can be expressed as
\[
s_t = \sum_{i_0=0}^{\lfloor t/g \rfloor} p^0_{g i_0} \prod_{j=1}^{k-1} p^j_{i_j}, \quad \sum_{j=1}^{k-1} i_j = t - g i_0. \tag{4.18}
\]
Proof. As the input $p^0$ has a block redundancy of g, $p^0_{g i_0 + i'_0} = p^0_{g i_0}$, where
$0 \le i'_0 < g$. Thus
\[
s_t = \sum_{i_0=0}^{\lfloor t/g \rfloor} p^0_{g i_0} \prod_{j=1}^{k-1} p^j_{i_j}, \quad \sum_{j=1}^{k-1} i_j = t - g i_0 - i'_0. \tag{4.19}
\]
From Equations (4.8), the summation equals the term with the largest index, which
is $t - g i_0$, meaning $\sum_{j=1}^{k-1} i_j = t - g i_0$.
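Theorem 16 (k-ary tree recombiner redundancy-II) If the input $p^j_{i_j}$ has a
block redundancy of $g_j$ and $g = \gcd(g_j)$, $0 \le j < k - 1$, then $s_t$ has a block redundancy
of g.
Proof. We will show that the recombiner output $s_{gt+a}$, $0 \le a < g$, is independent of
a. Since the block redundancy $g_j$ is a multiple of g, from Theorem 15,
\[
\begin{aligned}
s_{gt+a} &= \sum_{i_0=0}^{t} p^0_{g i_0} \Big( \sum \prod_{j=1}^{k-1} p^j_{i_j} \Big), \quad \sum_{j=1}^{k-1} i_j = gt + a - g i_0 \\
&= \sum_{i_0=0}^{t} p^0_{g i_0} \Big( \sum_{i_1=0}^{t - i_0} \Big( \sum \prod_{j=2}^{k-1} p^j_{i_j} \Big) \Big), \quad \sum_{j=2}^{k-1} i_j = gt + a - g i_0 - g i_1 \\
&= \cdots \\
&= \sum_{i_0=0}^{t} p^0_{g i_0} \sum_{i_1=0}^{t - i_0} p^1_{i_1} \cdots \sum_{i_{k-2}=0}^{t - \sum_{j=0}^{k-3} i_j} p^{k-2}_{i_{k-2}}\, p^{k-1}_{i_{k-1}}, \quad i_{k-1} = gt + a - g \sum_{j=0}^{k-2} i_j.
\end{aligned}
\tag{4.20}
\]
Since the $p^{k-1}_{i_{k-1}}$'s have a block redundancy that is also a multiple of g,
\[
p^{k-1}_{g(t - \sum_{j=0}^{k-2} i_j) + a} = p^{k-1}_{g(t - \sum_{j=0}^{k-2} i_j)}. \tag{4.21}
\]
Combining Equations (4.20) and (4.21) shows that $s_{gt+a}$ is independent of a.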
Theorems 15 and 16 can be used to reduce the hardware by minimizing the
number of outputs of a recombiner that need to be calculated. Next we generalize
the bound property of recombiners to the k-ary tree.
Theorem 17 (k-ary tree recombiner redundancy-III) Let g denote the block
redundancy of a recombiner with inputs $p^j_{i_j}$ and let $p^j_{g i_j} = 1$ if $i_j \ge d_j$. Then the
recombiner output satisfies
\[
s_{gt} = 1, \quad \text{if } t \ge \sum_{j=0}^{k-1} d_j. \tag{4.22}
\]
Proof. The recombiner output can be expressed as
\[
\begin{aligned}
s_{gt} &= \sum \prod_{j=0}^{k-1} p^j_{i_j}, \quad \text{where } \sum_{j=0}^{k-1} i_j = gt \\
&= \sum \prod_{j=0}^{k-2} p^j_{i_j}\; p^{k-1}_{gt - \sum_{j=0}^{k-2} i_j}.
\end{aligned}
\tag{4.23}
\]
If $t \ge \sum_{j=0}^{k-1} d_j$, then $gt \ge g \sum_{j=0}^{k-1} d_j$. Let $i_j = g d_j$, $0 \le j < k - 1$, so that $p^j_{i_j} = 1$ for
$0 \le j < k - 1$, meaning the first $k - 1$ terms in the product term
$\prod_{j=0}^{k-2} p^j_{i_j}\, p^{k-1}_{gt - \sum_{j=0}^{k-2} i_j}$
are 1. Since $gt \ge g \sum_{j=0}^{k-1} d_j$, the index of the last term in the product satisfies
$gt - \sum_{j=0}^{k-2} i_j \ge g d_{k-1}$, indicating that the k-th term in the product,
$p^{k-1}_{gt - \sum_{j=0}^{k-2} g d_j}$, is 1. Then at least one
product term in the summation on the right side of Equation (4.23) is 1, meaning
$s_{gt} = 1$.
So far we have explored the implementation and hardware reduction properties
of the fragments and recombiners of a k-ary decomposition tree. We omit the gen-
eralization of the last theorem developed for the binary tree for simplicity since it
is simply a combination of the block redundancy and the bound property of the
recombiners. With the extended theorems developed for the k-ary tree decomposi-
tion structure, we are ready to explore their applications for key arithmetic digital
circuits.
4.3 Comparison Function Decomposition
Two N-bit numbers $x = \langle x_{N-1}, \ldots, x_2, x_1, x_0 \rangle$ and $y = \langle y_{N-1}, \ldots, y_2, y_1, y_0 \rangle$ may be
compared by the threshold function
\[
TH(\mathbf{x}, \mathbf{y};\ \mathbf{w}, -\mathbf{w};\ 0), \tag{4.24}
\]
where vector $\mathbf{w} = \langle 2^{N-1}, \ldots, 2^2, 2, 1 \rangle$. The output of this threshold function is 1 if
$x \ge y$.
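The comparison function (4.24) can be checked exhaustively for a small word length. This is a minimal sketch; the function name is hypothetical and N = 4 is chosen only to keep the enumeration short.

```python
def compare_threshold(x, y, N):
    """TH(x, y; w, -w; 0) with w = <2^(N-1), ..., 2, 1>: returns 1 iff x >= y."""
    w = [1 << i for i in range(N)]              # weights, least significant bit first
    xb = [(x >> i) & 1 for i in range(N)]       # bits of x
    yb = [(y >> i) & 1 for i in range(N)]       # bits of y
    total = sum(wi * (a - b) for wi, a, b in zip(w, xb, yb))
    return int(total >= 0)

N = 4
for x in range(1 << N):
    for y in range(1 << N):
        assert compare_threshold(x, y, N) == int(x >= y)
```

The single gate needs 2N inputs and weights up to 2^(N-1), which is what motivates the bounded fan-in decomposition.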
One can see that the comparison threshold function (4.24) is not a member of
$\widehat{LT}_1$ since its weights increase exponentially with n. In fact, it is often used to
show that $\widehat{LT}_1$ is a proper subset of $LT_1$. This function has attracted quite a bit of
attention [39,45] because the number of inputs and the weights in this function get
rather large with an increase in n. To compare two 32-bit numbers as in (4.24), one
needs a threshold function with 64 inputs and weights as large as $2^{31}$. [42] shows
that the methods allow one to decompose (4.24) in a variety of ways. For example,
this 64-input threshold function can be decomposed into 16 identical fragments and
15 identical recombiners arranged in a depth 5 network. Each fragment of this
network will have two 4-input threshold functions with a maximum weight of 2.
Each recombiner will also be made up of two 4-input threshold functions with a
maximum weight of 2.
While the binary tree serves well for comparators whose sizes are powers
of 2, the k-ary tree gives a more general structure for decomposing the comparison
function into a bounded fan-in threshold network. For example, Figure 4.3 shows
the decomposition of a comparison of two 18-bit numbers with a fan-in bounded by 4.
Although it is more practical in real computers and digital circuits to have
structures adapted to $N = 2^n$-bit numbers, the k-ary structure provides an alternative
decomposition that applies better to specific arithmetic functions. Here we show that
the k-ary tree decomposes the comparator into bounded fan-in threshold gate networks with
lower hardware complexity and comparable depth.
[Figure 4.3 diagram: fragments frag. 0 (inputs $a_{1:0}$, $b_{1:0}$) through frag. 8 (inputs $a_{17:16}$, $b_{17:16}$), each built from gates A and B, followed by two levels of recombiners built from gates C.]
Figure 4.3: A ternary tree decomposition of a comparator of size 18 with fan-in
bounded by 4.
The gates shown in the figure are defined as follows.
\[
A : TH(a_{4j+1}, a_{4j}, b_{4j+1}, b_{4j};\ 2, 1, -2, -1;\ 0), \tag{4.25}
\]
\[
B : TH(a_{4j+1}, a_{4j}, b_{4j+1}, b_{4j};\ 2, 1, -2, -1;\ 1), \tag{4.26}
\]
\[
C : TH(B_2, TH(A_2, B_1, A_1, A_0;\ 3, 2, 1, 1;\ 5);\ 1, 1;\ 1). \tag{4.27}
\]
The function C is a serial expansion expressed as $B_2 + A_2(B_1 + A_1 A_0)$. Serial
expansions can be easily decomposed into threshold functions with bounded fan-in.
In this case, there are 5 literals in the expression of C, which exceeds the fan-in
bound of 4. Thus it is implemented with 2 threshold gates in series.
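That the two threshold gates in series in (4.27) realize the serial expansion B2 + A2(B1 + A1A0) can be confirmed by enumerating all 32 input combinations. The helper names in this sketch are hypothetical.

```python
from itertools import product

def TH(inputs, weights, T):
    """Generic threshold gate: 1 if the weighted sum of the inputs reaches T, else 0."""
    return int(sum(w * v for w, v in zip(weights, inputs)) >= T)

def gate_C(B2, A2, B1, A1, A0):
    """Gate C of (4.27): an inner 4-input gate feeding an outer 2-input gate."""
    inner = TH((A2, B1, A1, A0), (3, 2, 1, 1), 5)   # realizes A2 (B1 + A1 A0)
    return TH((B2, inner), (1, 1), 1)               # realizes B2 + inner

for B2, A2, B1, A1, A0 in product((0, 1), repeat=5):
    expected = int(B2 or (A2 and (B1 or (A1 and A0))))
    assert gate_C(B2, A2, B1, A1, A0) == expected
```

Both gates stay within the fan-in bound of 4, at the cost of one extra level of depth.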
The resultant network has one level of fragments, each composed of two threshold
gates, followed by 2 levels of recombiners. Due to the structure of the weights of the input
bits, we show that the number of outputs from each recombiner is reduced to at
most 2. The correctness of the decomposition follows from the following theorem.
We now describe the implementation of a comparator of two N-bit numbers using a k-ary tree
with bounded fan-in of 2M. We define $G_l = 2^{M k^{l-1}}$ for convenience. $G_l$ has the
following property, which is used later.
\[
G_l = (G_{l-1})^k. \tag{4.28}
\]
Also for convenience, we define $q_i = p_{gi}$ when a recombiner has a gcd of g and
output $p_i$.
Theorem 18 To implement a comparison of two N-bit numbers using a k-ary tree
with bounded fan-in of 2M, the only outputs required from any recombiner are $p_{g t_l}$
and $p_{g t'_l}$, where $t_l = G_{l+1} - 1$, $t'_l = G_{l+1} - 2$, and g is the block redundancy of that
recombiner.
Proof. We prove that $q_{t'_l}$ and $q_{t_l}$ can be expressed as
\[
q_{t'_l} = \sum_{m=0}^{k-1} q^m_{t'_{l-1}} \prod_{n=m+1}^{k-1} q^n_{t_{l-1}}, \tag{4.29}
\]
\[
q_{t_l} = \sum_{m=1}^{k-1} q^m_{t'_{l-1}} \prod_{n=m+1}^{k-1} q^n_{t_{l-1}} + \prod_{n=0}^{k-1} q^n_{t_{l-1}}. \tag{4.30}
\]
Here $q_{t'_{l-1}}$ and $q_{t_{l-1}}$ correspond to the outputs from the $(l - 1)$-th level. We first prove
(4.29).
The proof of (4.29) can be separated into two parts. First we will prove that
if the right-hand side of (4.29) is 0, then so is $q_{t'_l}$. Note that, when the output is 0,
the minimum error in the i-th recombiner in level l occurs when
$q^m_{t'_{l-1}} = 0$, $m = 0, 1, \cdots, k - 1$. Thus the total error in the i-th recombiner in level l is
at least the sum over m of $(t'_{l-1} + 1)$ times the gcd of the $(ki + m)$-th
recombiner in level $l - 1$. So the total error satisfies
\[
\begin{aligned}
\text{total error} &\ge \sum_{m=0}^{k-1} (G_l - 1)\, G_l^{ki+m} \\
&= G_l^{ki} (G_{l+1} - 1) \\
&> G_l^{ki} (G_{l+1} - 2),
\end{aligned}
\tag{4.31}
\]
and the last expression in (4.31) is the product of $t'_l$ and the gcd of the i-th recombiner in level l. Therefore,
as the error of this recombiner exceeds the critical error, the output $q_{t'_l} = 0$.
Then we prove that if the right-hand side of (4.29) is 1, then so is $q_{t'_l}$. We can prove
that by assuming any term of the summation is equal to 1, that is,
\[
q^m_{t'_{l-1}} \prod_{n=m+1}^{k-1} q^n_{t_{l-1}} = 1. \tag{4.32}
\]
So the total error in the i-th recombiner in level l satisfies
\[
\begin{aligned}
\text{total error} &\le t'_{l-1}\, G_l^{ki+m} + t_{l-1} \sum_{n=m+1}^{k-1} G_l^{ki+n} \\
&= (G_l - 2)\, G_l^{ki+m} + G_l^{ki} (G_l^k - G_l^{m+1}) \\
&= G_l^{ki} (G_l^k - 2 G_l^m).
\end{aligned}
\tag{4.33}
\]
And (4.33) is smaller than or equal to the total critical error of the i-th recombiner
in level l. Therefore, $q_{t'_l} = 1$.
The proof of (4.30) is similar and can be divided into three parts. First we
prove that if $\sum_{m=1}^{k-1} q^m_{t'_{l-1}} \prod_{n=m+1}^{k-1} q^n_{t_{l-1}}$ and $\prod_{n=0}^{k-1} q^n_{t_{l-1}}$ are both 0, then so is $q_{t_l}$. The
total error in the i-th recombiner in level l is counted to be
\[
\begin{aligned}
\text{total error} &\ge \sum_{m=1}^{k-1} (t'_{l-1} + 1)\, G_l^{ki+m} + (t_{l-1} + 1)\, G_l^{ki+n} \\
&= G_l^{ki} (G_l^{k-1} - 1)\, G_l + G_l^{ki+n+1} \\
&= G_l^{ki} (G_{l+1} - 1) + G_l^{ki+n+1} \\
&> G_l^{ki}\, t_l,
\end{aligned}
\tag{4.34}
\]
which is the critical error of that recombiner. Thus $q_{t_l} = 0$.
Then we prove that if $\sum_{m=1}^{k-1} q^m_{t'_{l-1}} \prod_{n=m+1}^{k-1} q^n_{t_{l-1}} = 1$, then $q_{t_l} = 1$. As we can also prove this
by assuming any term in the summation to be 1, this part of the proof is similar to the
second part of the proof of (4.29).
Last we prove that if $\prod_{n=0}^{k-1} q^n_{t_{l-1}} = 1$, then $q_{t_l} = 1$. As each term of the product
should be 1, the total error in the i-th recombiner in level l satisfies
\[
\begin{aligned}
\text{total error} &\le t_{l-1} \sum_{n=0}^{k-1} G_l^{ki+n} \\
&= G_l^{ki} (G_l^k - 1) \\
&= G_l^{ki}\, t_l.
\end{aligned}
\tag{4.35}
\]
Thus we get $q_{t_l} = 1$.
The depth and the hardware complexity of the comparison decomposed with a k-ary tree are given in the following theorem. Without loss of generality, we analyze the depth and hardware complexity for a complete tree only, namely, for the comparison of two $N$-bit numbers, $k$ is such that $N = Mk^n$, where the fan-in bound is $2M$.
Theorem 19 A comparison function of two $N = Mk^n$-bit numbers can be decomposed into a network of threshold gates with fan-in bounded by $2M$ with a k-ary tree structure. The network has a depth of $1 + \log_k(N/M)\,\lceil (2k-2)/(2M-1) \rceil$ and a gate complexity of $2N/M - 1 + \lceil (2k-2)/(2M-1) \rceil \bigl( 2(N-M)/(M(k-1)) - \log_k(N/M) \bigr)$. The interconnect complexity is $4N - 2M + \bigl( 2k - 2 + \lceil (2k-2)/(2M-1) \rceil \bigr) \bigl( 2(N-M)/(M(k-1)) - \log_k(N/M) \bigr)$.
Proof. From Theorem 18, the recombiners of a k-ary tree comparator can be expressed with a serial expansion in the same form described in (P5) in Chapter 2. Along with (P7) we get that, since the number of literals in the expression of the recombiner output is $2k-1$, each recombiner output needs $\lceil (2k-2)/(2M-1) \rceil$ gates to implement. The total number of inputs to these gates is then $2k - 2 + \lceil (2k-2)/(2M-1) \rceil$. With this, the depth, number of gates and interconnect complexity of a k-ary tree comparator can be obtained by direct calculation.
The k-ary tree comparator has one level of fragments and $\log_k(N/M)$ levels of recombiners. Each recombiner level has a depth of $\lceil (2k-2)/(2M-1) \rceil$ gates since there are $2k-1$ literals. Thus the total depth is $1 + \log_k(N/M)\,\lceil (2k-2)/(2M-1) \rceil$.
The architecture has $N/M$ fragments, each of which has two threshold gates according to Theorem 18 except for the least significant one, which has only one. Thus the number of gates in the fragments is $2N/M - 1$. The total number of inputs to these fragment gates is thus $2M(2N/M - 1)$ since the fan-in bound is $2M$.
For each level $l$ of the recombiners, where $1 \le l \le n$, there are $N/(Mk^l)$ recombiners, each of which has two outputs except for the least significant one. Thus the total number of outputs required from the $l$-th level recombiners is $2N/(Mk^l) - 1$. Each of these outputs has $2k-1$ literals and thus is implemented with $\lceil (2k-2)/(2M-1) \rceil$ gates. The interconnect complexity of each recombiner output is $2k - 2 + \lceil (2k-2)/(2M-1) \rceil$. Summing up all $n$ levels of recombiners gives the total number of gates and the total number of inputs as provided in Theorem 19.
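The counting argument of this proof can be checked numerically. The following Python sketch is illustrative only (the function name `kary_comparator_costs` is ours, not part of the original work); it sums the fragment and recombiner contributions level by level, assuming a complete tree with $N = Mk^n$.

```python
def kary_comparator_costs(N, M, k):
    """Return (depth, gates, interconnect) of the k-ary tree comparator.

    Follows the counting in the proof of Theorem 19 and assumes a complete
    tree, i.e. N = M * k**n for some integer n >= 0.
    """
    if N % M != 0:
        raise ValueError("M must divide N")
    fragments = N // M
    n, size = 0, 1
    while size < fragments:          # find n with k**n == N/M
        size *= k
        n += 1
    if size != fragments:
        raise ValueError("N must equal M * k**n")

    # Gates per recombiner output: ceil((2k-2)/(2M-1)), in integer arithmetic.
    gates_per_output = -(-(2 * k - 2) // (2 * M - 1))
    # Recombiner outputs over all levels: sum over l of (2N/(M k^l) - 1).
    outputs = sum((2 * fragments) // k**l - 1 for l in range(1, n + 1))

    depth = 1 + n * gates_per_output
    gates = (2 * fragments - 1) + gates_per_output * outputs
    interconnect = (2 * M * (2 * fragments - 1)
                    + (2 * k - 2 + gates_per_output) * outputs)
    return depth, gates, interconnect
```

For instance, `kary_comparator_costs(8, 2, 2)` gives the depth, gate count and interconnect count for an 8-bit comparison with a binary tree and fan-in bound 4, which can be compared against the first row of Table 4.1.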
4.4 Conclusion
This chapter has focused on developing a k-ary tree architecture to decompose any
threshold function in $\widehat{LT}_1$ into a network of threshold functions with bounded
fan-in. This work uses the same concepts of errors and critical error as in [44], but we
extend the binary tree structure developed in [44] to a k-ary tree structure, where
k is a parameter of the network, which makes the design more flexible for different
fan-in bound values. By choosing the right value of k, one can get resultant circuits
with lower depth and lower hardware complexity.
With the new architecture, the implementation of the recombiners becomes complex. Thus, we also extend the hardware complexity reduction properties of fragment and recombiner outputs from the binary tree structure to the k-ary tree structure. Most of these properties can be extended in a similar manner and are useful in developing our application circuits.
We show the decomposition of a comparison function using a k-ary tree structure. Due to the relation among the weights of the inputs of the function, we eliminate the complexity of the recombiner implementation by using the extended hardware complexity reduction theorems. As a result, the comparison function has an $O(\log N)$ depth and an $O(N)$ gate count and interconnect complexity.
Tables 4.1 and 4.2 compare the depth, gate complexity and interconnect complexity of the comparison function implemented with a binary (k = 2) tree and a quaternary (k = 4) tree, with fan-in bounded by 4 (M = 2) and 8 (M = 4), respectively.

Table 4.1: Comparison of binary tree comparator and quaternary tree comparator with fan-in bound of 4.

                      k = 2                              k = 4
   N      delay    gate    interconnect      delay    gate    interconnect
      8      3       11         40              3        9         36
     32      5       57        202              5       47        188
    128      7      247        868              7      205        820
    512      9     1013       3550              9      843       3372
   2048     11     4083      14296             11     3401      13604

Table 4.2: Comparison of binary tree comparator and quaternary tree comparator with fan-in bound of 8.

                      k = 2                              k = 4
   N      delay    gate    interconnect      delay    gate    interconnect
     16      3       11         68              2        8         63
     64      5       57        326              3       39        304
    256      7      247       1376              4      166       1289
   1024      9     1013       5594              5      677       5250
   4096     11     4083      22484              6     2724      21115

One can see that when the fan-in bound is 4, the depth of both comparators is the same, while the quaternary tree comparator has slightly lower gate and interconnect complexity. When the fan-in bound is 8, the interconnect complexity is comparable, but the quaternary tree has much lower depth and gate complexity. As N grows, the quaternary tree has roughly half the depth of the binary tree implementation, and its gate complexity is lower by around 30%.
Since the binary tree has only 3-input majority gates in the recombiner levels, when the fan-in bound is small (e.g., 4), the 3 inputs fit into one threshold gate well. However, the binary tree fails to fully harness a larger fan-in bound of the threshold gates. For example, when the fan-in is bounded by 8, it wastes more than half of the inputs to each threshold gate, so the gate count of the binary tree implementation is large. The k-ary tree structure, in contrast, allows one to choose the number of inputs to each recombiner gate by changing the value of k. Thus one can choose the value of k that makes full use of the fan-in of each threshold gate. Recall that the number of literals of each recombiner gate is $2k-1$; thus for a fan-in equal to 8, k = 4 is the best choice of k to fit all inputs into one gate. In fact, it is always good to choose k equal to M so that each recombiner output is implemented with only one gate, thus taking full advantage of large fan-in bounds.
Chapter 5
Systolic Implementation of Threshold Function Decomposition
5.1 Introduction
5.1.1 Systolic Architecture
Systolic architectures have been developed for many hardware implementations because of the following advantages:
• local communications
• regularity
• cost efficiency.
Systolic architectures are well developed in signal processing, including FFT implementation [46], image processing [47], and convolution [48]. We take convolution as an example to illustrate the functioning of systolic architectures.
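To make the convolution example concrete, the following Python sketch simulates one common textbook arrangement of a systolic convolution array. It is our own illustration and not necessarily the design of [48]: each cell keeps one weight, input samples advance one cell every two cycles, and partial sums advance one cell per cycle, so every cell only talks to its neighbour. The function name `systolic_convolve` and the register layout are assumptions made for this sketch.

```python
def systolic_convolve(x, w):
    """Simulate a linear systolic array computing the convolution of x and w.

    Assumed arrangement: cell j holds weight w[j]; the x stream moves through
    two registers per cell, the partial-sum stream through one register per cell.
    """
    k = len(w)
    x_line = [0] * (2 * k - 1)   # x_line[d] is the sample that entered d cycles ago
    y_reg = [0] * k              # y_reg[j] is the partial sum latched after cell j
    out = []
    for t in range(len(x) + 2 * k - 2):
        x_in = x[t] if t < len(x) else 0      # flush the array with zeros at the end
        x_line = [x_in] + x_line[:-1]
        new_y = [0] * k
        for j in range(k):
            y_in = y_reg[j - 1] if j > 0 else 0
            # Cell j sees the sample delayed by 2*j cycles and its left neighbour's sum.
            new_y[j] = y_in + w[j] * x_line[2 * j]
        y_reg = new_y
        if t >= k - 1:                         # the first k-1 cycles only fill the pipeline
            out.append(y_reg[k - 1])
    return out
```

Every output leaves the last cell after a fixed pipeline latency of k - 1 cycles, and all communication is between adjacent cells, which is the locality and regularity listed above.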
In this chapter, we develop a systolic architecture for implementing threshold logic using threshold gates with bounded fan-in. Systolic architecture is well suited for nanoelectronics because of two major features:
• localized connections, implying less complex interconnects, and
• bounded fan-in.
However, the 4-phase clocking scheme does not fit the systolic architecture. As
mentioned before, the 4-phase clock is perfect for pipelined architecture where the
data flows in one direction. For example, a combinational logic circuit can be im-
plemented as a tree such that the computation only goes in one direction. However,
for more complex architectures such as a systolic architecture, where the data flow
is in both directions, a new clocking scheme is required, which will be described in
the subsequent sections.
5.1.2 Decomposition of Threshold Functions
As mentioned before, it is practically important to restrict the fan-in of threshold networks to a bounded number. We therefore explore a decomposition strategy for threshold networks.
In [44], a strategy of decomposing threshold functions into a network of bounded
fan-in threshold functions is proposed. Here we use the same definition and imple-
mentation of the outputs from the fragments and the recombiners as in [44]. Also
some of the hardware complexity reduction theorems developed in [44] are useful to
our systolic architecture implementation. Thus we restate these theorems here.
(Fragment outputs) Each output $p_t$ of fragment $j$ is a threshold function of inputs $x_i$ with weights $w_i$, $i \in S_j$, and a threshold of $K_j - t$.
(Fragment Redundancy-II) Let $w_i$, $i \in S_j$, denote the weights of inputs to the $j$th fragment. Then the output of the fragment is bounded by $\sum_{i \in S_j} |w_i|$.
(Recombiner outputs) Function $s_t$ defined by (4.9) is a majority function.
In [44], a binary tree is used to recombine the signals carrying the error. In our
work, we use a different recombining approach, which enables us to implement it
using a novel systolic system.
5.2 Systolic System for Nanotechnology
5.2.1 Clocking Scheme
First we propose a new clocking scheme that is different from the classic four-phase
clock. Fig.5.1 shows the diagram of the new six-phase clocking scheme.
[Figure: waveforms of clk1 and clk2 versus time]
Figure 5.1: A six-phase clocking scheme with the evaluate (E), hold (H), reset (R) and wait (W) states.
If two threshold gates are cascaded, this new clocking scheme guarantees that when one of them is in its evaluation phase, the other one is providing stable data, and vice versa. Each clock has 3 hold phases so that the gate driven by clk1 holds its data long enough for the gate driven by clk2 to evaluate.
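The stability requirement can be explored with a small Python sketch. It assumes that one clock period consists of the six phases in the order E, H, H, H, R, W (one evaluate, three hold, one reset and one wait phase) and that clk2 is a copy of clk1 delayed by a whole number of phases; both the phase order and the helper names are assumptions for illustration only.

```python
PHASES = ["E", "H", "H", "H", "R", "W"]   # assumed order of the six phases


def phase(t, delay=0):
    """Phase of a clock at time step t when it is delayed by `delay` phases."""
    return PHASES[(t - delay) % len(PHASES)]


def safe_delays():
    """Delays of clk2 for which, whenever one clock evaluates, the other holds."""
    ok = []
    for d in range(len(PHASES)):
        good = True
        for t in range(len(PHASES)):
            a, b = phase(t), phase(t, d)
            if (a == "E" and b != "H") or (b == "E" and a != "H"):
                good = False
        if good:
            ok.append(d)
    return ok
```

Under these assumptions, `safe_delays()` lists the phase shifts for which two cascaded gates always see stable data from each other, which is the property a bidirectional (systolic) data flow needs.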
5.2.2 General Scheme of Decomposing a Threshold Func-
tion with Systolic System
Any large threshold function can be implemented as a tree of threshold functions as in Fig. 5.2. $N$ and $M$ are the number of inputs and the fan-in bound of the circuit, respectively. $g_R$ is the output of the system and $R$ denotes the critical error.
As shown in the figure, we first recombine the two groups of outputs from two fragments and then add one group at a time.
As described in Theorem 5, for Fragment $i$ in Fig. 5.2, there are $(R+1)$ threshold gates. The threshold gate that gives the output $p^i_j$ of the fragment is described as
\[
p^i_j = TH\Bigl(x_{iM}, x_{iM+1}, \cdots, x_{(i+1)M-1};\; w_{iM}, w_{iM+1}, \cdots, w_{(i+1)M-1};\; \sum_{k=iM,\, w_k>0}^{(i+1)M-1} w_k - j\Bigr). \tag{5.1}
\]
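The role of the fragment outputs in (5.1) can be illustrated with a short Python sketch. It assumes that a threshold gate outputs 1 when the weighted sum of its inputs is at least the threshold, and that the critical error of the whole function equals the sum of its positive weights minus its threshold; both are our reading of the surrounding text, and the function names are hypothetical.

```python
from itertools import product


def TH(x, w, T):
    """Threshold gate: 1 if the weighted sum of the inputs reaches T (assumed convention)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= T else 0


def fragment_outputs(x, w, i, M, R):
    """Outputs p^i_0 .. p^i_R of fragment i according to (5.1)."""
    xs, ws = x[i * M:(i + 1) * M], w[i * M:(i + 1) * M]
    K = sum(wk for wk in ws if wk > 0)
    return [TH(xs, ws, K - j) for j in range(R + 1)]


def fragment_error(x, w, i, M):
    """Error of fragment i: distance of its weighted sum from its maximum K."""
    xs, ws = x[i * M:(i + 1) * M], w[i * M:(i + 1) * M]
    K = sum(wk for wk in ws if wk > 0)
    return K - sum(wk * xk for wk, xk in zip(ws, xs))


def decomposed_output(x, w, M, T):
    """Output obtained by comparing the total fragment error with the critical error."""
    R = sum(wk for wk in w if wk > 0) - T      # assumed definition of the critical error
    total = sum(fragment_error(x, w, i, M) for i in range(len(x) // M))
    return 1 if total <= R else 0


# Example: 8-input majority (T = 5) split into fragments of M = 4 inputs.
w8 = [1] * 8
agree = all(decomposed_output(list(x), w8, 4, 5) == TH(x, w8, 5)
            for x in product([0, 1], repeat=8))
```

In this sketch, output $p^i_j$ is 1 exactly when the error of fragment $i$ is at most $j$, so the group of outputs encodes the fragment error in a form that the recombiners can add up.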
Figure 5.2: General scheme of decomposing a threshold function by serially combining the fragment outputs.
As we can see from Fig. 5.2, the outputs of the fragments are parallel. Since the systolic system implementation requires one group of the inputs to be parallel while the other is serial, we use the following scheme to convert one group of the outputs from parallel to serial.
In Fig. 5.3, $z$ works as an outside control signal determining whether the gates load or shift. When $z = 0$ at the first clocking cycle in which gates $C_i$ are triggered, the gates $C_i$ are loaded with $p^0_i$ from the fragments. After that, when gates $C'_i$ are clocked, they are loaded with $p^0_i$. Then $z$ is set to 1 so that the convertor performs the shifting operation. Buffers $B_i$ are used to temporarily store the data $p^0_i$ so that when gates $C'_i$ are triggered in the shifting operation, they get the data shifted from the left gate, thus forming a serial sequence of $p^0_i$.
By adding the error from one fragment to the total each time, one can get the total error of the inputs and compare it with the critical error of the threshold function.
Figure 5.3: Parallel to serial convertor.
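Thus $g_R$ can be obtained as
\[
g_R = \sum_{i_{N/M-1}} \Bigl( \sum_{i_{N/M-2}} \cdots \Bigl( \sum_{i_2} \Bigl( \sum_{i_1} \sum_{i_0} p^0_{i_0}\, p^1_{i_1} \Bigr) p^2_{i_2} \Bigr) \cdots\, p^{N/M-2}_{i_{N/M-2}} \Bigr) p^{N/M-1}_{i_{N/M-1}}. \tag{5.2}
\]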
We implement the calculation of $g_R$ using the novel systolic system. Each systolic stage is used to add the error of one fragment to the total error. Thus the recombining part of the circuit is implemented as shown in Fig. 5.4.
To validate the functionality of the systolic implementation of the recombining circuit, without loss of generality, we now show that the first recombining stage, which recombines $p^0_{i_0}$ and $p^1_{i_1}$, can be implemented using a systolic array as shown in Fig. 5.5.
We apply the six-phase clock and its delayed version to alternate gates in Fig. 5.5. Since all the loops in the systolic array shown here have an even length, the data applied to each gate input stays stable while that gate is computing.
We use induction to prove that this systolic array outputs the recombined data of $p^0_{i_0}$ and $p^1_{i_1}$. We only show the case when $R$ is even. The verification with odd $R$ is similar.
It is clear from Fig. 5.5 that when the corresponding gate is evaluated, we have the following.
\[
x_{2i+1} \leftarrow x_{2i}, \qquad x_{2i} \leftarrow x_{2i-1},
\]
Figure 5.4: Systolic implementation of the recombiners in Fig. 5.2.
Figure 5.5: The first recombining stage.
\[
y_{2i} \leftarrow x_{2i}\,p^1_{2i} + y_{2i+1}, \qquad
y_{2i+1} \leftarrow x_{2i+1}\,p^1_{2i+1} + y_{2i+2}.
\]
With these, we now show that at cycle $t$, after $y_{2i+1}$ and $x_{2i}$ are evaluated, the data held in each gate is
\[
x_{2i} = p^0_{t-i}, \qquad x_{2i+1} = p^0_{t-i-1}, \qquad
y_{2i+1} = \sum_{j=0}^{t-1-i} p^0_j\, p^1_{t-j+i}, \qquad
y_{2i} = \sum_{j=0}^{t-1-i} p^0_j\, p^1_{t-j+i-1}.
\]
As the two clocks are applied to gates alternately, the data of each gate in cycle $t$ and the following cycle $t+1$ can be evaluated as in Table 5.1.

Table 5.1: Data of gates at cycle $t$ and $t+1$.

At cycle $t$, after clocking $x_{2i}$ and $y_{2i+1}$:
  $x_{2i} = p^0_{t-i}$, $x_{2i+1} = p^0_{t-i-1}$,
  $y_{2i} = \sum_{j=0}^{t-1-i} p^0_j p^1_{t-j+i-1}$,
  $y_{2i+1} = \sum_{j=0}^{t-1-i} p^0_j p^1_{t-j+i}$.
At cycle $t$, after clocking $x_{2i+1}$ and $y_{2i}$:
  $x_{2i} = p^0_{t-i}$, $x_{2i+1} = p^0_{t-i}$,
  $y_{2i} = p^0_{t-i} p^1_{2i} + \sum_{j=0}^{t-1-i} p^0_j p^1_{t-j+i} = \sum_{j=0}^{t-i} p^0_j p^1_{t-j+i}$,
  $y_{2i+1} = \sum_{j=0}^{t-1-i} p^0_j p^1_{t-j+i}$.
At cycle $t+1$, after clocking $x_{2i}$ and $y_{2i+1}$:
  $x_{2i} = p^0_{t-i+1}$, $x_{2i+1} = p^0_{t-i}$,
  $y_{2i} = \sum_{j=0}^{t-i} p^0_j p^1_{t-j+i}$,
  $y_{2i+1} = p^0_{t-i} p^1_{2i+1} + \sum_{j=0}^{t-1-i} p^0_j p^1_{t-j+i+1} = \sum_{j=0}^{t-i} p^0_j p^1_{t-j+i+1}$.
At cycle $t+1$, after clocking $x_{2i+1}$ and $y_{2i}$:
  $x_{2i} = p^0_{t-i+1}$, $x_{2i+1} = p^0_{t-i+1}$,
  $y_{2i} = p^0_{t-i+1} p^1_{2i} + \sum_{j=0}^{t-i} p^0_j p^1_{t-j+i+1} = \sum_{j=0}^{t-i+1} p^0_j p^1_{t-j+i+1}$,
  $y_{2i+1} = \sum_{j=0}^{t-i} p^0_j p^1_{t-j+i+1}$.
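The induction can also be checked by simulating the update rules directly. The Python sketch below is an illustration under our own assumptions: the serial input feeds $x_0$, the rightmost $y$ input is tied to 0, all cells start at 0, and ordinary integer sums and products stand in for the logical operations so that the closed forms can be compared term by term. The function name `simulate_stage` is hypothetical.

```python
def simulate_stage(p0, p1, cycles):
    """Simulate one recombining stage with two alternating clocks.

    p0 is fed serially (one value per cycle), p1[i] is the parallel input of
    cell i. Returns the value held in y_0 at the end of every cycle.
    """
    L = len(p1) + (len(p1) % 2)            # use an even number of cells
    p1 = list(p1) + [0] * (L - len(p1))
    x = [0] * L
    y = [0] * (L + 1)                      # y[L] is the constant-0 boundary input
    outputs = []
    for t in range(cycles):
        serial_in = p0[t] if t < len(p0) else 0
        # First clock: even x cells and odd y cells evaluate.
        for i in range(0, L, 2):
            x[i] = x[i - 1] if i > 0 else serial_in
        for i in range(1, L, 2):
            y[i] = x[i] * p1[i] + y[i + 1]
        # Second clock: odd x cells and even y cells evaluate.
        for i in range(1, L, 2):
            x[i] = x[i - 1]
        for i in range(0, L, 2):
            y[i] = x[i] * p1[i] + y[i + 1]
        outputs.append(y[0])
    return outputs
```

Setting $i = 0$ in the second row of Table 5.1 gives $y_0 = \sum_{j=0}^{t} p^0_j p^1_{t-j}$ at the end of cycle $t$, so under these assumptions the sequence returned by `simulate_stage` should coincide with that sum for every cycle.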
5.2.3 The Whole System
To show how the whole system is connected, we show the fragments, parallel to serial
convertor and the first systolic stage in Fig. 5.6. The whole system is connected in
the same manner.
Before input x is applied to the fragments, control signal z is held high. Because
of the 0 applied, the output of every gate in the parallel to serial convertor and
systolic recombiner is reset. This works as initialization of the whole system before
the input comes in.
Figure 5.6: The first stage of the complete system.
When the input $x$ is applied, as soon as the gates of the fragments have steady output $p^j_i$, $0 \le i < M$, $0 \le j < N/M$, the clocking of those gates is held high to make sure $p^j_i$ is available throughout the whole computational period.
The complexity and the delay analysis of the decomposed network is described
in the following theorem.
Theorem 20 An $N$-bit threshold function with critical error $R$ can be decomposed into a network of threshold functions with fan-in $\le M$ for any $M$. This network has a size $3NR/M + 5N/M + R - 1$ and a time delay $N/M + R + 1$.
Proof. From the description of the fragments, the parallel to serial convertor and the systolic recombiner, the number of gates needed for each part is $(R+1)N/M$, $3(R+1)$, and $2(R+2)(N/M - 1)$, respectively. Summing up the three gives the total number of gates as stated. Note that with the new clocking scheme, two consecutive gates form a clocking cycle, and the unit of time delay is the clocking cycle. Thus the fragments and the parallel to serial convertor together have a delay of 2; adding the delay of the recombiner, which is $(N/M - 1) + R$, gives the total system delay.
5.3 Applications and Examples
We provide two example applications using the systolic implementation of decom-
position.
It is interesting to note that it is possible to implement any generalized majority function with 2n inputs using the same circuit, regardless of the threshold of the generalized majority function. Similarly, exactly the same systolic architecture can be employed in every 2n bit approximate pattern matching application, irrespective of the number of errors that is to be tolerated.
5.3.1 Majority Gate
Theorem 21 The $N$-bit generalized majority function with any value of threshold can be decomposed into a network of threshold functions with fan-in $\le M$ for any $M$. This network has a size $3N + 2N/M + M - 2$ if $M \le R + 1$, or $3N^2/M - 3TN/M + 5N/M + N - T - 1$ if $M > R + 1$, and a time delay $N + N/M - T + 1$, where $T$ is the threshold of the generalized majority gate and $R$ is the critical error.
Proof. Recall from Theorem 9 that, as the weight of each input bit is 1, the bound of each fragment gate is $M$. Thus the number of gates in each fragment is either $M$ or $R + 1$, whichever is less. When $M > R + 1$, the number of gates in each fragment is bounded by $R + 1$, so the hardware complexity and the time delay are exactly as stated in Theorem 20. When $M \le R + 1$, the number of fragment gates is bounded by $M$. Substituting $R + 1$ with $M$ gives the complexity and delay of this case. Note that for the generalized majority gate, $R = N - T$.
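The expressions of Theorems 20 and 21 are easy to evaluate numerically. The Python sketch below is illustrative only (the function names are ours); it builds the size of Theorem 20 from the three component counts given in its proof and specializes it to the generalized majority gate, assuming that $M$ divides $N$.

```python
def systolic_size(N, M, R):
    """Gate count of Theorem 20: fragments + convertor + systolic recombiner."""
    fragments = (R + 1) * (N // M)
    convertor = 3 * (R + 1)
    recombiner = 2 * (R + 2) * (N // M - 1)
    return fragments + convertor + recombiner


def majority_size(N, M, T):
    """Gate count of Theorem 21 for an N-bit generalized majority with threshold T."""
    R = N - T
    if M <= R + 1:
        return 3 * N + (2 * N) // M + M - 2
    return systolic_size(N, M, R)


def majority_delay(N, M, T):
    """Time delay of Theorem 21, in clocking cycles."""
    return N + N // M - T + 1
```

For the ordinary majority function the threshold is $T = N/2 + 1$ (for example $T = 9$ for $N = 16$), so `majority_size(64, 4, 33)` and `majority_delay(64, 4, 33)` can be compared with the $N = 64$, $M = 4$ entries of Tables 5.2 and 5.3.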
An example of a 16-bit majority function with a fan-in bound 4 is given in
Fig. 5.7. Note that for a 16-bit majority function, the threshold T = 9 and the
critical error R = 7. Thus, g7 is the output.
The fragment gates are threshold gates as follows.
A: $TH(x_{4i}, x_{4i+1}, x_{4i+2}, x_{4i+3}; 1, 1, 1, 1; 1)$,
B: $TH(x_{4i}, x_{4i+1}, x_{4i+2}, x_{4i+3}; 1, 1, 1, 1; 2)$,
C: $TH(x_{4i}, x_{4i+1}, x_{4i+2}, x_{4i+3}; 1, 1, 1, 1; 3)$,
D: $TH(x_{4i}, x_{4i+1}, x_{4i+2}, x_{4i+3}; 1, 1, 1, 1; 4)$.
The rest of the circuit consists of the parallel to serial convertor and 3 systolic stages,
which are described in Sec. 5.
Figure 5.7: A 16-bit majority gate with fan-in bound 4.
5.3.2 Pattern Matching Machine
In many quality control and robotics applications, one has to compare a pattern
captured by sensors with a stored template. In most of these applications the
comparison needs to allow for a certain number of sensor errors. It is known that
this problem of error tolerant pattern matching for binary patterns can be solved
by a single threshold logic circuit [42]. Let binary vectors x and y denote the input
and the template respectively. The weight vector is created from y by replacing
all the zeros in it by $-1$s. The threshold is chosen to be $wt(y) - \varepsilon$, where $\varepsilon$ is the error tolerance and $wt(y)$ denotes the weight of $y$. For example, the threshold function
$TH(x; 1, 1, 1, -1, -1, 1, -1, 1; 2)$
will output a 1 if the 8-bit input vector $x$ matches the pattern $\langle 1, 1, 1, 0, 0, 1, 0, 1 \rangle$ with three or fewer errors.
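This construction can be verified exhaustively for the 8-bit example. The Python sketch below is illustrative (the helper names are ours) and assumes that the threshold gate outputs 1 when the weighted sum is at least the threshold.

```python
from itertools import product


def pattern_gate(y, eps):
    """Weights and threshold of the gate that matches template y within eps errors."""
    w = [1 if bit else -1 for bit in y]
    T = sum(y) - eps
    return w, T


def TH(x, w, T):
    """Threshold gate: 1 if the weighted sum of the inputs reaches T (assumed convention)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= T else 0


def hamming(x, y):
    """Number of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


template = [1, 1, 1, 0, 0, 1, 0, 1]
w, T = pattern_gate(template, 3)
matches = all(TH(x, w, T) == (1 if hamming(x, template) <= 3 else 0)
              for x in product([0, 1], repeat=8))
```

The check works because every mismatch, whether a missing 1 or a spurious 1, lowers the weighted sum by exactly one from its maximum $wt(y)$.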
In most applications, the number of inputs to this threshold function may get
very large, rendering the threshold function impractical. In such cases, the methods
of this work can be used to decompose the function into smaller threshold functions
as described by the following theorem.
Theorem 22 The $N$-bit pattern matching function detecting any number of errors can be decomposed into a network of threshold functions with fan-in $\le M$ for any $M$. This network has a size $3N + 2N/M + M - 2$ if $M \le R + 1$, or $3N\varepsilon/M + 5N/M + \varepsilon - 1$ if $M > R + 1$, and a time delay $N/M + \varepsilon + 1$, where $\varepsilon$ is the number of errors.
The proof of Theorem 22 is very similar to that of Theorem 21. Note that the critical error $R = \varepsilon$.
5.4 Conclusion and Discussion
This chapter proposes a novel decomposition scheme for general threshold functions, which is implementable with nanotechnology. The resultant circuits have bounded fan-in for the sake of reliability. Except for the first two levels, namely the input fragments and the parallel to serial converter, the circuit is implemented using a systolic architecture, which leads to very low hardware complexity. With this architecture, we add the error of one fragment at a time to the total to obtain the total error of the circuit, which is then compared with the critical error to determine the output of the threshold logic.
We show two key logic examples, the generalized majority gate and the pattern matching machine, to illustrate the general scheme of decomposing threshold functions. Table 5.2 and Table 5.3 compare the gate count and depth of the majority gate implemented with the new scheme with those proposed in [1] and [2]. As can be seen from the tables, the depth of the majority gate is sacrificed; however, the saving in hardware complexity is considerable. Taking N = 64 and fan-in bound M = 4 as a practical example, the speed is slowed down, but the hardware is only about 0.14% of the original implementation.
The systolic system is able to implement the recombiners of any decomposed threshold gate by first combining the errors from two fragments and then adding the error from one more fragment each time. The resultant architecture has a gate count of $O(NR/M)$ and a time complexity of $O(N/M + R)$, where $R$ is the critical error of the threshold gate. Clearly, the critical error $R$ is crucial to both the hardware complexity and the time complexity.

Table 5.2: Size of majority gate implemented with bounded fan-in. Results of [1,2] are shown in parentheses for comparison.

   N        M = 2            M = 4            M = 8           M = 16
    4     16 (7)
    8     32 (31)          30 (13)
   16     64 (511)         58 (77)          58 (25)
   32    128 (16383)      114 (2493)       110 (217)       114 (49)
   64    256 (1.05E6)     226 (1.6E5)      214 (13945)     214 (689)

Table 5.3: Depth of majority gate implemented with bounded fan-in. Results of [1,2] are shown in parentheses for comparison.

   N        M = 2            M = 4            M = 8           M = 16
    4      4 (3)
    8      8 (6)            6 (3)
   16     16 (10)          12 (6)           10 (3)
   32     32 (15)          24 (10)          20 (6)          18 (3)
   64     64 (21)          48 (15)          40 (10)         36 (6)

As a further application of the systolic implementation of threshold gate decomposition, we also analyze the comparison function, which is a very important arithmetic unit used in computers. The resultant decomposed circuit turns out to have a large gate count and delay, for the following reason. Unlike the generalized majority function and the pattern matching function, which have weights of unit magnitude, the comparison function has exponential weights, leading to large bounds of the fragments. Combined with the fact that the critical error of a comparison function is also large ($2^N - 1$ for the comparison of two $N$-bit numbers), the number of gates in each systolic stage is large as well. For example, an 8-bit comparison function with a fan-in bound of 4 has 4 fragments in the first level, which have bounds of 6, 24, 96 and 384, respectively. The critical error of this function is 255. Thus, the number of gates for each systolic stage is 194, 50, 17, respectively. And due to the large critical error, the delay of the implementation is also large.
Thus one can see that the time and hardware complexity of a decomposed threshold gate implemented with the systolic system depend on the critical error and the bounds of the fragments, and that the bounds of the fragments are related to the weights of the inputs. The systolic implementation is ideal for threshold gates with smaller weights. As a result, the generalized majority function and the pattern matching function are well suited applications of the systolic implementation scheme.
The decomposition scheme proposed in this work fits any general threshold function. It provides a reliable implementation of threshold functions with bounded fan-in and an extremely compact architecture.
Chapter 6
Digital Circuits for Quantum-Dot Cellular Automata
6.1 Introduction
As mentioned in Chapter 5.1, QCA is one of the most interesting realizations of
threshold gates. QCA can be implemented in many technologies including ferro-
magnetic and molecular. The molecular QCA is particularly interesting because of
its projected density of up to $1 \times 10^{12}$ devices per cm$^2$ [3]. A major advantage of QCA
over other nanoelectronic architectural styles is that the same cells that are used
for making logic gates can be used to build wires carrying logic signals. However,
QCA architectures have to rely upon only two basic building blocks, namely a three
input majority gate and an inverter. As a result, QCA implementations of only a
few logic circuits including binary adders, multipliers, barrel shifters, serial/parallel
converters are currently available [49–51].
A majority gate can easily be converted into an AND or an OR gate by tying one of its inputs to 0 or 1, respectively, so all digital circuits can be translated into ones composed of majority gates only. However, this direct translation wastes the logic of the majority gate as well as one-third of the interconnects of the circuits. In this chapter, we develop efficient implementations of the key arithmetic circuits of comparison and addition using majority gates only. The resultant circuits will thus be well suited for QCA implementation.
6.2 Quantum-Dot Cellular Automata
A basic QCA building block can be described as a cell with four quantum dots
and two charged particles may occupy the dots. The charged particles can migrate
between quantum dots when the barriers between them are lowered by an external
clock. When the barriers are raised by lowering the clock, the particles settle into
two possible stable (polarized) positions. These stable positions represent logic 0
and 1 as shown in Fig. 6.1.
Figure 6.1: (a) The basic QCA cell (b) a polarized QCA cell representing logic 1
and (c) a polarized QCA cell representing logic 0.
When the clock to a QCA cell is lowered, the polarization states of its surround-
ing cells determines its own polarization state. This enables one to design an inverter
and a three input majority gate shown in Figs. 6.2 and 6.3.
Figure 6.2: An inverter implementation in QCA.
Figure 6.3: A 3-input majority gate implementation in QCA and its symbol.
A three input majority gate outputs a logic 1 when 2 or more of its inputs are 1. By fixing one of the inputs to the majority gate at 1, one can convert the majority gate into a 2-input OR gate. Similarly, by fixing one of the inputs to 0, one can turn it into a 2-input AND gate. It therefore appears that it should be possible to implement any Boolean function in the QCA architecture. However, this produces very inefficient realizations since about 33% of the gate inputs are tied to constant values. Further, the large number of gates makes circuit layout a lot harder. It is therefore important to develop designs that allow one to fully utilize the capabilities of the 3-input majority gates.
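The conversion of a majority gate into AND and OR gates can be stated in a few lines of Python. This is only a behavioural sketch with names of our choosing; it says nothing about the QCA layout.

```python
def maj(a, b, c):
    """3-input majority gate: 1 when two or more inputs are 1."""
    return int(a + b + c >= 2)


def and2(a, b):
    """2-input AND obtained by tying one majority input to 0."""
    return maj(a, b, 0)


def or2(a, b):
    """2-input OR obtained by tying one majority input to 1."""
    return maj(a, b, 1)
```

In both derived gates one of the three inputs carries a constant, which is the source of the roughly one-third input overhead mentioned above.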
6.3 Comparison Function
Comparison is one of the common functions used in many important applications, including the realization of arbitrary Boolean functions. In CMOS technology, one can easily realize an $N$-bit comparison with a carry propagation subtractor having $O(N)$ delay. This speed can be improved to $O(\log N)$ by using a more complex block carry look-ahead subtractor. However, a further dramatic improvement in speed is only possible through the use of nanoelectronic technologies such as the Quantum-dot cellular automata (QCA) [26]. In this chapter we describe an efficient implementation of the comparison function using QCA. The resultant architecture has optimal delay $O(\log n)$ and a low gate complexity $O(n)$ for an $n$-bit comparison. The architecture can be easily pipelined to improve its throughput.
We now illustrate the use of the comparison function in implementing an arbitrary Boolean function. Consider a Boolean function $f(x_{n-1}, x_{n-2}, \ldots, x_0)$ which is 1 only when the $n$-bit input string $s = x_{n-1}x_{n-2}\cdots x_0$ has a value between $X$ and $Y$. Let $C(X)$ denote a comparison function that compares the value of string $s$ with $X$ and outputs a 1 only if the value of $s$ is greater than or equal to $X$. It is then obvious that $f = C(X)\,\overline{C(Y+1)}$. Note that since an inverter as well as a 2-input AND gate can be realized in the QCA technology, so can the function $f$ if we can design the comparison function $C(\cdot)$ in QCA. Similarly, if $f$ is 1 whenever the input string $s$ is either between $X_1$ and $Y_1$ or between $X_2$ and $Y_2$, then that function is described as $f = C(X_1)\,\overline{C(Y_1+1)} + C(X_2)\,\overline{C(Y_2+1)}$ and realized by QCA. Fig. 6.4 shows this realization.
Figure 6.4: QCA realization of function $f$ which is 1 only when its argument is between $X_1$ and $Y_1$ or between $X_2$ and $Y_2$. Note that $C(\cdot)$ is a comparison function.
It should thus be clear that any arbitrary Boolean function can be implemented using the strategy described here. Creating such an implementation requires one to determine contiguous groups of 1's in the truth table of the function. The end points of each group are then compared with the input variable string, and the results ANDed. Finally, the outputs of all the ANDs are ORed together to get the function. Clearly, all the comparisons can be done concurrently and all the ANDs can be concurrent. Thus the resultant Boolean function realization may have a fairly small depth provided one can find a small depth comparison function implementation.
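The strategy can be sketched in Python as follows. The sketch is behavioural only: the comparison $C(\cdot)$ is modelled by an integer comparison, the truth table is given as a list indexed by the value of the input string, and all function names are hypothetical.

```python
def runs_of_ones(truth_table):
    """Return (X, Y) for every maximal group of contiguous 1's at indices X..Y."""
    runs, start = [], None
    for s, bit in enumerate(truth_table):
        if bit and start is None:
            start = s
        if not bit and start is not None:
            runs.append((start, s - 1))
            start = None
    if start is not None:
        runs.append((start, len(truth_table) - 1))
    return runs


def C(s, X):
    """Comparison function: 1 when the value of the input string s is >= X."""
    return 1 if s >= X else 0


def evaluate(truth_table, s):
    """Evaluate the function as an OR over groups of C(X) AND NOT C(Y+1)."""
    out = 0
    for X, Y in runs_of_ones(truth_table):
        out |= C(s, X) & (1 - C(s, Y + 1))
    return out
```

For example, the truth table `[0, 1, 1, 0, 0, 1, 0, 1]` has three groups of 1's, so it needs six comparisons, three ANDs and a final OR, all of which except the OR can operate concurrently.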
The comparison function $C(B)$ that compares two $n$-bit strings $B = b_{n-1}b_{n-2}\cdots b_0$ and $s = x_{n-1}x_{n-2}\cdots x_0$ can be implemented by subtracting $B$ from $s$ while keeping track of the carry only. Fig. 6.5 shows a QCA implementation of this strategy.
Figure 6.5: A serial realization of comparison $C(B)$ which outputs a 1 when $s = x_{n-1}x_{n-2}\cdots x_0 \ge b_{n-1}b_{n-2}\cdots b_0$.
Unfortunately, the architecture in Fig. 6.5 has an O(n) delay. Minimizing this
delay in QCA architectures is important because, unlike in CMOS, all the gates in a
QCA implementation need to be clocked, even in combinational logic. Further, in
applications such as the Boolean function implementation, the comparator delay
directly impacts the delay of the Boolean function. In order to achieve the optimal
delay, we build the comparison output recursively. Suppose operands X and B are
partitioned as X = [X1|X0] and B = [B1|B0]. X ≥ B is true if X1 > B1 or if
(X1 = B1) and simultaneously (X0 ≥ B0). However, computing equality of X1
and B1 requires XOR gates, which are expensive in QCA technology. We therefore
define intermediate logical variables $p_i = (X_i > B_i)$ and $q_i = (X_i \ge B_i)$, i = 0 or
1. Since X1 = B1 exactly when $q_1\overline{p_1} = 1$, the output of the comparison X ≥ B is
given by the Boolean expression $p_1 + (q_1\overline{p_1})q_0 = p_1 + q_1q_0$. Further, using the
fact that $p_1 = 1$ implies $q_1 = 1$, one gets $p_1 = p_1q_1$ and $q_1 = p_1 + q_1$. Thus
$p_1 + q_1q_0$ can be rewritten as $p_1q_1 + p_1q_0 + q_1q_0$,
which is precisely the output of a three-input majority gate with inputs $p_1$, $q_1$ and
$q_0$.
Similarly, the truth value of X > B can also be computed from the intermediate
logical variables $p_i$ and $q_i$. In particular, X > B if either X1 > B1, or X1 ≥ B1 and
simultaneously X0 > B0. This can be expressed by the Boolean expression $p_1 + q_1p_0$.
Once more using the fact that $p_1 = p_1q_1$ and $q_1 = p_1 + q_1$, the expression for X > B
can be rewritten as $p_1q_1 + (p_1 + q_1)p_0$. But this says that X > B can be computed
with a 3-input majority gate with inputs $p_1$, $p_0$ and $q_1$.
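The two majority-gate identities above can be checked exhaustively. The short sketch below (the function name `identities_hold` is hypothetical) enumerates every assignment that respects the constraint that p_i = 1 implies q_i = 1 and compares both sides.

```python
from itertools import product


def maj(a, b, c):
    """Three-input majority of single bits."""
    return (a & b) | (a & c) | (b & c)


def identities_hold():
    """Check p1 + q1*q0 == MAJ(p1, q1, q0) and p1 + q1*p0 == MAJ(p1, q1, p0)."""
    for p1, q1, p0, q0 in product((0, 1), repeat=4):
        if p1 > q1 or p0 > q0:
            continue                     # impossible: "greater" without "greater or equal"
        ge = p1 | (q1 & q0)              # X >= B
        gt = p1 | (q1 & p0)              # X >  B
        if ge != maj(p1, q1, q0) or gt != maj(p1, q1, p0):
            return False
    return True
```

The constraint matters: for the excluded assignment p1 = 1, q1 = 0, q0 = 0 the expression p1 + q1q0 equals 1 while MAJ(p1, q1, q0) equals 0.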
Figure 6.6: A 4-bit comparator architecture which produces a 1 when $x_3x_2x_1x_0 \ge b_3b_2b_1b_0$. (The majority gates in gray need not be implemented.)
To compute pi and qi, the operands Xi and Bi can be partitioned recursively
and the same procedure used. This can be continued till each partition is a single
bit. Fig. 6.6 shows an implementation of a 4 bit comparator in QCA technology.
The number of majority gates used in an n-bit realization is $4n - 3 - \lceil \log_2 n \rceil$. The
tree-like structure allows for easy separation of clocking zones in QCA. Further, if
one of the strings, say B, is constant, then all the bi are known in the design and
one can apply 1’s and 0’s, as appropriate at the inputs that expect bi’s.
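The recursion can be sketched as below, under a few assumptions that are not taken from the text: operands are split into halves of sizes ⌊n/2⌋ and ⌈n/2⌉, the complement of each b_i is available at no gate cost, and the ">" output of the least significant partition at each level is simply not built (an assumed reading of the gates that need not be implemented). The name `compare` is hypothetical.

```python
def maj(a, b, c):
    """Three-input majority of single bits."""
    return (a & b) | (a & c) | (b & c)


def compare(x, b, need_p=False):
    """Compare bit lists x and b (MSB first) using only majority gates.

    Returns (p, q, gates, depth) where p = (x > b) or None if not requested,
    q = (x >= b), gates is the number of majority gates used and depth is
    the number of gate levels.
    """
    n = len(x)
    if n == 1:
        nb = 1 - b[0]                        # complemented input, assumed free
        q = maj(x[0], nb, 1)                 # x OR (NOT b)  -> x >= b
        p = maj(x[0], nb, 0) if need_p else None   # x AND (NOT b) -> x > b
        return p, q, (2 if need_p else 1), 1
    h = n // 2
    p1, q1, g1, d1 = compare(x[:h], b[:h], need_p=True)    # upper part: both outputs
    p0, q0, g0, d0 = compare(x[h:], b[h:], need_p=need_p)  # lower part: p only if needed
    q = maj(p1, q1, q0)
    p = maj(p1, q1, p0) if need_p else None
    return p, q, g1 + g0 + (2 if need_p else 1), max(d1, d0) + 1
```

With these assumptions, a 4-bit comparison uses 11 gates in 3 levels, which agrees with the gate count stated above.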
A two bit comparator architecture layout is illustrated in Fig. 6.7. Note that
for n as small as 2, the proposed strategy has the same
delay as, but more gates than, the sequential strategy of Fig. 6.5.
The advantage of the new architecture in Fig. 6.6 is realized only for larger
values of n. It reduces the comparator delay from n to $\lceil \log_2 n \rceil + 1$, while increasing
the number of majority gates from n to $4n - \lceil \log_2 n \rceil - 3$.
However, as can be seen from Fig. 6.6, the tree architecture has wire crossings
Figure 6.7: Layout of a 2 bit comparator in QCA technology.
Figure 6.8: The same comparator as in Fig. 6.6 after elimination of wire crossings.
at every level of the tree. QCA being a planar architecture, minimizing these wire
crossings is important. To achieve this objective, we duplicate certain majority gates
as shown in Fig. 6.8.
Note that this conversion does not affect the architecture but it completely elim-
inates the crossing of wires from all tree levels except the first. The resultant com-
parators of 2, 4, 8 and 16 bits use only 4, 12, 33 and 89 majority gates respectively.
6.4 A Majority Gate Adder
Derived from the design procedure of LDTA in Chapter 3, an adder made of majority
gates only can be developed. The resultant circuit has a depth of O(logN) and a
complexity of O(N logN).
Note that the LDTA adder uses only three input majority gates in all levels
except the top level. The threshold gates in the top level result from steps 1 and 2
of the procedure given. We now show that these gates can also be decomposed into
3-input majority gates.
Step 1 of the procedure, which computes $c_i$, $0 \le i < M/2$, can be modified to compute
each of these $c_i$ from $c_{i-1}$. Similarly, in step 2, one computes G and T over the
range $[jM/2+k : jM/2]$, $0 \le k < M/2$, $1 \le j < 2N/M$, directly from operand bits
over these ranges. Instead, these may be obtained by combining G and T over the
ranges $[jM/2+k-1 : jM/2]$ and $[jM/2+k : jM/2+k]$. Note that G and T over
the range $[jM/2+k : jM/2+k]$ are simply $a_{jM/2+k}b_{jM/2+k}$ and $a_{jM/2+k} + b_{jM/2+k}$
respectively. Thus, using (3.22) one gets,
$$G_{jM/2+k:jM/2} = (a_{jM/2+k}b_{jM/2+k}) + (a_{jM/2+k} + b_{jM/2+k})\,G_{jM/2+k-1:jM/2}
= \mathrm{MAJ}(a_{jM/2+k},\, b_{jM/2+k},\, G_{jM/2+k-1:jM/2}).$$
The last step in this equation is obtained from Theorem 1(P6). Similarly, using
(3.24),
$$T_{jM/2+k:jM/2} = (a_{jM/2+k}b_{jM/2+k}) + (a_{jM/2+k} + b_{jM/2+k})\,T_{jM/2+k-1:jM/2}
= \mathrm{MAJ}(a_{jM/2+k},\, b_{jM/2+k},\, T_{jM/2+k-1:jM/2}).$$
Steps 1 and 2 of the LDTA design can now be modified to read:
1. Obtain $c_i = \mathrm{MAJ}(a_i, b_i, c_{i-1})$, $0 \le i < M/2$.
2. For $1 \le j < 2N/M$, obtain $G_{jM/2:jM/2} = \mathrm{MAJ}(a_{jM/2}, b_{jM/2}, 0)$ and $T_{jM/2:jM/2} =
\mathrm{MAJ}(a_{jM/2}, b_{jM/2}, 1)$. Then, for each $1 \le j < 2N/M$, obtain for $1 \le k <
M/2$, $G_{jM/2+k:jM/2} = \mathrm{MAJ}(a_{jM/2+k}, b_{jM/2+k}, G_{jM/2+k-1:jM/2})$ and $T_{jM/2+k:jM/2} =
\mathrm{MAJ}(a_{jM/2+k}, b_{jM/2+k}, T_{jM/2+k-1:jM/2})$.
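The two modified steps can be sketched as follows. This covers only the top level described here; the later combining levels of the LDTA procedure from Chapter 3 are not reproduced. The function name `top_level`, the LSB-first bit lists and the dictionary keys `(high, low)` are illustrative assumptions.

```python
def maj(a, b, c):
    """Three-input majority of single bits."""
    return (a & b) | (a & c) | (b & c)


def top_level(a, b, M, c_in=0):
    """Top-level majority chains for N-bit operands a, b (lists of bits, LSB first).

    Step 1: carries c_i for 0 <= i < M/2, each from the previous carry.
    Step 2: for every later block j, G and T over [jM/2 + k : jM/2], built bit
    by bit. Starting the chains at 0 and 1 reproduces the base cases
    MAJ(a, b, 0) and MAJ(a, b, 1).
    """
    n_bits, m = len(a), M // 2
    carries, c = [], c_in
    for i in range(m):                       # step 1
        c = maj(a[i], b[i], c)
        carries.append(c)
    G, T = {}, {}
    for j in range(1, n_bits // m):          # step 2, one chain pair per block
        g, t = 0, 1
        for k in range(m):
            i = j * m + k
            g = maj(a[i], b[i], g)
            t = maj(a[i], b[i], t)
            G[(i, j * m)] = g                # carry out of the range with carry-in 0
            T[(i, j * m)] = t                # carry out of the range with carry-in 1
    return carries, G, T
```

Each chain has M/2 gates in series, which is where the M/2 term in the circuit depth comes from.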
With this transformation, the repetition of inputs in the top level threshold gates
that was present in the earlier LDTA is eliminated. This results in a smaller number
of total inputs. This result is summarized by the following theorem.
Theorem 23 One can compute all the carries of an N bit adder using only 3-input
majority gates. This circuit has a depth of log2(2N/M) + (M/2) and a hardware
complexity of 3N log2(4N/M)− (4N/M) + 2 inputs.
Fig. 6.9 shows the carry computation circuit of an 8-bit LDTA using only majority
gates. Note that a 3-input majority gate with one input fixed at 0 or 1 is really a
two-input AND or OR respectively.
Since the carry computational circuit of this modified LDTA uses only 3-input
majority gates, the fan-in of the circuit is no longer a limitation as long as it is
3 or larger. Thus in this design, M merely serves as a parameter that allows one
to trade off the depth with the hardware complexity in implementations using the
same 3-input majority gates. Smaller M is associated with a smaller depth but a
larger complexity and larger M implies a smaller complexity but a larger depth. In
particular, when M = 2N , the adder degenerates to a CPA.
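To make the trade-off concrete, the expressions of Theorem 23 can be evaluated for several values of M. The sketch below covers only the carry computation circuit of the theorem, so its numbers need not match figures quoted for a complete adder. The function name `carry_circuit_cost` is hypothetical, and N and M are assumed to be powers of two with 2 ≤ M ≤ 2N.

```python
def carry_circuit_cost(N, M):
    """Return (depth, inputs) of the carry circuit according to Theorem 23."""
    def log2(v):
        return v.bit_length() - 1            # exact for powers of two

    depth = log2(2 * N // M) + M // 2
    inputs = 3 * N * log2(4 * N // M) - 4 * N // M + 2
    return depth, inputs


if __name__ == "__main__":
    for M in (4, 8, 16, 32, 64, 128):
        print(M, carry_circuit_cost(64, M))
```

For N = 64 this gives a depth of 7 with 1090 inputs at M = 4, and a depth of 8 with 930 inputs at M = 8. At M = 2N = 128 it gives a depth of 64 with 192 = 3N inputs, that is, one 3-input gate per bit as in a CPA.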
6.5 Conclusion
This chapter provides a new QCA architecture for the comparison function. This
architecture reduces the delay of an n bit comparison to O(log n) while keeping
the gate complexity at O(n). Using this comparison function, one may be able to
obtain better (low depth) implementations of some Boolean functions.
Figure 6.9: Calculating carries in an 8-bit LDTA using only 3 input majority gates.
This chapter also builds on the design procedure of the LDTA in Chapter 3 to
develop an architecture for a low depth majority threshold adder (LDMA). The architecture
has a delay of O(log N) and a complexity of O(N log N), similar to those of the
LDTA. Its two major advantages over the LDTA
are that it uses only majority gates (with fan-in 3 or 5) and that it can trade off
delay against complexity by varying the parameter M in the design. For example, in
a 64 bit LDMA, by choosing M = 8 rather than M = 4, one can decrease the gate
count from 448 to 384 and the input complexity from 1410 to 1240 while increasing
the adder delay from 8 to 9. When M = 2N, the LDMA degenerates into a CPA.
Chapter 7
Conclusions
This dissertation has focused on computer architecture design using nanotechnology.
For reliability, all digital circuits developed in this work are composed
of threshold gates with bounded fan-in. We have provided general decomposition
strategies that decompose any threshold gate into a network of threshold gates with
bounded fan-in. We have also developed design algorithms for specific arithmetic
blocks such as adders, with the attractive features of low depth, low complexity and
the flexibility to trade off these parameters. Different implementation styles are also
discussed in this dissertation. We have developed a novel systolic implementation
for threshold function decomposition and combinational logic implementations that
use only majority gates, as demanded by the QCA nanotechnology.
Our general decomposition of any threshold function is an extension of the work
of [44]. Instead of using a binary tree decomposition structure as described in [44],
we generalize the structure to a k-ary tree. Though we use the same definitions and
critical concepts, such as the error, the critical error, fragments and recombiners,
our work extends the hardware reduction theorems to the k-ary tree structure.
An application of these new concepts to the comparison function shows that the
new architecture based on the k-ary tree decomposes the function into a network of
O(log N) depth and O(N) gate count and interconnect complexity. The comparison
also shows that this generalization of the decomposition scheme provides design
flexibility: one can choose a suitable k according to the fan-in bound to
obtain a circuit with optimum depth and hardware complexity.
This work has discussed designs for the adder extensively. The adder is a key arithmetic
unit in any computer architecture. We have developed a general design strategy
for adders with the introduction of a new logic primitive that eliminates the
complexity of implementing XOR functions with threshold gates. The strategy can
be used to obtain adder architectures with low depth, low hardware complexity, or
a trade-off between the two, using different ways of combining the new primitives.
We have illustrated the application of the strategy by developing low depth adders
(LDTA) and low complexity adders (LCTA). We have also obtained an enhanced low
depth adder (ELDTA) that simultaneously minimizes the complexity and the delay.
We have compared the ELDTA with two common conventional adders, the carry
propagation adder (CPA) and the group carry look-ahead adder (GCLA). The
comparison with the GCLA, also implemented with threshold gates, shows
that, as expected, the complexity and the delay of the ELDTA are both about 40% lower
than those of the GCLA. The CPA complexity is lower than that of the
ELDTA by a logarithmic factor (O(N) versus O(N log N), where N denotes the adder length). However,
as far as the delay is concerned, the ELDTA is far superior (O(log N) as against O(N))
to the CPA.
A systolic implementation scheme for the recombiners of a decomposition circuit
is developed for applications requiring very low hardware complexity. Unfortunately,
speed is sacrificed because of the serial output. It should be pointed out that there
are no results in the current literature where sequential logic is implemented using
nanoelectronic devices. This is because the loops in sequential logic run
counter to the four phase clocking scheme mandated in all nanoelectronic
implementations of digital gates. In order to successfully create the systolic design, we had
to modify the clocking to a new six phase scheme. In the application to decomposition,
we use a novel recombining scheme that combines errors from two fragments at the
first systolic stage and adds errors from one more fragment in each of the following
systolic stages. Both the hardware complexity and the time complexity of this
architecture are directly related to the critical error of the threshold function and the
bound of the fragments. Thus the systolic implementation scheme is best suited
for threshold functions with smaller weights. A generalized majority function and a
pattern matching function are demonstrated with the systolic implementation. At the
cost of some increase in time complexity, the systolic system greatly reduces the
gate count.
Finally, this dissertation has also developed better designs for some applications
using only three and five input majority gates. Such designs are immediately appli-
cable to QCA technology.
The design and implementation strategies developed in this dissertation are not
limited to the applications presented here; they apply to a broader range of digital circuits.
To summarize, this work develops reliable design schemes suitable for all digital
circuits implemented in nanotechnology.
7.1 Future Work
With the design and implementation schemes that have been developed in this work,
we believe that it is possible to develop flexible design algorithms and implementation
styles following the same methodology. The adder architecture design has sparked our
interest in developing design algorithms for addition related functions such as
multiplication. One can expect to extend the design tools developed in this work to
more complex arithmetic functions. Another interesting
direction for future work is the design and implementation of sequential logic using
nanotechnology. In this work a new clocking scheme is introduced that breaks the original design
paradigm of nanotechnology circuits, in which data flows in only one direction. This
could prove to be a new direction for exploring sequential logic implementation with
nanotechnology.
Bibliography
[1] V. Beiu, J. Peperstraete, J. Vandewalle, and R. Lauwereins, “Efficient decomposition
of comparison and its applications,” in M. Verleysen (ed.): European
Symp. Artif. Neural Networks ESANN’93, (Dfacto, Brussels), pp. 45–50, April
1993.
[2] V. Beiu, J. Peperstraete, J. Vandewalle, and R. Lauwereins, “Overview of some
efficient threshold gate decomposition algorithms,” in Proc. of 9th Intl. Conf.
Control Systems and Comp. Sci. CSCS’93, (Bucharest, Romania), pp. 458–469,
May 1993.
[3] “The international technology roadmap for semiconductors: Emerging research
devices.” http://www.itrs.net/, 2005.
[4] C. Pacha and K. Goser, “Design of arithmetic circuits using resonant tunnel-
ing diodes and threshold logic,” in Proc. of the 2nd Workshop on Innovative
Circuits and Systems for Nanoelectronics, (Delft, NL), pp. 83–93, Sep. 1997.
[5] C. Lageweg, S. Cotofana, and S. Vassiliadis, “A linear threshold gate imple-
mentation in single electron technology,” in Proc. IEEE-CS Annual Workshop
on VLSI, (Orlando, FL), pp. 93–98, Apr. 2001.
[6] A. Schmid and Y. Leblebici, “Robust circuit and system design methodologies
for nanometer-scale devices and single-electron transistors,” IEEE Trans. on
VLSI Systems, vol. 12, pp. 1156–1166, Nov. 2004.
[7] I. Amlani, A. O. Orlov, G. Toth, G. H. Bernstein, C. S. Lent, and G. L. Snider,
“Digital logic gate using quantum-dot cellular automata,” Science, vol. 284,
pp. 289–291, April 1999.
[8] N. J. A. Sloane and S. Plouffe, The Encyclopedia of Integer Sequences. Academic
Press, 1995.
[9] W. Prost, U. Auer, F.-J. Tegude, C. Pacha, K. F. Goser, G. Janssen, and
T. van der Roer, “Manufacturability and robust design of nanoelectronic logic
circuits based on resonant tunnelling diodes,” Int. J. Circ. Theor. Appl., vol. 28,
pp. 537–552, 2000.
[10] J. G. Guimaraes, H. C. Carmo, and J. C. da Costa, “Basic subcircuits with
single-electron tunneling devices,” in Proc. of 17th Symp. on Technology and
Devices, 2002.
[11] C. Lageweg, S. Cotofana, and S. Vassiliadis, “Evaluation methodology for single
electron encoded threshold logic gates,” in Proc. Int. Conf. on Very Large Scale
SystemsSystems on Chip, pp. 258–262, Dec. 2003.
[12] Y. Sun and M. D. Wagh, “A fan-in bounded low delay adder for nanotechnol-
ogy,” in Proc. of 2010 NanoTech Conf., vol. 2, (Anaheim, CA), pp. 83–86, July
2010.
[13] M. D. Wagh, Y. Sun, and V. Annampedu, “Implementation of comparison
function using quantum-dot cellular automata,” in Proc. of 2008 NanoTech
Conf., vol. 3, (Boston, MA), pp. 76–79, June 2008.
[14] K. Goser and C. Pacha, “System and circuit aspects of nanoelectronics,” in
24th European Solid-State Circuits Conf., (The Hague, NL), pp. 18–29, Sep.
1998.
[15] D. Goldhaber-Gordon, M. S. Montemerlo, J. C. Love, G. J. Opteck, and J. C.
Ellenbogen, “Overview of nanoelectronic devices,” tech. rep., MITRE Corpo-
ration, McLean, Virginia, Mar. 1997.
[16] “Definition of software interfaces.” ANSWERS Tech Report (Autonomous Na-
noelectronic Systems With Extended Replication and Signalling), Jan. 1999.
[17] A. Sellai, H. Al-Hadhrami, S. Al-Harthy, and M. Henini, “Resonant tunnel-
ing diode circuits using PSPICE,” Microelectronics Journal, vol. 34, no. 5–8,
pp. 741–745, 2003.
[18] Z. Yan and M. Deen, “New RTD large-signal DC model suitable for PSPICE,”
IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 14, pp. 167–171,
Feb. 1995.
[19] C. Moffat, “The resonant tunnelling transistor,” tech. rep., Image Processing
Group, University College London, July 1996.
[20] K. Maezawa and T. Mizutani, “A new resonant tunneling logic gate employing
monostable-bistable transition,” Japan J. Appl. Phys., vol. 32, pp. L42–L44,
1993.
[21] K. J. Chen, K. Maezawa, and M. Yamamoto, “InP-based high performance
monostable-bistable transition logic elements (MOBILE’s) using integrated
multiple-input resonant-tunneling devices,” IEEE Elec. Dev. Lett., vol. 17,
pp. 127–129, Mar. 1996.
[22] T. Akeyoshi, K. Maezawa, and T. Mizutani, “Weighted sum threshold logic
operation of MOBILE (monostable-bistable transition logic element) using
resonant-tunneling transistors,” IEEE Elec. Dev. Lett., vol. 14, pp. 475–477,
Oct. 1993.
[23] C. Pacha, K. Goser, A. Brennemann, and W. Prost, “A threshold logic full
adder based on resonant tunneling transistors,” 24th European Solid-State Cir-
cuits Conf., pp. 428–431, 1998.
[24] W. Prost, U. Auer, J. Degenhardt, A. Brennemann, C. Pacha, K. F. Goser,
and F.-J. Tegude, “A depth-2 full-adder circuit using the InP RTD/HFET
MOBILE,” in Proc. Indium Phosphide and Related Materials Conference
(IPRM’01), pp. 5045–5046, May 2001.
[25] W. Prost, U. Auer, F. J. Tegude, C. Pacha, K. F. Goser, R. Duschl, K. Eberl,
and O. G. Schmidt, “Tunnelling diode technology,” in 31st IEEE Intl. Symp.
on Multiple Valued Logic, pp. 49–58, May 2001.
[26] C. S. Lent, P. D. Tougaw, W. Porod, and G. H. Bernstein, “Quantum cellular
automata,” Nanotechnology, vol. 4, pp. 49–57, Jan. 1993.
[27] G. L. Snider, A. O. Orlov, I. Amlani, G. H. Bernstein, C. S. Lent, J. L. Merz,
and W. Porod, “Quantum-dot cellular automata: Line and majority logic gate,”
Jpn. J. Appl. Phys, vol. 38, pp. 7227–7229, 1999.
[28] A. O. Orlov, I. Amlani, G. Toth, C. S. Lent, G. H. Bernstein, and G. L. Snider,
“Experimental demonstration of a binary wire for quantum-dot cellular au-
tomata,” Applied Physics Letters, vol. 74, pp. 2875–2877, May 1999.
[29] J. R. Janulis, P. D. Tougaw, S. C. Henderson, and E. W. Johnson, “Serial bit
stream analysis using quantum-dot cellular automata,” in IEEE Transactions
on Nanotechnology, vol. 3, pp. 158–164, Mar. 2004.
[30] S. Roy and B. Saha, “Minority gate oriented logic design with quantum-dot cel-
lular automata,” in Cellular Automata, 7th International Conference on Cel-
lular Automata, for Research and Industry, ACRI 2006, Perpignan, France,
September 20-23, 2006, Proceedings, vol. 4173 of Lecture Notes in Computer
Science, pp. 646–656, 2006.
[31] S. Muroga, Threshold logic and its applications. New York: Wiley-Interscience,
1971.
[32] V. Beiu, J. M. Quintana, and M. J. Avedillo, “VLSI implementations of thresh-
old logic - a comprehensive survey,” IEEE trans. Neural Networks, vol. 14,
pp. 1217–1243, Sep. 2003.
[33] “Definition of software interfaces.” ANSWERS (Autonomous Nanoelectronic
Systems With Extended Replication and Signalling) Project Report, Jan. 1999.
[34] P. Gupta and N. K. Jha, “An algorithm for nano-pipelining of RTD-based
circuits and architectures,” IEEE Trans. on Nanotechnology, vol. 4, pp. 159–
167, Mar 2005.
[35] M. Goldmann and M. Karpinski, “Simulating threshold circuits by majority
circuits,” SIAM J. Computing, vol. 27, pp. 230–246, Feb 1998.
[36] E. Allender, “Circuit complexity before the dawn of the new millennium,” in
Lecture Notes in Computer Science, vol. 1180, pp. 1–18, Springer–Verlag, 1996.
[37] A. Maciel and D. Thérien, “Threshold circuits of small majority-depth,” Info.
and Computation, vol. 146, pp. 55–83, Oct 1998.
[38] K.-Y. Siu and J. Bruck, “On the power of threshold circuits with small weights,”
SIAM J. on Disc. Math., vol. 4, pp. 423–435, Aug 1991.
[39] N. Alon and J. Bruck, “Explicit construction of depth-2 majority circuits for
comparison and addition,” SIAM J. Discrete Math, vol. 7, no. 1, pp. 1–8, 1994.
[40] S. Cotofana and S. Vassiliadis, “Signed digit addition and related operations
with threshold logic,” IEEE Trans. on Computers, vol. 49, no. 3, pp. 193–207,
2000.
[41] R. Katz, Contemporary Logic Design. Benjamin/Cummings, 1994.
[42] V. Annampedu and M. D. Wagh, “Approximate pattern matching in nanotech-
nology,” in Proc. of Nanotech 2006, vol. 3, (Boston, MA), pp. 316–319, May
7–11 2006.
[43] G. S. Glinski and C. K. Yue, “Decomposition of n-variable threshold function
into p-variable threshold functions, where p < n,” Tech. Rep. 63-10, Dept. of
EE, Univ. of Ottawa, Canada, June 1963.
[44] V. Annampedu and M. D. Wagh, “Reconfigurable approximate pattern match-
ing architectures for nanotechnology,” Microelectronics, vol. 38, pp. 430–438,
2007.
[45] V. Bohossian, M. Riedel, and J. Bruck, “Trading weight size for circuit depth:
An $\widehat{LT}_2$ circuit for comparison,” Tech. Rep. Paradise, ETR028, California In-
stitute of Technology, Nov. 1998.
[46] P. A. Jackson, C. P. Chan, J. E. Scalera, C. M. Rader, and M. M. Vai, “A
systolic FFT architecture for real time FPGA systems,” tech. rep., MIT Lincoln
Laboratory, Lexington, MA, 2005.
[47] G. Sal and M. Ari, “FPGA-based customizable systolic architecture for image
processing applications,” in Reconfigurable Computing and FPGAs, 2005. In-
ternational Conference on, pp. 8 pp. –3, Feb. 2005.
[48] H. T. Kung, “Why systolic architectures?,” Computer Magazine, vol. 15,
pp. 37–46, Jan. 1982.
[49] K. Walus, G. A. Jullien, and V. S. Dimitrov, “Computer arithmetic structures
for quantum cellular automata,” in Proc. 37th Asilomar Conf. Signals, Systems
and Computers, (Pacific Grove, CA), pp. 9–12, Nov. 9–12 2003.
[50] I. Hanninen and J. Takala, “Binary multipliers on quantum-dot cellular au-
tomata,” Facta Universitatis Ser.: Elec. Energ., vol. 20, pp. 541–560, December
2007.
[51] H. Cho and E. E. Swartzlander, “Adder designs and analyses for quantum-
dot cellular automata,” IEEE Trans. Nanotechnology, vol. 6, pp. 374–383, May
2007.
Vita
Yichun Sun was born in Shanghai, China on June 24th, 1983. She received her
Bachelor of Engineering degree in Electrical Engineering from Shanghai Jiaotong
University, in June 2005. She obtained her Master of Science degree in Electrical
Engineering from Lehigh University, in May 2007.
Since August 2005, Yichun has been studying Electrical Engineering at Lehigh
University, and since 2006 she has worked as a research assistant under the supervision
of Dr. Meghanad D. Wagh. She has worked on computer architectures and digital
design with nanotechnology. From August 2006 to May 2008, she also worked
as a teaching assistant in the Electrical Engineering department at Lehigh University.
She has also been an intern at LSI Corporation since May 2011, working on IC design
and verification.
She has authored two conference papers and presented them at Nanotech conferences.
