A configurable decoder for pin-limited applications by Jordan, Matthew Collin
Louisiana State University
LSU Digital Commons
LSU Master's Theses Graduate School
2006
A configurable decoder for pin-limited applications
Matthew Collin Jordan
Louisiana State University and Agricultural and Mechanical College, mjorda6@lsu.edu
Recommended Citation
Jordan, Matthew Collin, "A configurable decoder for pin-limited applications" (2006). LSU Master's Theses. 1842.
https://digitalcommons.lsu.edu/gradschool_theses/1842
A CONFIGURABLE DECODER FOR PIN–LIMITED
APPLICATIONS
A Thesis
Submitted to the Graduate Faculty of the
Louisiana State University and
Agricultural and Mechanical College
in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering
in
The Department of Electrical and Computer Engineering
by
Matthew Collin Jordan
B.S., Michigan Technological University, 2004
December 2006
Acknowledgments
First, I would like to thank Dr. Ramachandran Vaidyanathan for his patience, guidance, and
advice throughout the course of the past two years. His assistance has not only developed
the research in this thesis, but also the abilities I have as a student and engineer. I would
also like to acknowledge the contributions of the members of the committee, Dr. Jerry Trahan and
Dr. Suresh Rai, which have greatly helped in the development of the thesis. Finally, I would
like to thank my parents and my siblings. Without their support, none of this research would
have been possible. This thesis is dedicated to them.
Table of Contents
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter
1 Introduction
2 Pin Limitation
  2.1 Pin Limitation in Reconfigurable Architectures
    2.1.1 The R-Mesh: A Theoretical Reconfigurable Model
    2.1.2 Field Programmable Gate Arrays
  2.2 Approaches for the Pin Limitation Constraint
3 Preliminaries
  3.1 Assumptions and Notation
    3.1.1 Performance Parameters
    3.1.2 Other Notation and Concepts
  3.2 Building Blocks
    3.2.1 Fan-in and Fan-out
    3.2.2 Fixed Decoders – 1-hot Decoders
    3.2.3 Multiplexers
    3.2.4 Look-up Table
    3.2.5 Shift Register
    3.2.6 Modulo-α Counter
  3.3 Configurable Decoders
4 The Mapping Unit: Theory
  4.1 A General View of the Mapping Unit
    4.1.1 Functional Description of the Mapping Unit
    4.1.2 Constructing Ordered Partitions for a Mapping Unit
  4.2 Number of Subsets Produceable by MU(z,y,n,α)
    4.2.1 Number of Independent Subsets
    4.2.2 Total Number of Subsets
5 The Mapping Unit: Realizations
  5.1 Fixed Mapping Units
  5.2 Reconfigurable Mapping Units
  5.3 Bit-Slice Mapping Units
6 A Configurable Decoder
  6.1 Illustrative Examples
  6.2 Performance of CD(x,z,y,n,α)
  6.3 Gate-Cost Constrained Configurable Decoders
7 Implementations of Useful Subsets
  7.1 Binary Reduction
  7.2 ASCEND/DESCEND
  7.3 1-Hot
8 Simulation Results
  8.1 Methodology
  8.2 Simulations
    8.2.1 Integral Decoders
    8.2.2 Bit-slice Decoders
  8.3 Regression Analysis Results
9 Parallel Configurable Decoder
  9.1 An Illustrative Example
  9.2 General Observations
10 Conclusions
  10.1 Other Configurable Decoder Variants
  10.2 Future Directions
Bibliography
Vita
List of Tables
2.1 Intel microprocessor characteristics, 1971–2001
2.2 An illustration of a decoder for four sets Si of subsets of Z8
3.1 Asymptotic gate cost and delay of building blocks
3.2 Two possible 1-hot bit patterns for z = 3, n = 8
4.1 Sets of subsets of Z8 for Example 4.1
4.2 Partition $\pi_{i,j}$ for subsets $S^i_j$ of Table 4.1
4.3 Mapping unit values used to produce the sets in Example 4.1
4.4 Two different orderings for the partitions of sets S0 and S1 in Example 4.1 resulting in different sets of source strings used to produce the subsets in each set
4.5 A set of log z subsets of Z16, where the number of blocks induced by the product of the partitions of the subsets has z = 8 blocks
5.1 Sets of n-subsets (n = 8, z = 4) used for fixed mapping units in Figures 5.3 and 5.5
5.2 Configuration LUT words to produce the subsets from Table 5.1
5.3 Ordered partition patterns for an RMU resulting from the configuration LUT words of Table 5.2 and the hardwiring shown in Figure 5.3
5.4 Subsets with repeated patterns for n = 16, α = 4
6.1 Sets S0 and S1 with corresponding partitions
6.2 Input values needed for the configurable decoder to produce the subsets of S0 and S1 in Table 6.1
6.3 Subsets produced by combining source strings of S0 (resp., S1) with partition of $\vec{\pi}_1$ (resp., $\vec{\pi}_0$)
6.4 Possible subsets produceable from $\mu(u_j, \vec{\pi}_0)$ and $\mu(u_j, \vec{\pi}_1)$
6.5 Sets S0 and S1 of Z16 for Example 6.3
6.6 $\frac{n}{\alpha}$-bit strings produced from $\lceil \frac{z}{\alpha} \rceil$-bit input strings in CD(x,z,y,n,α)
7.1 Two binary tree based reduction patterns
7.2 Partitions and source-strings generated for ASCEND/DESCEND bit patterns, for n = 8 and z = 4
7.3 A set of 1-hot subsets of Z16
8.1 Parameter values for a configurable decoder with FMU, for G = n log n
8.2 Parameter values for a configurable decoder with fixed mapping unit (CDF)
8.3 Parameter values for a λℓ × n LUT
8.4 Parameter values for a configurable decoder with reconfigurable mapping unit
8.5 Parameter values for a universal configurable decoder with reconfigurable mapping unit
8.6 Integral decoder delays [ns]
8.7 Integral decoder areas [µm²]
8.8 Integral decoder power consumptions [mW]
8.9 LUT areas [µm²] in a bit-slice configurable decoder
8.10 Mapping unit areas [µm²]
8.11 Area (µm²) for mod-α counter and shift registers, 2 ≤ log n < 256, 1 ≤ log α < 6
8.12 Bit-slice CDF area (µm²)
8.13 Bit-slice CDR area (µm²)
8.14 Bit-slice Univ. area (µm²)
8.15 Bit-slice F-Univ. area (µm²)
8.16 Functions used in regression analysis for each module
8.17 Constants found from regression analysis for each module
9.1 Subsets qi,0 and qi,1 for n = 20 and m = 4
List of Figures
1.1 Proposed configurable decoder overview
2.1 Normalized transistor and pin counts for 1971 – 2001
2.2 A 3 × 5 R-Mesh with all possible port connections; one bus is shown in bold
2.3 Finding the number of flagged groups on a one-dimensional R-Mesh
2.4 A typical FPGA structure
2.5 Xilinx Virtex-5 configurable logic block
2.6 Reconfiguration of an 8 × 12 FPGA of Example 2.1
2.7 Reconfiguration of an 8 × 12 FPGA of Example 2.2
2.8 An x to n decoder in an IC chip
3.1 Fan-in problem (a) and fan-out problem (b) of degree f and width z
3.2 A fixed z to n decoder
3.3 A logic circuit for a 4 to 16 1-hot decoder. Note the use of an enable signal to force the output of the decoder to ∅
3.4 General implementation of a 1-hot decoder
3.5 Multiplexer block diagram
3.6 A 4 to 1 multiplexer circuit
3.7 Look-up table block diagram
3.8 Look-up table implementation
3.9 An α-position shift register of width $\frac{z}{\alpha}$
3.10 An implementation of a SR(α, $\frac{z}{\alpha}$)
3.11 A modulo-α counter block diagram
3.12 A mod-$2^d$ counter with truth table
3.13 Circuit for bit i of a synchronous counter
3.14 A modulo-α counter implementation using a synchronous counter and a mask computation
3.15 A configurable decoder block diagram
4.1 A mapping unit decoder block diagram
4.2 Multicasts of 4-bits to 8-bits
4.3 Division of an n-bit quantity into χ + 1 buckets of at most (z − 1) contiguous bits
4.4 Assignment of source string bits to bucket indices
4.5 Mapping of a source string to bucket Bi under two different ordered partitions $\vec{\pi}_1$, $\vec{\pi}_2$
5.1 Block diagram of a mapping unit MU(z,y,n,α)
5.2 Classification of mapping unit realizations
5.3 A fixed mapping unit MU(4,2,8,1) that produces S0 and S1 in Table 4.1
5.4 General structure of a fixed mapping unit; signals B0, B1, . . . , Bn−1 are discussed later
5.5 A fixed mapping unit MU(4,4,8,1) that produces all subsets in Table 5.1
5.6 A reconfigurable mapping unit MU(z,y,n,α)
5.7 Bit-slice mapping unit implementation
6.1 Block diagram of a configurable decoder CD(x,z,y,n,α)
7.1 Two binary tree reductions of n = 8 elements
7.2 ASCEND/DESCEND communication pairs for n = 8
8.1 Block diagrams of all decoders simulated: (a) 1-hot, (b) pure LUT-based, (c) configurable decoder with FMU, and (d) configurable decoder with RMU
8.2 Simulation process
8.3 Wire distributions in simulated mapping units
8.4 Integral decoder delays [ns]
8.5 Integral decoder areas [µm²]
8.6 Integral decoder recalculated areas [µm²]
8.7 Integral decoder power consumption [mW]
8.8 Mapping unit area (µm²)
8.9 Bit-slice CDF area [µm²]
8.10 Bit-slice CDR area [µm²]
8.11 Bit-slice Univ. area [µm²]
8.12 Bit-slice F-Univ. area [µm²]
8.13 Integral decoder expected area (µm²) under regression analysis
9.1 A parallel configurable decoder that generates the 1-hot subset of Zn
9.2 Hardwired partitions in the parallel configurable decoder
9.3 A parallel configurable decoder CD(x,z,y,n,α,P)
10.1 A serial configurable decoder variant
10.2 A conceptual view of a recursive bit-slice configurable decoder. Note that αi = α0α1 . . . αi−1
Abstract
Pin limitation is the restriction imposed on an IC chip by the unavailability of a sufficient
number of I/O pins. This impacts the design and performance of the chip, as the amount
of information that can be passed through the boundary of the chip becomes limited. One
area that would benefit from a reduction of the effect of pin limitation is reconfigurable
architectures. In this work, we consider reconfigurable devices called Field Programmable
Gate Arrays (FPGAs). Due to pin limitation, current FPGAs use a form of 1-hot decoder to
select elements (one frame at a time) during partial reconfiguration. This results in a slow and
coarse selection of elements for reconfiguration. We propose a module that performs a focused
selection of only those elements that require reconfiguration. This reduces reconfiguration
overheads and enables the speeds needed for dynamic reconfiguration.
The problem is that of selecting subsets of an n-element set in a fast, focused and in-
expensive manner. This thesis proposes such a configurable decoder that bridges the gap
between the inexpensive, but inflexible, fixed 1-hot decoder, and the expensive, but flexible,
pure LUT-based decoder. Our configurable decoder uses a LUT with a narrow output and
a low cost in tandem with a special fixed decoder called a mapping unit that expands the
output of the LUT to a desired n-bit output. We demonstrate several implementations of
the mapping unit, each with different capabilities and trade-offs. A key result of this work is
that for any gate cost $G = O(n \log^k n)$ (where $k$ is a constant), if a pure LUT-based solution
produces $\lambda$ independent subsets, then our method produces $\Omega\left(\frac{\lambda \log n}{\log \log n}\right)$ independent subsets
for the same cost. Our decoder also produces many more dependent subsets (that depend
on the choice of the $\Omega\left(\frac{\lambda \log n}{\log \log n}\right)$ independent subsets).
We provide simulation results for the configurable decoder and predict future trends from
the simulation data; these confirm the theoretical advantages of the proposed decoder. We
illustrate the implementation of important subset classes on our configurable decoder and
make key observations on a generalized variant.
Chapter 1
Introduction
As processor speeds increased faster than the rate at which information could
enter and exit a chip, it was found that, in many cases, increasing processor speed while
ignoring the effects of I/O produced little benefit [14]. Essentially, if information cannot
get into or out of the chip at a fast enough rate, then CPU speed diminishes in importance.
This implies that modern chips benefit from a high rate of data transfer between the inside
and outside of the chip. This data transfer can be improved by increasing the bit rate and/or
the number of I/O pins. Since pins cannot be miniaturized to the same extent as transistors
(pins must be physically strong enough to withstand contact), the rate at which the number
of transistors on a chip has increased far outpaces the rate at which the number of pins on a
chip has increased [26]. For example, in Intel microprocessors, the number of transistors has
increased by a factor of 20,000 in the last 30 years, whereas the number of pins in these chips
increased merely by a factor of 30 [15]. Therefore, the rate at which a chip can generate and
process information is much larger than the available conduit to convey this information.
The restriction imposed by the unavailability of a sufficient number of pins in a chip is called
pin limitation.
This thesis proposes a method to alleviate the pin limitation constraint in IC chips.
Past research in this area approached the problem by increasing the number of pins on the
chip [25]. Others have proposed methods to change the application or chip functionality to
reduce pin requirements [12, 21]. In modern ICs, however, there seems to be little room
for increasing the number of pins that can be physically placed on a single chip without a
substantial change in technology. We seek a way to make better use of the available pins
without altering the functionality of the underlying chip.
One area that would benefit from a reduction of the effects of pin limitation is recon-
figurable architectures, in particular, Field Programmable Gate Arrays (FPGAs) [5, 9]. An
FPGA is an array of programmable logic elements, all of which must be configured to suit the
application at hand. Since FPGA elements are simple logic blocks, information to configure
the chip must come from outside the chip; given that a modern FPGA can require over 70
million bits for a single configuration encompassing the entire chip [29], a dearth of pins to
input this information can have severe time consequences.
FPGAs have evolved from simple electronic breadboards into competitors
to Application Specific Integrated Circuits (ASICs) and even microprocessors in many
applications [13]. A number of applications benefit greatly from a technique called dynamic
reconfiguration, in which elements of the FPGA chip are reconfigured (to alter their
interconnections and functionality) while the application is executing on the FPGA [22]. This form
of reconfiguration holds the promise of better resource usage and faster execution of certain
algorithms. However, it requires fast reconfiguration. Currently, FPGAs adopt a method
called partial reconfiguration [2, 27, 29], where only a portion of the FPGA is reconfigured
while the remainder continues to operate. This involves selecting the portion of the FPGA requiring
reconfiguration and inputting the necessary configuration bits. Due to pin limitation, only
a very coarse selection is available on FPGAs, resulting in a large number of elements being
selected for reconfiguration. Unfortunately, this implies that elements that do not need to
be reconfigured must be “configured” anyway to their existing states along with those that
actually require reconfiguration. Moreover, this process could take multiple cycles of re-
configuration to set all desired elements to the desired configuration. Consequently, current
FPGAs fail to fully exploit the power of dynamic reconfiguration demonstrated on theoretical
models [22].
Selection of elements for reconfiguration is performed by decoders. Technically, an x to
n decoder (where $x \ll n$) converts x input bits to n output bits. If these output bits are
viewed as representing elements of an n-element set $Z_n$, then the decoder simply selects the
elements of a subset of $Z_n$. Current FPGAs employ “fixed decoders” that fix the mapping
between input and output bits. In fact, the fixed decoder that is normally employed is the
1-hot decoder [29], which accepts a $\log n$-bit input and generates a 1-element subset of $Z_n$.
That is, a 1-hot decoder can select only a single element at a time. This causes problems
if, in an array of n elements, some arbitrary pattern of elements is needed. Here, selecting
an appropriate subset can take up to $O(n)$ rounds. Notwithstanding this inflexibility, 1-hot
decoders are simple combinational circuits with a low $O(n \log n)$ gate cost and a low $O(\log n)$
propagation delay.
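The round count described above can be illustrated with a minimal Python sketch. It is a behavioral model only, not a circuit from this thesis; the function names `one_hot_decode` and `rounds_to_select` are hypothetical and introduced purely for illustration.

```python
def one_hot_decode(address, n):
    """Return the n-bit 1-hot pattern that selects element `address` of Z_n."""
    if not 0 <= address < n:
        raise ValueError("address out of range")
    return [1 if i == address else 0 for i in range(n)]


def rounds_to_select(subset, n):
    """Count the rounds a 1-hot decoder needs to select every element of `subset`.

    Each round selects exactly one element, so the count equals the subset size.
    """
    selected = [0] * n
    rounds = 0
    for element in sorted(subset):
        pattern = one_hot_decode(element, n)
        # Accumulate the elements selected so far (bitwise OR of the patterns).
        selected = [s | p for s, p in zip(selected, pattern)]
        rounds += 1
    return rounds


n = 8
pattern = one_hot_decode(5, n)                    # selects only element 5
even_rounds = rounds_to_select({0, 2, 4, 6}, n)   # one round per element
all_rounds = rounds_to_select(set(range(n)), n)   # worst case: n rounds
print(pattern, even_rounds, all_rounds)
```

In this model an arbitrary subset of size k costs k rounds, which is the source of the O(n) worst case.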
If flexibility is desired, then a “configurable decoder” is required. Currently, a configurable
decoder amounts to a look-up table (LUT); we will call this existing decoder a pure LUT-based
configurable decoder. A $2^x \times n$ LUT is simply a $2^x$-location table of n-bit entries. It
can produce $2^x$ independently chosen n-bit patterns that can be selected by an x-bit address.
As such, this decoder is highly flexible, as the n-bit patterns chosen for the LUT need no
relationship to each other. Unfortunately, it is also costly; the gate cost of such a LUT is
$\Theta(n 2^x)$. For a gate cost of $\Theta(n \log n)$, this LUT only produces $\Theta(\log n)$ subsets; to produce
the same number of subsets as a 1-hot decoder, the pure LUT-based configurable decoder
has $\Theta(n^2)$ gate cost. Clearly, this does not scale well.
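The cost comparison above can be made concrete with a rough numeric sketch. It assumes all hidden constants equal 1 and base-2 logarithms, neither of which is stated in the text, so the numbers indicate scaling only; the helper names are hypothetical.

```python
import math


def lut_gate_cost(num_entries, n):
    """Idealized gate cost of a LUT with `num_entries` words of n bits each."""
    return num_entries * n


def one_hot_gate_cost(n):
    """Idealized n log n gate cost of a 1-hot decoder (constant factor ignored)."""
    return n * math.log2(n)


def lut_entries_for_budget(budget, n):
    """Number of n-bit LUT words affordable within a given gate budget."""
    return int(budget // n)


n = 1024
budget = one_hot_gate_cost(n)                 # n log n = 1024 * 10
entries = lut_entries_for_budget(budget, n)   # log n = 10 subsets for that budget
full_cost = lut_gate_cost(n, n)               # n subsets cost n * n gates
print(budget, entries, full_cost)
```

For n = 1024, the budget of a 1-hot decoder buys the pure LUT only 10 subsets, whereas matching the n subsets of the 1-hot decoder costs on the order of a million gates under these assumptions.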
In this thesis we propose a configurable decoder that seeks to bridge the gap between the
high flexibility, high cost LUT and the low flexibility, low cost fixed decoder by utilizing both
in the manner shown in Figure 1.1. In the first stage a small LUT provides the flexibility
with low cost as $z \ll n$. In the second stage an inexpensive decoder expands the output to
the form required. This solution is somewhat similar to the well-known accelerated cascading
technique for parallel algorithms [16], where a slow but efficient algorithm is used initially
to reduce the problem size sufficiently for a subsequent fast and inefficient algorithm to
complete the solution.

FIGURE 1.1: Proposed configurable decoder overview (an x-bit input addresses a LUT with high flexibility and high cost; its z-bit output feeds a mapping unit with low flexibility and low cost, which produces the n-bit output)
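The two-stage idea can be sketched behaviorally as follows. The actual mapping unit is developed in Chapters 4 and 5; here we simply assume a hypothetical hardwired `wiring` list in which each of the n output bits copies one of the z LUT output bits. All names and values below are illustrative assumptions.

```python
def lut_stage(lut, address):
    """Stage 1: an x-bit address selects a z-bit word from a small configurable table."""
    return lut[address]


def mapping_stage(word, wiring):
    """Stage 2 (assumed form): output bit i copies bit wiring[i] of the z-bit word."""
    return [word[src] for src in wiring]


def configurable_decode(lut, wiring, address):
    """Compose both stages: x-bit address -> z-bit word -> n-bit output."""
    return mapping_stage(lut_stage(lut, address), wiring)


# Illustrative sizes: x = 2 (four LUT words), z = 4, n = 8.
lut = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
wiring = [0, 0, 1, 1, 2, 2, 3, 3]  # hypothetical fixed expansion from z to n bits

output = configurable_decode(lut, wiring, 1)
small_lut_bits = len(lut) * len(lut[0])   # 2^x * z bits stored
pure_lut_bits = len(lut) * len(wiring)    # 2^x * n bits for a pure LUT
print(output, small_lut_bits, pure_lut_bits)
```

Reprogramming `lut` changes which subsets are produced, while `wiring` stays fixed; the storage saving comes from keeping z-bit rather than n-bit words.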
One key result that we derive is that for any gate cost $G$ such that $G \le C n \log^k n$, where
$C$ and $k$ are constants, if a pure LUT-based configurable decoder can produce $\lambda$ subsets, then
our method can produce $\Omega\left(\frac{\lambda \log n}{\log \log n}\right)$ subsets for the same cost. Our decoder also produces
many more dependent subsets (that depend on the choice of the first $\Omega\left(\frac{\lambda \log n}{\log \log n}\right)$ subsets).
A particular case of this result occurs when $G = \Theta(n \log n)$. The pure LUT-based solution
produces $\Theta(\log n)$ subsets while we produce $\Omega\left(\frac{\log^2 n}{\log \log n}\right)$ subsets. That is, for the cost of a
1-hot decoder, our method exceeds the flexibility of a LUT-based decoder. Note that while
the 1-hot decoder or any other fixed decoder can produce just that fixed set of (albeit n)
subsets, configurable decoders can produce different subsets (selected arbitrarily by the user)
at different times (by off-line reconfiguration). This flexibility is important, particularly in
an environment such as an FPGA, whose reconfigurability has made it the platform of choice
for many applications, even in preference to ASICs.
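To give a feel for the gap in the particular case of a gate cost on the order of n log n, the sketch below evaluates the two growth expressions for a few values of n. It assumes base-2 logarithms and drops all hidden constants, so the figures show relative growth rather than actual subset counts; the function names are hypothetical.

```python
import math


def lut_subsets(n):
    """Growth term log n for the pure LUT-based decoder (constants dropped)."""
    return math.log2(n)


def proposed_subsets_bound(n):
    """Growth term log^2 n / log log n for the proposed decoder (constants dropped)."""
    log_n = math.log2(n)
    return log_n ** 2 / math.log2(log_n)


for n in (2 ** 8, 2 ** 16, 2 ** 32):
    ratio = proposed_subsets_bound(n) / lut_subsets(n)
    print(n, lut_subsets(n), proposed_subsets_bound(n), ratio)
```

The ratio of the two terms is log n / log log n, which grows without bound as n increases.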
This thesis is organized as follows. In Chapter 2, the motivations for this thesis are
expanded upon, including pin limitation and dynamically reconfigurable systems. We also
explore current solutions to the pin limitation constraint and demonstrate an initial view of
the capabilities of decoders.
In Chapter 3, we provide the notation and assumptions used throughout the remainder
of this thesis. The basic building blocks upon which our configurable decoder is built are
explained and analyzed. We conclude this chapter with an introduction to a configurable
decoder and demonstrate that the pure LUT-based configurable decoder is not feasible as a
solution to pin limitation.
In Chapter 4, we provide the theoretical basis for the mapping unit (Figure 1.1), a key
component of the proposed decoder, which expands the z-bit output of a LUT to the n-bit
decoder output. This chapter lays the groundwork for the capabilities of our configurable
decoder. In Chapter 5, realizations of the mapping unit are presented. We demonstrate
several possible implementations, each with different capabilities and trade-offs.
In Chapter 6, these mapping units are integrated with the preceding LUT to create
configurable decoders. The performance of the proposed configurable decoder is analyzed
for a given gate cost G and compared against the pure LUT-based approach. We show that
in every case, our solution outperforms the LUT in some capacity.
In Chapter 7, we show how our configurable decoder can produce the subsets that correspond
to several important classes of algorithms and communication patterns.
Chapter 8 provides simulation results for different implementations of the configurable de-
coder. Nonlinear regressions performed on this data provide the constants hidden by the
asymptotic notation. This provides insight into future cost trends for the proposed modules.
In Chapter 9, some observations on a more generalized variant of our configurable decoder
are made. Patterns of bits that are difficult for our configurable decoder to produce are
more easily produced by this variant. Finally, in Chapter 10, we summarize our results and
identify some future avenues of research.
Chapter 2
Pin Limitation
Input and output (I/O) pins allow communication between the interior and exterior of an
Integrated Circuit (IC) chip. Typically, this communication manifests as signals from an
external source to components within the chip or vice-versa. However, the number of pins
available is limited. This “pin limitation” stems primarily from the extent to which pins can
be miniaturized without compromising their structural integrity. Pin limitation impacts the
design as well as the performance of a chip, as the amount of information that can be passed
through the boundary of the chip is limited by the number of I/O pins.
While, according to Moore’s Law, the number of transistors in a chip doubles roughly
every two years [26], the number of I/O pins available on that chip does not. This is
demonstrated in Table 2.1, where, over the past three decades, the number of transistors
available on an Intel microprocessor has increased by a factor greater than 20,000, while the
number of pins on the microprocessor package has only increased by a factor of 30 (see also
Figure 2.1). Thus, the potential amount of information needed inside a chip has increased
significantly faster than the means to allow that information in and out of the chip. Under
current technology, this trend does not appear likely to change.

TABLE 2.1: Intel microprocessor characteristics, 1971–2001 [15]
Processor    Year  Feature Size (µm)  Transistors  Frequency (MHz)  Package
4004         1971  10                 2.3k         0.75             16-pin DIP
8008         1972  10                 3.5k         0.5–0.8          18-pin DIP
8080         1974  6                  6k           2                40-pin DIP
8086         1978  3                  29k          5–10             40-pin DIP
80286        1982  1.5                134k         6–12             68-pin PGA
Intel386     1985  1.5–1.0            275k         16–25            100-pin PGA
Intel486     1989  1–0.6              1.2M         25–100           168-pin PGA
Pentium      1993  0.8–0.35           3.2–4.5M     60–300           296-pin PGA
Pentium Pro  1995  0.6–0.35           5.5M         166–200          387-pin MCM PGA
Pentium II   1997  0.35–0.25          7.5M         233–450          242-pin SECC
Pentium III  1999  0.25–0.18          9.5–28M      450–1000         330-pin SECC2
Pentium 4    2001  0.18–0.13          42–55M       1400–3200        478-pin PGA

FIGURE 2.1: Normalized transistor and pin counts for 1971 – 2001 (logarithmic plot of normalized value versus year, with one curve for the normalized number of transistors and one for the normalized number of pins)
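The growth factors quoted from Table 2.1 can be checked with a few lines of arithmetic. The sketch below uses only the first (4004) and last (Pentium 4) rows of the table; the variable names are ours, and the Pentium 4 transistor count is kept as the range given in the table.

```python
# Values taken from Table 2.1 (4004 in 1971, Pentium 4 in 2001).
transistors_4004 = 2_300
transistors_pentium4 = (42_000_000, 55_000_000)  # range listed in the table
pins_4004 = 16
pins_pentium4 = 478

# Growth factor for each end of the Pentium 4 transistor range.
transistor_growth = tuple(t / transistors_4004 for t in transistors_pentium4)
pin_growth = pins_pentium4 / pins_4004

print(transistor_growth)  # roughly 18,000x to 24,000x
print(pin_growth)         # roughly 30x
```

The upper end of the transistor range gives the factor of more than 20,000 cited in the text, against a pin growth of just under 30.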
While techniques to alleviate pin limitation are applicable in the design of any IC chip,
certain applications benefit more significantly. A reconfigurable system is one such example
that we elaborate upon next.
2.1 Pin Limitation in Reconfigurable Architectures
In a reconfigurable architecture, the functionalities of the components that make up the
architecture and the way in which these components are interconnected can be altered to suit
the demands of a particular application [4, 22]. Such architectures are generally composed
of a mesh of configurable elements connected by a configurable interconnection network. If
the architecture can be reconfigured with little overhead, then it is said to be dynamically
reconfigurable [22]. Dynamically reconfigurable architectures can particularly benefit from
an increase in the number of input pins to the chip (as we will show later in this section).
Dynamic reconfiguration has two main benefits. First, a dynamically reconfigurable
architecture can reconfigure between various stages of an application to use its resources
optimally at each stage. That is, it reuses hardware resources more efficiently across different
parts of an algorithm. For example, an algorithm using two multipliers in Stage 1 and eight
adders in Stage 2 can run on dynamically reconfigurable hardware that configures as two
multipliers for Stage 1 and as eight adders for Stage 2. Consequently, this algorithm will run
on hardware that has two multipliers or eight adders, as opposed to a non-reconfigurable
architecture that would need two multipliers and eight adders.
The second benefit of dynamic reconfiguration is a fine tuning of the architecture to
exploit characteristics of a given instance of the problem. For example, in matching a sequence
to a given pattern, the internal “comparator” structure can be fine-tuned to the pattern.
Further, this tuning to a problem instance can also produce faster solutions [22].
However, the benefits of dynamic reconfiguration come at a cost. Dynamic reconfigura-
tion has its architectural and algorithmic overheads and can be difficult to realize [22]. Since
the primary motivating factor in our work is reconfigurable computing, we will focus our
discussion on this application area; however, other areas may benefit from this work.
In the next section we begin our discussion of the advantages of dynamic reconfiguration
in the setting of a theoretical model. Then, in Section 2.1.2 we place this advantage in
the context of a practical reconfigurable environment that shows the implications of pin-
limitation in this area.
2.1.1 The R-Mesh: A Theoretical Reconfigurable Model
An R-Mesh is a two-dimensional array of processors connected by an underlying mesh net-
work. Each processor has four ports named by the cardinal directions, North, South, East,
and West, connecting it to its nearest mesh neighbors. These ports can be connected
internally to create seamless buses through multiple processors. As shown in Figure 2.2, there are
a total of 15 possible port connections that allow a rich variety of buses [22]; for clarity, one
of the buses is shown in bold. R-Meshes can also be based on meshes of different dimensions.
For example, a one-dimensional R-Mesh would be a linear array of processors, each capable
of connecting or disconnecting its East and West ports.
FIGURE 2.2: A 3 × 5 R-Mesh with all possible port connections; one bus is shown in bold.
As an example of the capabilities of dynamic reconfiguration, consider an N-processor
one-dimensional R-Mesh whose processors are partitioned into k groups, each of size N/k. A
group is considered “flagged” if any processor in the group is flagged. Suppose we wish to
determine the number of flagged groups. This problem is easily solved on the one-dimensional
R-Mesh in O(log k) time. In contrast, most other models will require Ω(log N) time. In
particular, a one-dimensional (non-reconfigurable) mesh (or linear array) will require Θ(N)
time for this problem.
The algorithm proceeds in three separate stages on the R-Mesh. In the first two stages,
each of the k groups determines if any processor in the group is flagged. This is based on an
algorithm for finding the OR of N bits [22]. In Stage 1, the first processor of each group
disconnects its ports and never uses its West port, thereby disconnecting its group from the
previous group, if any (Figure 2.3). With each of the k groups now disconnected, Stage 2
boils down to finding the OR on k separate (N/k)-processor one-dimensional R-Meshes. If the
first processor of a group is flagged, then it indicates to the rest of the processors in the
group that nothing further need be done. This information can be broadcast on the unique
bus that runs through each group. If the first processor of a group is not flagged then we
proceed as follows. If a processor other than the first processor of the group is not flagged
(indicated by a ‘0’ in Figure 2.3), it connects its East and West ports; if a processor is flagged
(indicated by a ‘1’ in Figure 2.3), it disconnects its East and West ports (Figure 2.3, Stage
2). Each processor with a flag then sends a signal out its West port. The first processor in
the group listens on its East port, and marks its group as flagged if and only if it detects a
signal on its East port.
FIGURE 2.3: Finding the number of flagged groups on a one-dimensional R-Mesh (Stages 1
through 3 shown for Groups 0 through k − 1; in Stage 2, signals are sent out of West ports,
and the first processor in each group updates its value if it sees a signal on its East port)
In Stage 3, the number of flags is tallied. Since the first processor in each group now
contains all information for that group, this is accomplished using the well-known binary tree
paradigm for reduction algorithms [11, 22] to add the values contained in the first processor
in each group and store the final result in the first processor of the one-dimensional R-Mesh,
i.e., Processor 0. Figure 2.3 illustrates the algorithm with k = 8. In general, since this stage
of the algorithm is implemented as a balanced binary tree with k leaves (groups), it runs in
O(log k) time. Since the first two stages of the algorithm run in a constant number of steps,
the algorithm runs in O(log k) time. These time complexities hinge on the assumption of a
fast, constant delay bus.
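The three stages can be traced with a short sequential simulation. The following Python sketch is illustrative only: the function name count_flagged_groups and the list-based model of the bus are assumptions introduced here, and the sequential loops merely mimic what the processors of the R-Mesh would do in parallel. The second returned value counts the rounds of the Stage 3 reduction.

```python
def count_flagged_groups(flags, k):
    """Mimic the three-stage algorithm on an N-processor one-dimensional R-Mesh.

    flags is a list of 0/1 values, one per processor; k is the number of groups.
    Returns (number of flagged groups, number of Stage 3 reduction rounds).
    """
    n = len(flags)
    if k <= 0 or n % k != 0:
        raise ValueError("k must be positive and must divide N")
    size = n // k

    # Stage 1: the first processor (leader) of each group stops using its West
    # port, so every group is left with its own independent bus segment.
    leaders = range(0, n, size)

    # Stage 2: unflagged processors fuse East and West, flagged ones split the
    # bus and signal westward. The leader sees a signal on its East port exactly
    # when some other processor of its group is flagged (modeled here by any()).
    value = []
    for lead in leaders:
        signal_on_east = any(flags[lead + 1 : lead + size])
        value.append(1 if flags[lead] or signal_on_east else 0)

    # Stage 3: balanced binary-tree reduction over the k leaders; the sum ends
    # up in the leader of Group 0, i.e., Processor 0.
    rounds = 0
    stride = 1
    while stride < k:
        for i in range(0, k - stride, 2 * stride):
            value[i] += value[i + stride]
        stride *= 2
        rounds += 1
    return value[0], rounds


if __name__ == "__main__":
    flags = [0, 0, 0, 0,  0, 1, 0, 0,  1, 0, 0, 0,  0, 0, 0, 1]
    print(count_flagged_groups(flags, 4))  # (3, 2)
```

For k = 8 the reduction loop runs three rounds, in line with the O(log k) bound; Stages 1 and 2 correspond to a constant number of bus operations on the actual model.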
This example demonstrates the two main benefits of dynamic reconfiguration (Section 2.1).
In Stage 1 and in the steps of Stage 3, the processing elements connect their ports in a man-
ner determined a priori in order to optimally use resources for a given stage in the algorithm.
In Stage 2, however, the processing elements connect their ports in a manner determined by
the value of their flag; this allows the algorithm to take advantage of the given instance of
the problem.
This example also shows several key features of the R-Mesh model that allow it to be
dynamically reconfigurable. The first is the coarse granularity of the R-Mesh’s processing
elements, which allows them to execute basic instructions in a synchronous environment. In
addition, the processing elements can each change their configurations independently. This
connection autonomy is key to the power of the R-Mesh. The previous example uses a one-
dimensional R-Mesh. The two-dimensional (or higher dimensional) R-Mesh is even more
powerful. However, the arbitrarily shaped buses of the two-dimensional R-Mesh make it
difficult to realize [22]. Thus, while the R-Mesh provides a powerful reconfigurable model,
practical considerations lead us to the discussion in the next section of a currently realizable
reconfigurable platform, the Field Programmable Gate Array (FPGA).
2.1.2 Field Programmable Gate Arrays
A Field Programmable Gate Array (FPGA) is a reconfigurable architecture that extends the
functionality of a traditional Programmable Logic Device (PLD) [5]. While FPGAs were ini-
tially used for rapid prototyping, their ability to configure as any desired circuit allows them
to compete favorably with Application Specific Integrated Circuits (ASICs), particularly in
low to medium yield situations as the high manufacturing cost and slow design cycles of
ASICs are a major disadvantage [5]. As illustrated in Section 2.1.1, dynamic reconfiguration
holds tremendous benefits. While current FPGAs are capable of some limited dynamic re-
configuration, they are not as nimble as the R-Mesh in adapting to a given problem. In this
section we show how a solution to the pin limitation problem can considerably increase the
utility of FPGAs.
A typical FPGA structure consists of a two-dimensional mesh of configurable logic el-
ements connected by a configurable interconnection network [9]. Figure 2.4 shows such a
structure, where the Configurable Logic Blocks (CLBs) are the configurable functional ele-
ments, and the switches (S) are the configurable elements in the interconnection network.
FIGURE 2.4: A typical FPGA structure (a two-dimensional mesh of CLBs interleaved with
switches S)
Each CLB in an FPGA is sometimes subdivided into smaller configurable logic elements.
For example, the Xilinx Virtex-5 FPGA’s CLBs each contain two elements known as slices
(Figure 2.5). At the deepest level, the most basic functional element in an FPGA usually
consists of some combination of one or more Look-Up Tables (LUTs), combinational logic
gates, flip-flops, and other basic logic elements. In the Virtex-5, each slice contains four
64 × 1 LUTs, four flip-flops, an arithmetic and carry chain, and several multiplexers used to
combine the outputs of the LUTs [28]. Often the CLBs in an FPGA are also interspersed
with other functional units, such as small memory blocks, other adder chains, and
multipliers. Thus, a CLB can contain many configurable switches. Notwithstanding variations
in FPGA terminology, we will use the term “CLB” to denote the basic unit represented in
Figure 2.4.
FIGURE 2.5: Xilinx Virtex-5 configurable logic block [28] (a CLB with a switch matrix and
two slices, Slice(0) and Slice(1), each with CIN and COUT carry connections)
The FPGA’s interconnection network is typically a two-dimensional mesh of configurable
switches. As in a CLB, each switch S represents a large bank of configurable elements. The
state of all switches and elements within all CLBs is referred to as a “configuration” of
the FPGA. Because there is a large number of configurable elements in an FPGA (LUTs,
flip-flops, switches, etc.), a single configuration requires a large amount of information. For
example, the Xilinx Virtex-5 FPGA with a 240 × 108 array of CLBs requires on the order of
79 million bits for a single configuration [28, 29]. Unlike the coarse-grained processing
elements of the R-Mesh (Section 2.1.1), the FPGA’s CLBs are fine-grained functional elements
that are incapable of executing instructions or generating configuration bits internally. Thus,
configuration information must come from outside the chip. A limited amount of configura-
tion information can be stored in the chip as “contexts;” however, given the limited amount
of memory available on an FPGA for such a purpose, an application may require more con-
texts than can be stored on the FPGA. Hence, in most cases, configuration information must
still come from outside the chip.
When used as an electronic breadboard, an FPGA can be configured off-line with no
regard to the amount of time needed for configuration. In this work we deal primarily
with dynamic reconfiguration, for which an FPGA must reconfigure while an application is
executing on it. As we noted earlier, since most configuration information must come from
outside the chip and the number of bits needed for a configuration is large, reconfiguration
is time consuming. Because of this, only selected parts of the FPGA are configured in
order to avoid large overheads. This mode of reconfiguring is called partial reconfiguration
[2, 22, 27, 29].
In partial reconfiguration, the information entering the chip can be classified into two cat-
egories: (a) selection and (b) configuration. The selection information contains the addresses
of the elements that require reconfiguration, while the configuration information contains the
necessary bits to set the state of the targeted elements.
In order to facilitate partial reconfiguration, FPGAs are typically divided into sets of
frames, where a frame is the smallest addressable unit for reconfiguration. In current FPGAs,
a frame is typically one or more columns of CLBs. Currently, partial reconfiguration can
only address and configure a single frame at a time. If we assume that each CLB receives
the same number of configuration bits, say α, and the number of CLBs in each frame is
the same, say C, then the number of configuration bits needed for each frame is Cα. If the
number of bits needed for selecting a frame is b, then the total number of bits B needed to
reconfigure a frame is:
B = b + Cα
Since the granularity of reconfiguration is at the frame level, every CLB in a frame
would be reconfigured, regardless of whether or not the application required them to be
reconfigured. This can result in a “poorly-focused” selection of elements for reconfiguration,
as more elements than necessary are reconfigured in each iteration. This implies that a large
number of bits and a large time overhead are spent on the reconfiguration of each individual
frame. If the granularity of selection is made finer, i.e., if fewer CLBs are in each frame, then
the number of selection bits needed to address the frames increases while the number of
configuration bits for each frame decreases. However, this also increases the total number of
iterations necessary to reconfigure the same amount of area in the FPGA. As an illustration
of this trade-off, consider the following two examples of reconfiguring an 8 × 12 FPGA.
Example 2.1 Consider an 8 × 12 FPGA divided into 4 frames, where each frame contains
three columns of CLBs (Figure 2.6). Assume the number of configuration bits per CLB to
be α = 6. Since there are 4 frames, the number of selection bits is b = 2. Since each frame
contains 24 CLBs, the number of configuration bits for each frame is Cα = 144. If the
shaded CLBs shown in Figure 2.6 require reconfiguration, then we require 4 iterations of 146
bits each, for a total of 4 × (b + Cα) = 584 bits. (Recall that only one frame can be selected
at a time and that the entire frame must be reconfigured.) Since only 48 bits are necessary
to reconfigure the desired CLBs in each frame, each iteration of reconfiguration requires 96
bits more than necessary.
FIGURE 2.6: Reconfiguration of an 8 × 12 FPGA of Example 2.1 (Frames 0 through 3)
Example 2.2 As a second example, consider the same 8 × 12 FPGA divided this time into
12 frames, where each frame contains a single column of CLBs (Figure 2.7). Again, let α = 6.
Since there are 12 frames, the number of selection bits b is 4. Since each frame contains 8
CLBs, the number of configuration bits for each frame Cα is 48. If the shaded CLBs in
Figure 2.7 require reconfiguration, then we would need 8 iterations of 52 bits each, for a
total of 8 × (b + Cα) = 416 bits. Since only 24 bits are necessary to reconfigure the desired
CLBs in each frame, each iteration of reconfiguration requires 24 bits more than necessary.
FIGURE 2.7: Reconfiguration of an 8 × 12 FPGA of Example 2.2 (Frames 0 through 11)
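The arithmetic of Examples 2.1 and 2.2 follows directly from B = b + Cα. The Python sketch below reproduces both totals; the function name reconfiguration_bits and its parameter names are assumptions made for illustration, and the number of selection bits is taken to be the smallest b with 2^b at least the number of frames.

```python
def reconfiguration_bits(num_frames, clbs_per_frame, alpha, iterations):
    """Return (bits per iteration, total bits) for frame-based partial reconfiguration."""
    # b: selection bits needed to address one of num_frames frames.
    b = (num_frames - 1).bit_length()
    # B = b + C * alpha: selection bits plus configuration bits for a whole frame.
    bits_per_iteration = b + clbs_per_frame * alpha
    return bits_per_iteration, iterations * bits_per_iteration


if __name__ == "__main__":
    print(reconfiguration_bits(4, 24, 6, 4))   # Example 2.1: (146, 584)
    print(reconfiguration_bits(12, 8, 6, 8))   # Example 2.2: (52, 416)
```

Smaller frames lower the cost of each iteration (52 bits instead of 146) but double the number of iterations for the shaded pattern, which is the trade-off the two examples illustrate.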
Thus, targeting only those elements that require reconfiguration is desirable as it can
decrease the number of configuration bits (Cα). While this increases the number of selection
bits b, this increase is generally small compared to Cα. In the extreme, if a frame consisted
of a single CLB, then reconfiguration would require selection bits for both the rows and the
columns of the FPGA, but would reconfigure only those CLBs that required reconfiguration
based on the application. However, since a typical FPGA selects only one element at a
time, this is not practical as the number of iterations would be prohibitive. If the ability to
quickly select only those elements that require reconfiguration is not available, then a good
design choice must weigh the benefits of better focus (smaller frames in Example 2.2) against
the penalty of a larger number of iterations. Since different applications demand different
patterns of reconfiguration, a simple “one-size-fits-all” solution is seldom efficient. Conse-
quently, current FPGAs often do not fully exploit the benefits of dynamic reconfiguration
that a problem holds.
Another advantage of flexibility in selecting entities within an FPGA is in the area of
configuration contexts. A configuration context is a long stream of bits, one per configurable
element of the FPGA. This context is typically distributed across several LUTs. For example,
an FPGA with a 16-location context LUT in each CLB may hold 16 contexts, each distributed
over all context LUTs. Loading context 7, for example, would load the contents of location 7
from each of these context LUTs. Thus, the entire FPGA can hold no more than 16 different
contexts. If it is possible to select a location (say 7) from some of the LUTs, and a different
location (say 6) from the rest, $16^2 = 256$ contexts are possible. In the extreme, where each
LUT can be individually addressed, the flexibility approaches the connection autonomy of
the R-Mesh. As in partial reconfiguration, a good mechanism for selecting context LUTs is
advantageous here.
Pin limitation thus creates a severe restriction on the extent to which an FPGA can be
dynamically reconfigured. Clearly more pins will allow parallel input of several configuration
bits. We now explore possible solutions to the pin limitation constraint and introduce our
approach in the next section.
2.2 Approaches for the Pin Limitation Constraint
There are a number of ways of alleviating the effects of the pin limitation problem. These
include (1) multiplexing, (2) storing information within the design, and (3) decoding. The
first two approaches are discussed briefly. Our solution is the decoding approach.
Multiplexing: The concept of multiplexing refers to combining a large number of chan-
nels into a single channel. This can be accomplished in a variety of ways depending on
the technology. Each method assumes the availability of a very high speed, high band-
width channel on which the multiplexing is performed. For example, in the optical domain,
wavelength division multiplexing allows multiple signals of different wavelengths to travel si-
multaneously in a single waveguide. While some optoelectronic FPGAs have been proposed
[23, 24], this is far from practice. Time division multiplexing requires the multiplexed signal
to be much faster than the signals multiplexed. Used blindly, this is largely useless in the
FPGA setting, as it amounts to setting an unreasonably high clocking rate for the FPGA.
A more innovative use of multiplexing is described below.
Storing Information Within the Design: This attempts to alleviate the pin limita-
tion problem by generating most information needed for execution of an application inside
the chip itself (as opposed to importing it from outside the chip). This requires a more
“intelligent” chip. In an FPGA setting it boils down to an array of coarse-grained processing
elements rather than simple functional blocks (CLBs). One example is the use of virtual
wires [3], in which each physical wire corresponding to an I/O pin is multiplexed among mul-
tiple logical wires. The logical wires are then pipelined at the maximum clocking frequency
of the FPGA, in order to utilize the I/O pin as often as possible. Another example of such a
solution is the Self-Reconfigurable Gate Array (SRGA) architecture [12, 21]. However, this
approach is a significant departure from current FPGA architectures.
Decoders: A decoder is typically a combinational circuit that takes as input a relatively
small number of bits, say x bits, and outputs a larger number of bits, say n bits, according
to some mapping; such a decoder is called an x to n decoder. If the x inputs are pins to
the chip and the n outputs are expanded within the chip, a decoder provides the means to
deliver a large number of bits to the interior of the chip (see Figure 2.8). An x to n decoder
can clearly produce no more than $2^x$ output sequences.
FIGURE 2.8: An x to n decoder in an IC chip (x inputs arrive from outside the chip; the
general decoder produces n outputs inside the chip)
If $Z_n = \{0, 1, \ldots, n-1\}$, then each output sequence of the decoder can be interpreted
as a subset of $Z_n$. Let $\mathcal{S}$ be a set of “desired” subsets of $Z_n$ that need to be
generated by the decoder. For example, let $n = 8$ and $Z_n = \{0, 1, 2, \ldots, 7\}$. For this
example, Table 2.2 shows a decoder input sequence with four corresponding sets of subsets,
$\mathcal{S}_0, \mathcal{S}_1, \mathcal{S}_2, \mathcal{S}_3$, where:
$\mathcal{S}_0 = \{\{0\}, \{1\}, \{2\}, \ldots, \{7\}\}$, the subsets for a 1-hot decoder,
$\mathcal{S}_1 = \{\{0, 2, 4, 6\}, \{1, 3, 5, 7\}, \ldots, \emptyset\}$, the subsets representative of
ASCEND/DESCEND communication patterns [1],
$\mathcal{S}_2 = \{\{0, 1, 2, 3, 4, 5, 6, 7\}, \{0, 1, 2, 3\}, \ldots, \{2, 3, 4, 5\}\}$, the subsets
representative of a type of reduction [11, 22] and
$\mathcal{S}_3 = \{\{0, 2, 3\}, \{1, 4, 7\}, \ldots, \{1, 2, 3, 4, 5, 6\}\}$, an “arbitrary” collection of
subsets.
For example, if the decoder produces $\mathcal{S}_2$, then, for input ‘100’, the output is ‘11110000’.
TABLE 2.2: An illustration of a decoder for four sets $\mathcal{S}_i$ of subsets of $Z_8$
Decoder Inputs   $\mathcal{S}_0$   $\mathcal{S}_1$   $\mathcal{S}_2$   $\mathcal{S}_3$
000              00000001          01010101          11111111          00001101
001              00000010          10101010          00001111          10010010
010              00000100          00110011          00000011          10100010
011              00001000          11001100          00000001          00111101
100              00010000          00001111          11110000          01001110
101              00100000          11110000          11000000          11010001
110              01000000          11111111          10000000          11100001
111              10000000          00000000          00111100          01111110
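Functionally, such a decoder is a mapping from x-bit inputs to n-bit outputs, and each output names a subset. The Python sketch below models the S2 column of Table 2.2 as a lookup; the names S2_TABLE, bits_to_subset, and decode are hypothetical, and the convention that the leftmost bit stands for element n − 1 is assumed because it agrees with the subsets listed for the table.

```python
# Column S2 of Table 2.2, keyed by the 3-bit decoder input.
S2_TABLE = {
    "000": "11111111", "001": "00001111", "010": "00000011", "011": "00000001",
    "100": "11110000", "101": "11000000", "110": "10000000", "111": "00111100",
}


def bits_to_subset(bits):
    """Interpret an n-bit string as a subset of {0, ..., n-1}.

    The leftmost character is taken to be element n-1, the rightmost element 0.
    """
    n = len(bits)
    return {n - 1 - pos for pos, bit in enumerate(bits) if bit == "1"}


def decode(table, x):
    """Return the subset selected by decoder input x under the given table."""
    return bits_to_subset(table[x])


if __name__ == "__main__":
    print(sorted(decode(S2_TABLE, "100")))  # [4, 5, 6, 7]
    print(sorted(decode(S2_TABLE, "111")))  # [2, 3, 4, 5]
```

With x = 3 inputs the table has exactly $2^3 = 8$ rows, so no further subsets can be produced without changing the mapping itself.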
In an FPGA setting a subset $S_j \in \mathcal{S}$ typically represents a subset of CLBs that need
to be reconfigured. To accomplish this reconfiguration in one iteration, the decoder must
generate a superset $\hat{S}_j$ of $S_j$, i.e., $\hat{S}_j \supseteq S_j$. Then, reconfiguring all CLBs of
$\hat{S}_j$ and reloading existing states of all CLBs of $\hat{S}_j - S_j$ achieves the desired effect.
We now identify some “performance properties” of a decoder operating in this setting.
• Cost: The hardware cost of the decoder, typically given as the number of gates.
• Speed: The amount of time needed to generate $\hat{S}_j$ for any $S_j \in \mathcal{S}$. This could be
some function of the delay of the decoder and the number of iterations over which the
decoder generates $\hat{S}_j$.
• Focus: This is $\max\{|\hat{S}_j - S_j| : S_j \in \mathcal{S}\}$. This parameter measures how accurately the
decoder can generate the required subset $S_j$.
• Flexibility: This is an indication of how easily $\mathcal{S}$ can be altered.
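Of these properties, focus has the most direct computational reading: it is the largest number of extra elements that any generated superset carries. A minimal Python sketch follows, assuming the desired subsets and the generated supersets are supplied as two parallel lists of sets; the function name focus and this list representation are illustrative choices, not notation from the thesis.

```python
def focus(desired, generated):
    """Return the maximum of |S_hat - S| over paired desired/generated subsets."""
    worst = 0
    for s, s_hat in zip(desired, generated):
        if not s <= s_hat:
            raise ValueError("each generated set must be a superset of the desired set")
        # Elements of s_hat - s are reconfigured (and reloaded) unnecessarily.
        worst = max(worst, len(s_hat - s))
    return worst


if __name__ == "__main__":
    desired = [{0, 2, 3}, {1, 4, 7}]
    generated = [{0, 1, 2, 3}, set(range(8))]
    print(focus(desired, generated))  # 5
```

A value of 0 means every desired subset is generated exactly; larger values mean more elements are selected without need.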
Current decoders in FPGAs are 1-hot decoders that expand log n input bits to n output
bits with only one of the n output bits set to 1 (such as the set $\mathcal{S}_0$ in Table 2.2). These have
a low Θ(n log n) cost and high speed (Θ(log n) delay per iteration). However, as noted in
Section 2.1.2, this allows FPGAs to only reconfigure one frame at a time (requiring multiple
iterations), and is also not flexible (as the decoder cannot produce any other set of subsets
$\mathcal{S}_i$). Because a frame in current FPGAs contains many CLBs, the focus of current FPGA
decoders can be very poor (as was illustrated in Examples 2.1 and 2.2). Thus, we look to
design a decoder with low cost (Θ(n log n)) and high speed (Θ(log n) delay per iteration)
that is flexible and can provide a higher degree of focus for addressing interesting sets (such
as $\mathcal{S}_3$ in Table 2.2). Achieving all this will also ensure a small number of iterations. We
begin a discussion of our solution in the next chapter.
Chapter 3
Preliminaries
As we observed in Chapter 2, the approach we adopt is the use of decoders to alleviate the
pin limitation constraint. Unlike conventional “fixed” decoders, we propose a “configurable”
decoder that has the speed and low cost of fixed decoders but with considerably higher
flexibility and focus in selection.
This chapter introduces the basic ideas needed for the remainder of this thesis. We begin
in Section 3.1 by outlining the assumptions and notation used throughout our work. In
Section 3.2, we define the building blocks that are used to construct the various versions of
a configurable decoder. Finally, in Section 3.3 we introduce the structure of the configurable
decoder itself before providing more details in subsequent chapters.
3.1 Assumptions and Notation
A configurable decoder is a combinational circuit (with the exception of the bit-slice units
detailed in Section 5.3), that, in order to achieve a degree of flexibility, uses Look-Up Tables
(LUTs). While LUTs could be implemented using sequential elements, for this work, LUTs
are functionally identical to combinational memory such as ROMs. Some of the ideas we
discuss here assume the configurable decoder to be a combinational circuit. The minor
extensions needed for the sequential circuits of Section 5.3 will be discussed later.
To allow us to work at a conveniently abstract level while accounting for realistic con-
straints, we make the following assumptions.
1. All gates are assumed to have a constant fan-in and fan-out of at least 2; that is, the
maximum number of inputs to a gate and the maximum number of other gates driven
by the output of a given gate are independent of the problem size. When the fan-out
of a signal in a circuit exceeds the driving capacity of a gate, buffers are inserted into
the design. These additional buffers increase the cost and delay of the circuit. Gates
typically have a fixed number of inputs. Realizing gates with additional inputs boils
down to constructing a tree of gates. Assuming a nonconstant fan-in and fan-out would
ignore the additional gate cost and delay imposed by these elements.
2. We assume that each instance of a gate has unit cost and delay. While the cost and
delay of some logic gates (such as XOR) are certainly larger than the cost and delay of
smaller logic gates (such as NAND in some technologies), the overall number of gates in
the circuit and the depth of the circuit provide a better measure of the circuit’s costs,
rather than factors arising from choices specific to a technology and implementation.
3.1.1 Performance Parameters
We divide the performance parameters into two categories: independent parameters and
problem dependent parameters. Independent parameters are applicable to all circuits, while
problem dependent parameters are specific to decoders. All parameters are expressed in
terms of their asymptotic complexity to avoid minor variations due to technology and other
implementation-specific details.
Independent Parameters:
Gate Cost G: the gate cost of a circuit is the number of gates (AND, OR, NOT) in it.
Clearly, the use of other gates such as NAND, XOR, etc. will not alter the gate cost
expressed in asymptotic notation.
Delay D: the delay or time cost of a combinational circuit is the length of the longest path
from any input of the circuit to any output.
Problem Dependent Parameters: As we noted earlier in Chapter 2, the basic function
of a decoder can be interpreted as that of selecting subsets of a set. Consider an x to
n decoder. Functionally, this decoder accepts x input bits and produces n output bits,
where $x \ll n$. Since the decoder is a combinational circuit, x input bits produce at most
$2^x$ different outputs. Each n-bit output can be interpreted as a subset of an n-element set¹
$Z_n = \{0, 1, \ldots, n-1\}$ using the standard characteristic function representation, that is, each
bit position of the n-bit output corresponds to an element of $Z_n$ and the bit value indicates
membership status. As an example of this notation, consider $n = 8$ and $Z_n = \{7, 6, \ldots, 0\}$.
Then an 8-bit binary string ‘00011001’ represents the subset {4, 3, 0}, where the leftmost
position is for element 7 and the rightmost for element 0.
¹ We call a set of n elements an n-set and a subset of an n-set an n-subset.
For any integer $n \ge 1$, let $Z_n = \{0, 1, \ldots, n-1\}$ be the set whose subsets the decoder
is to produce. Let $\wp(Z_n)$ denote the powerset of $Z_n$, that is, the set of all subsets of $Z_n$;
clearly,² $|\wp(Z_n)| = 2^n$. We can represent the elements of $\wp(Z_n)$ by the $2^n$ values of an
n-bit string. Let the decoder produce the set $\mathcal{S}' \subseteq \wp(Z_n)$ of subsets of $Z_n$ and let $|\mathcal{S}'| = \Lambda$.
In summary, the problem is to construct a decoder with x-bit inputs and n-bit outputs
such that the set S’ of subsets it generates “is sufficient” for the application at hand. Different
applications require different sets of subsets of Zn, and do so with different constraints on
speed and cost.
For a configurable decoder, a portion of the hardware can be changed (off-line). This
allows one to freely select a portion of the subsets produced by the decoder. Let S ⊆ S ′
denote the portion of subsets that can be produced independently (by configuring the decoder
in any manner of choice). Ideally, S = S ′, but in some variants of the configurable decoder,
this may make the cost prohibitive (see Section 3.3). In others, ∅ ⊂ S ⊂ S ′; that is, the
decoder allows some of the subsets to be chosen arbitrarily while others are defined by this
choice. In fixed (non-reconfigurable) decoders, S = ∅. From this perspective, we define the
following two parameters that are specific to decoders.
1. Number of independent subsets: λ = |S|
2. Total number of subsets: Λ = |S ′|
Since flexibility is important for a reconfigurable platform, the number of independent sub-
sets, rather than simply the total number of subsets, is emphasized in this thesis.
² For any set S, we denote its cardinality by |S|.
3.1.2 Other Notation and Concepts
The following notation is used throughout this thesis. In general we will denote inputs,
outputs, and intermediate signals by uppercase letters such as A,B,Q, etc. Each of these
signals could be several bits wide. The number of bits in a signal A is denoted #(A), and
the bits themselves are denoted by A(#(A) − 1), . . . , A(1), A(0). The signal A can take up
to $2^{\#(A)}$ values. We will denote the set of values that signal A can have also by the letter
A; typically the context will make the distinction clear. Thus, if a is a #(A)-bit value that
signal A can have, then we will say that a ∈ A.
We now briefly discuss some well known concepts, as they will be used extensively in
Chapter 4.
All logarithms are to base 2. For any $n > 0$ and any integer $k \ge 0$, $\log^k n = (\log_2 n)^k$,
whereas $\log^{(k)} n = \underbrace{\log \log \cdots \log}_{k \text{ times}} n$. Note that
$\log^{(k)} n \ne \log^k n \ne \log n^k = k \log n$.
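Because the three logarithmic forms are easy to confuse, a small numeric check may help. The Python sketch below evaluates the power of a logarithm, the iterated logarithm, and the logarithm of a power for n = 65536 and k = 2; the function names log_power and iterated_log are assumptions introduced only for this illustration.

```python
from math import log2


def log_power(n, k):
    """Power of a logarithm: (log n) raised to the k-th power."""
    return log2(n) ** k


def iterated_log(n, k):
    """Iterated logarithm: log applied k times in succession to n."""
    value = n
    for _ in range(k):
        value = log2(value)
    return value


if __name__ == "__main__":
    n, k = 65536, 2
    print(log_power(n, k))     # (log n)^2   = 16^2      = 256
    print(iterated_log(n, k))  # log log n   = log 16    = 4
    print(log2(n ** k))        # log(n^2)    = 2 * log n = 32
```

The three values (256, 4, and 32) differ widely even for this modest n, which is why the notation is fixed before it is used in later cost expressions.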
Let S be an n-element set. A partition [17] of S is a division of the elements of the set
into disjoint non-empty subsets, $S_0, S_1, \ldots, S_{k-1}$. More formally, a partition $\pi$ of set S is a
set of nonempty subsets $\{S_0, S_1, \ldots, S_{k-1}\}$ such that $\bigcup_{i=0}^{k-1} S_i = S$ and
$S_i \cap S_j = \emptyset$, for $i \ne j$. A partition $\pi$ with k blocks ($1 \le k \le n$) is called a
k-partition of S. For example, a 3-partition of the set $S = \{7, 6, 5, 4, 3, 2, 1, 0\}$ is
$\pi = \{\{7, 6, 5, 4\}, \{3, 2\}, \{1, 0\}\}$.
A useful operation on partitions is the product of two partitions $\pi_1$ and $\pi_2$, which can be
defined as follows. Let $\pi_1 = \{S_0, S_1, \ldots, S_k\}$ and $\pi_2 = \{P_0, P_1, \ldots, P_\ell\}$ be partitions of
set S. Define the product $\pi_1 \pi_2$ of $\pi_1$ and $\pi_2$ to be a partition $\{Q_0, Q_1, \ldots, Q_m\}$ such that
for any block $Q_h \in \pi_1 \pi_2$, elements $a, b \in Q_h$ if and only if there exist blocks $S_i \in \pi_1$ and
$P_j \in \pi_2$ such that $a, b \in S_i \cap P_j$. As an example, consider the partitions
$\pi_1 = \{\{7, 6, 5, 4\}, \{3, 2\}, \{1, 0\}\}$ and $\pi_2 = \{\{7, 6\}, \{5, 4, 3, 2\}, \{1, 0\}\}$. Then
$\pi_1 \pi_2 = \{\{7, 6\}, \{5, 4\}, \{3, 2\}, \{1, 0\}\}$.
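The product can be computed by intersecting every block of one partition with every block of the other and keeping the non-empty results. The following Python sketch does this for the example above; the function name partition_product and the use of frozensets to represent blocks are assumptions made here for illustration.

```python
def partition_product(pi1, pi2):
    """Return the product of two partitions of the same set.

    Each partition is an iterable of blocks (sets). Two elements share a block
    of the product exactly when they share a block in both pi1 and pi2.
    """
    product = set()
    for s in pi1:
        for p in pi2:
            common = frozenset(s) & frozenset(p)
            if common:  # empty intersections are not blocks
                product.add(common)
    return product


if __name__ == "__main__":
    pi1 = [{7, 6, 5, 4}, {3, 2}, {1, 0}]
    pi2 = [{7, 6}, {5, 4, 3, 2}, {1, 0}]
    for block in sorted(partition_product(pi1, pi2), key=max, reverse=True):
        print(sorted(block, reverse=True))
    # [7, 6]
    # [5, 4]
    # [3, 2]
    # [1, 0]
```

The result has four blocks, one more than either factor, since the product is at least as fine as each of the partitions it is formed from.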
3.2 Building Blocks
In this section we define and analyze some basic hardware structures that will serve as
building blocks for subsequent designs. Specifically, we describe fan-in and fan-out circuits,
1-hot decoders, multiplexers, look-up tables (LUTs), shift registers, and modulo-α counters.
As these building blocks are used extensively in subsequent chapters, we summarize their
asymptotic costs and delays in Table 3.1.
TABLE 3.1: Asymptotic gate cost and delay of building blocks
Building Block                          Gate Cost             Delay
Fan-in and Fan-out                      $O(fz)$               $O(\log f)$
1-hot Decoder                           $O(z 2^z)$            $O(z)$
Multiplexer                             $O(z 2^z)$            $O(z)$
LUT                                     $O(2^z (z + m))$      $O(z + \log m)$
Shift Register $(\alpha, z/\alpha)$     $O(z)$                $\Theta(1)$
Modulo-$\alpha$ counter                 $O(\log^2 \alpha)$    $O(\log \log \alpha)$
z = number of input bits.
m = number of output bits.
f = fan-in or fan-out of a signal.
3.2.1 Fan-in and Fan-out
While we assume that the fan-in and fan-out of logic gates is a constant, signals may have
to be fanned in from or fanned out to a non-constant number of places. In these cases, gates
are inserted into the design to provide additional driving and fan-in capabilities. Since these
additional elements increase the cost and delay of the circuit, we discuss a general method
to fan-in a signal from more than a constant number of places and fan-out a signal to more
than a constant number of places.
If the number of places the signal is fanned-in from or fanned-out to is f and the width
of the signal is z-bits, we denote this as a fan-in and fan-out problem of degree f and width
z, respectively. The fan-in and fan-out problems can be stated as follows.
For integers f, z ≥ 1, let U_0, U_1, . . . , U_{f−1} be f signals, each z bits wide. A “fan-in
circuit” of degree f and width z (Figure 3.1(a)) produces a z-bit output W such that for
any 0 ≤ i < z,
W(i) = U_0(i) ◦ U_1(i) ◦ . . . ◦ U_{f−1}(i),
where ◦ is an associative Boolean binary operation.
FIGURE 3.1: Fan-in problem (a) and fan-out problem (b) of degree f and width z
For integers f, z ≥ 1, let U be a signal z bits wide. A “fan-out circuit” of degree f
and width z (Figure 3.1(b)) produces f z-bit outputs W_0, W_1, . . . , W_{f−1} such that for any
0 ≤ i < z and 0 ≤ j < f,
W_j(i) = U(i).
Using a standard balanced tree construction, we have the following result.
Lemma 3.1 Fan-in and fan-out circuits of degree f and width z can be constructed with a
gate cost of O(fz) and a delay of O(log f).
Proof: A balanced binary tree with f leaves has Θ(f) nodes and Θ(log f) depth. Using z
such trees, one per bit of the signal, results in Θ(fz) nodes, with each tree having a depth of Θ(log f).
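The counting argument in the proof can be checked with a small simulation. The sketch below assumes 2-input gates (one possible constant fan-in) and uses hypothetical helper names fan_in_tree and fan_in; it builds one balanced tree per bit and reports the number of gates and the depth.

```python
from operator import and_


def fan_in_tree(bits, op):
    """Combine bits with a balanced tree of 2-input gates.

    Returns (result, gate_count, depth).
    """
    level = list(bits)
    gates = 0
    depth = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))
            gates += 1
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes to the next level
        level = nxt
        depth += 1
    return level[0], gates, depth


def fan_in(signals, op):
    """Fan-in of degree f = len(signals) and width z = len(signals[0])."""
    z = len(signals[0])
    outputs, total_gates, depth = [], 0, 0
    for i in range(z):
        w, gates, d = fan_in_tree([u[i] for u in signals], op)
        outputs.append(w)
        total_gates += gates
        depth = max(depth, d)
    return outputs, total_gates, depth


signals = [[1, 1, 0, 1] for _ in range(8)]  # f = 8 signals, z = 4 bits each
signals[5] = [0, 1, 0, 1]
w, gates, depth = fan_in(signals, and_)
print(w, gates, depth)  # [0, 1, 0, 1] 28 3
```

With f = 8 and z = 4 the model uses z(f − 1) = 28 gates at depth log f = 3, in line with the O(fz) cost and O(log f) delay of Lemma 3.1.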
3.2.2 Fixed Decoders – 1-hot Decoders
As stated previously (Section 2.2), one method of rating the performance of a decoder is
its flexibility, that is, how easily the set of outputs of the decoder (that is, the set S) can
be changed. This divides decoders into two broad classifications: fixed decoders, which are
inflexible (S cannot be changed), and configurable decoders, where S can be changed in
some manner (typically off-line). This section explores an important fixed decoder known
as a 1-hot decoder; many ideas developed for 1-hot decoders are applicable to other fixed
decoders as well. Section 3.3 will introduce implementations of configurable decoders.
In a z to n fixed decoder (Figure 3.2), the manner in which the z bits of the input signal
U are expanded to the n-bit output signal W is fixed at manufacture. Since there are 2^z
possible values for a z-bit signal, the above fixed decoder can have up to 2^z possible n-bit
outputs.
FIGURE 3.2: A fixed z to n decoder
TABLE 3.2: Two possible 1-hot bit patterns for z = 3, n = 8
z-bit Input Active High Output Active Low Output
000 00000001 11111110
001 00000010 11111101
010 00000100 11111011
011 00001000 11110111
100 00010000 11101111
101 00100000 11011111
110 01000000 10111111
111 10000000 01111111
A common selection pattern that is desirable for many applications is the 1-hot bit
pattern, an example of which can be seen in Table 3.2. For a 1-hot decoder, n = 2^z and each
of the n-bit patterns has only one active bit (usually with a value of ‘1’), all other bits being
inactive (usually ‘0’). This decoder selects one element at a time. Usually, such a decoder
also has a select input that allows the output set to be ∅. The 1-hot decoder is used so often
that the term “decoder” is usually taken to mean a 1-hot decoder. Currently, FPGAs also
use 1-hot decoders to select frames during partial reconfiguration (Section 2.1.2).
A simple log n to n 1-hot decoder implementation (for example, [6, 26]) consists of n
AND gates with true and complementary versions of the input bits. Since a bit of the output
signal must be ‘1’ if and only if all other output bits are ‘0’, the AND gate corresponding
to a given bit in the output has a unique sequence of true and complementary versions of
the input bits. Figure 3.3 illustrates this for a 4 to 16 1-hot decoder. The basic idea of
this implementation is to identify the min-term that causes a bit to be active. Since each
output is active on exactly one input combination, a single gate (of sufficiently large fan-in)
suffices. In general, for a z to 2^z 1-hot decoder, each input fans out to 2^z gates and each
gate accepts z inputs. Thus, a general implementation has the form shown in Figure 3.4.
In Chapter 9 we outline a more sophisticated approach that implements a 1-hot decoder
of Θ(n) cost. For the implementation of Figure 3.4, we have the following result.
Lemma 3.2 For any z ≥ 1, a z to 2^z 1-hot decoder can be implemented as a circuit with
a cost of O(z 2^z) and a delay of O(z).
Proof: The fan-out circuit of Figure 3.4 has a delay of Θ(z) and a cost of Θ(z 2^z) (see Lemma 3.1).
Each of the 2^z fan-in circuits of Figure 3.4 has a delay of Θ(log z) and a cost of Θ(z). So,
the total delay is Θ(z + log z) = Θ(z) and the total cost is Θ(z 2^z + z 2^z) = Θ(z 2^z).
Remark: Often larger decoders are built using smaller decoders as building blocks. This
amounts to using the construction of Figure 3.4.
In general, it is difficult to predict the exact cost of a fixed decoder. Our results (see
Chapters 4 and 5) use one low-cost class of fixed decoders, in which input bits are simply
fanned out to form output bits.
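The min-term construction of the 1-hot decoder described in this section can be modeled behaviorally as follows. This is a minimal sketch rather than a gate-level design; the function names one_hot_decode and as_string and the enable argument are assumptions chosen for the example.

```python
def one_hot_decode(u, z, enable=True):
    """Return the output bits W(0), ..., W(2**z - 1) for a z-bit input u."""
    outputs = []
    for j in range(2 ** z):
        # Output j is the min-term of j: the true input bit is used where
        # j has a 1 and the complemented input bit where j has a 0.
        literals = [
            ((u >> i) & 1) if ((j >> i) & 1) else 1 - ((u >> i) & 1)
            for i in range(z)
        ]
        outputs.append(int(enable and all(literals)))  # z-input AND gate
    return outputs


def as_string(outputs):
    """Format the outputs MSB first, W(n-1) ... W(0), as in Table 3.2."""
    return "".join(str(b) for b in reversed(outputs))


print(as_string(one_hot_decode(0b011, 3)))         # 00001000
print(as_string(one_hot_decode(0b011, 3, False)))  # 00000000
```

Each of the 2^z outputs reads all z input bits, which is the source of the O(z 2^z) cost in Lemma 3.2.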
3.2.3 Multiplexers
A multiplexer (MUX) is a combinational circuit that selects information from many inputs
and directs it to a single output line (for example, [6]). In general, a 2^z to 1 multiplexer
takes 2^z data bits³ and, using z control bits, selects one of the 2^z data inputs and directs it
to a single output line (Figure 3.5).
³ For the purpose of this work, we will assume that the multiplexer takes in as input 2^z 1-bit data signals.
In general, these signals could be replaced with signals of any width w with no change to its delay but with
an added Θ(w) factor to the gate cost.
FIGURE 3.3: A logic circuit for a 4 to 16 1-hot decoder. Note the use of an enable signal
to force the output of the decoder to ∅.
FIGURE 3.4: General implementation of a 1-hot decoder. The z input bits are fanned out to
2^z places, and a fan-in of z bits to 1 bit (an AND gate) is repeated 2^z times to produce
W(0), W(1), . . . , W(2^z − 1).
A 2^z to 1 multiplexer can be constructed as a combinational circuit using 2^z AND gates,
each with (z + 1) inputs, and a 2^z-input OR gate. An example of such a multiplexer with
four inputs is shown in Figure 3.6. Each of the four data inputs, U_0, U_1, U_2, U_3, is selected
via an AND gate and a combination of the two control bits V(0) and V(1), much like in the
1-hot decoder. The logic in Figure 3.6 generalizes to the following result.
Lemma 3.3 A 2^z to 1 multiplexer can be implemented as a circuit with a gate cost of O(z 2^z)
and a delay of O(z).
Proof: All z selection bits are required to select one of the 2^z inputs to the multiplexer.
This requires a fan-out of the z selection bits to 2^z places, with a gate cost of
O(z 2^z) and a delay of O(log 2^z) = O(z) (Lemma 3.1). Each of the 2^z inputs is then combined with
the z selection bits. This requires 2^z AND gates, each with a fan-in of z + 1 bits.
By Lemma 3.1, this implies a gate cost of O(z 2^z) and a delay of O(log z). Finally,
the 2^z bits resulting from the previous fan-in operations must be fanned in to a single
output using an OR gate. By Lemma 3.1, this has a gate cost of O(2^z) and a delay of O(z).
Overall, the multiplexer has a gate cost of O(z 2^z + z 2^z + 2^z) = O(z 2^z) and a delay of
O(z + log z + z) = O(z).
FIGURE 3.5: Multiplexer block diagram
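The AND-OR structure used in the proof can be mirrored in a short behavioral model. The sketch below is illustrative only; the function name mux and the convention that the control bits are passed as a list with V(0) first are assumptions.

```python
def mux(data, select_bits):
    """2**z to 1 multiplexer built as 2**z AND gates feeding one OR gate.

    data        -- list of 2**z data bits U_0, U_1, ...
    select_bits -- list of z control bits, V(0) first
    """
    z = len(select_bits)
    assert len(data) == 2 ** z
    and_outputs = []
    for j, d in enumerate(data):
        # Min-term of j over the control bits, as in the 1-hot decoder.
        literals = [
            select_bits[i] if (j >> i) & 1 else 1 - select_bits[i]
            for i in range(z)
        ]
        and_outputs.append(int(d and all(literals)))  # (z + 1)-input AND
    return int(any(and_outputs))  # 2**z-input OR


data = [0, 1, 1, 0]
selected = [mux(data, [j & 1, (j >> 1) & 1]) for j in range(4)]
print(selected)  # [0, 1, 1, 0]
```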
3.2.4 Look-up Table
A 2^z × m look-up table (LUT) is a storage device (Figure 3.7) with m2^z storage cells organized
as 2^z m-bit words. This LUT takes a z-bit input that addresses the 2^z locations and outputs an
m-bit word. While a LUT can act as a basic memory device, LUTs have a variety of other
applications, such as implementing small logic functions. A 2^z × m LUT can implement any
m Boolean functions of z variables by storing their truth tables [6]. This is of particular use in
FPGAs, where Static Random Access Memory (SRAM) based LUTs with four to six inputs
are commonly used to implement Boolean functions [19].
While LUTs can be implemented in a variety of ways, all LUTs require the same two
components: a memory array and a method of addressing a word in the memory array. One
possible method of addressing the LUT is to use a z to 2^z 1-hot decoder. The output of
the 1-hot decoder activates a word line and enables the outputs of the memory storage cells.
The memory storage cell outputs are then fanned in to form an m-bit output word
(Figure 3.8).
FIGURE 3.6: A 4 to 1 multiplexer circuit
This implementation is independent of the choice of memory storage elements. SRAM-based
LUTs are perhaps the most common implementation; however, with minimal modifications
this basic design can easily accommodate other memory cell types. Dynamic Random
Access Memory (DRAM) based LUTs would require the addition of sense amplifiers and
write line decoders; Read-Only Memory (ROM) such as Flash, Erasable Programmable
ROMs (EPROM), or Electrically Erasable Programmable ROMs (EEPROM) would require
an additional layer of polysilicon and some additional column logic [26]. LUTs composed of
sequential elements are also possible; however, this would require the use of a clock. This
clock can be independent of any other clock in the system. Regardless of the implementation
chosen, the asymptotic cost of the structure is unchanged; choices in memory technology only
alter the size and access times of the LUT by a constant factor. Thus, we may consider the
LUT to be a combinational element as stated in Section 3.1.
FIGURE 3.7: Look-up table block diagram
FIGURE 3.8: Look-up table implementation. A z to 2^z 1-hot decoder with an enable input
drives the word lines of the memory array, and OR gates fan in the columns of the array to
form the m output bits W(m − 1), W(m − 2), . . . , W(0).
Lemma 3.4 A 2^z × m look-up table can be implemented as a circuit with a gate cost of
O(2^z (z + m)) and a delay of O(z + log m).
Proof: Using the implementation described previously and shown in Figure 3.8, the LUT
consists of two modules: the decoder and the memory array. By Lemma 3.2, a z to 2^z 1-hot
decoder has a gate cost of O(z 2^z) and a delay of O(z). Each of the outputs of the decoder
selects a single row of the memory array and drives a word line. The selection of all elements
in a row of the memory array requires a fan-out of degree m that occurs at most 2^z times,
which by Lemma 3.1 results in a gate cost of O(m 2^z) and a delay of O(log m).
As the LUT has 2^z rows, each with m storage elements, the memory array has a gate cost
of O(m 2^z). When a row in the memory array is selected, each of the m bits
must be fanned in to the output from the 2^z rows. This results in a fan-in of degree 2^z that
occurs m times, resulting in a gate cost of O(m 2^z) and a delay of O(z) (Lemma 3.1).
The overall gate cost is thus O(z 2^z + m 2^z + m 2^z + m 2^z) = O(2^z (z + m)) while the overall
delay is O(z + log m + z) = O(z + log m).
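The decoder-plus-memory-array organization can be illustrated with a small behavioral model that also shows how a LUT stores the truth tables of m Boolean functions. The class LUT, its method names, and the two example functions (parity and majority) are assumptions made for this sketch.

```python
class LUT:
    """A 2**z x m look-up table: 2**z stored words of m bits each."""

    def __init__(self, z, m, words):
        self.z, self.m = z, m
        self.words = list(words)
        assert len(self.words) == 2 ** z
        assert all(0 <= w < 2 ** m for w in self.words)

    @classmethod
    def from_functions(cls, z, functions):
        """Store the truth tables of the given functions of z variables."""
        words = []
        for address in range(2 ** z):
            inputs = [(address >> i) & 1 for i in range(z)]
            word = 0
            for bit, f in enumerate(functions):
                word |= (f(inputs) & 1) << bit
            words.append(word)
        return cls(z, len(functions), words)

    def read(self, address, enable=True):
        # 1-hot decoder: at most one word line is active.
        word_lines = [int(enable and a == address) for a in range(2 ** self.z)]
        out = 0
        for bit in range(self.m):
            # Fan-in of degree 2**z: OR over all rows of (word line AND cell).
            if any(wl and (w >> bit) & 1 for wl, w in zip(word_lines, self.words)):
                out |= 1 << bit
        return out


def parity(x):
    return sum(x) % 2


def majority(x):
    return int(sum(x) >= 2)


lut = LUT.from_functions(3, [parity, majority])
print(lut.read(0b101))  # 2: parity = 0 (bit 0), majority = 1 (bit 1)
```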
3.2.5 Shift Register
An α-position shift register of width z/α (Figure 3.9), denoted by SR(α, z/α), accepts as input a
z-bit signal and, every clock cycle, outputs a z/α-bit slice of the input signal, for α clock cycles.
FIGURE 3.9: An α-position shift register of width z/α
Figure 3.10 illustrates an implementation of SR(α, z/α). At each clock cycle, if the value of
the input signal serialize is ‘0’, each z/α-bit register either shifts its contents to the z/α-bit register
to its left or, if it is the last register in the chain, outputs its contents (via signal W_{α−1}).
When serialize is asserted, a new value is stored in the α z/α-bit registers. The shift register
serializes the z-bit signal U based on the value of α; clearly, if α = z, the shift register
outputs each bit of the input signal sequentially. When serialize = 0, a signal can also be
serially shifted in, z/α bits at a time, and output in parallel through lines W_{α−1}, W_{α−2}, . . . , W_0
after α cycles. From this construction, we have the following result.
Lemma 3.5 An α-position shift register of width z/α, SR(α, z/α), can be realized as a circuit
with a gate cost of O(z) and a constant delay between clock cycles.
FIGURE 3.10: An implementation of an SR(α, z/α)
Proof: The shift register consists of α banks of z/α registers, each of which is constructed
from a constant number of gates and flip-flops, implying a gate cost of O(α(z/α)) = O(z) and
a constant delay. By Lemma 3.3, each of the 2 to 1 MUXs of width z/α has an O(z/α) gate
cost and a constant delay. During a change of state, that is, a shifting of the contents of the
registers or an input of a new signal, all propagation paths have a constant fan-out, implying
a constant delay. Thus, the overall gate cost is O(z + α(z/α)) = O(z) and the overall delay is
constant between clock cycles.
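The behavior of SR(α, z/α) can be sketched with a cycle-level model. The sketch assumes that slice i of U (counting from the least significant end) is loaded into register i, that the output is read from register α − 1, and that register 0 receives the shift-in value; these conventions and the class name ShiftRegister are assumptions for illustration only.

```python
class ShiftRegister:
    """Cycle-level model of an alpha-position shift register of a given width."""

    def __init__(self, alpha, width):
        self.alpha = alpha
        self.width = width
        self.regs = [0] * alpha  # regs[i] models the register driving W_i

    def clock(self, serialize, u=0, shift_in=0):
        """Advance one clock cycle and return W_{alpha-1} after the tick."""
        mask = (1 << self.width) - 1
        if serialize:
            # Parallel load: slice i of U is stored in register i.
            self.regs = [(u >> (i * self.width)) & mask for i in range(self.alpha)]
        else:
            # Shift toward register alpha - 1; register 0 takes shift_in.
            self.regs = [shift_in & mask] + self.regs[:-1]
        return self.regs[-1]


# Serialize an 8-bit signal into four 2-bit slices (alpha = 4, z = 8).
sr = ShiftRegister(alpha=4, width=2)
slices = [sr.clock(serialize=1, u=0b11100100)]
slices += [sr.clock(serialize=0) for _ in range(3)]
print(slices)  # [3, 2, 1, 0]

# Shift four 2-bit slices in serially and read them out in parallel.
sr2 = ShiftRegister(alpha=4, width=2)
for value in (3, 2, 1, 0):
    sr2.clock(serialize=0, shift_in=value)
print(sr2.regs)  # [0, 1, 2, 3]
```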
3.2.6 Modulo-α Counter
For any α ≥ 1, a modulo-α (or mod-α) counter [6] (Figure 3.11) increments its output by
‘1’ every clock cycle, returning to ‘0’ after a count of α − 1.
FIGURE 3.11: A modulo-α counter block diagram
Let 2^{d−1} < α ≤ 2^d. We first
construct a mod-2^d counter with synchronous reset (see Figure 3.12). Then, we use this to
construct a mod-α counter. Let W = W(d − 1)W(d − 2) . . . W(1)W(0) be a d-bit signal.
Let k (where 0 ≤ k < d) be the smallest index such that W(k) = 0; if W has no 0s, let
k = d − 1. Incrementing W amounts to complementing bits W(k), W(k − 1), . . . , W(0).
That is,
\[ (W + 1) \bmod 2^d = W(d-1)\,W(d-2) \cdots W(k+1)\,\overline{W(k)}\,\overline{W(k-1)} \cdots \overline{W(1)}\,\overline{W(0)}. \]
E   R   old W          new W
1   0   2^d − 1        0
1   0   W < 2^d − 1    W + 1
1   1   -              0
0   -   -              W
FIGURE 3.12: A mod-2^d counter (inputs clock, enable E, and reset R; output count W) with truth table
Let V(i) = 1 if and only if i ≤ k; then the new value of W is W ⊕ V, where W is the old
value and ⊕ denotes a bitwise Exclusive OR.
Observe that V(0) = 1 and for all 0 < i < d, V(i) = V(i − 1) AND W(i − 1). Solving
this recurrence we have
\[ V(i) = \bigwedge_{j=0}^{i-1} W(j) \quad \text{for } 0 < i < d. \]
Factoring in the reset and enable lines we now have
\[ W_{\text{new}} = \begin{cases} W, & \text{if } E = 0 \\ 0, & \text{if } R = 1,\ E = 1 \\ W \oplus V, & \text{if } R = 0,\ E = 1. \end{cases} \]
FIGURE 3.13: Circuit for bit i of a synchronous counter
Figure 3.13 shows the logic needed to compute the new value of W(i). The combinational
delay (between each clock tick) of a mod-2^d counter is
\[ \max_{i=0}^{d-1} \Bigl( \underbrace{O(\log i)}_{\text{fan-in}} + \underbrace{O(\log d)}_{\text{fan-out}} \Bigr) = O(\log d). \]
The delay to fan out E and R to all d flip-flops is factored into the fan-out. The gate cost
is
\[ O\Bigl( \sum_{i=0}^{d-1} (i + d) \Bigr) = O(d^2). \]
This subsumes the O(d) cost of fanning out E and R to all d
flip-flops. Therefore we have the following result.
Lemma 3.6 A mod-2^d counter can be realized as a circuit with a gate cost of O(d^2) and a
delay of O(log d).
To construct a mod-α counter from such a structure requires resetting the counter when
the value of W is α − 1. This is accomplished by adding an α − 1 detection unit that
determines if the output of the counter is α − 1 and, if so, asserts the reset input in time
for the next clock tick (Figure 3.14). This computation can be performed by an AND gate
with true and complementary inputs corresponding to the value of α − 1. For example, if
α − 1 = 5 = 101 (a mod-6 counter), then the AND gate would complement the second
least-significant input bit coming from the output of the counter. The output of the AND
gate would be ‘1’ if and only if the input to the AND gate was ‘101’. As shown in
Figure 3.14, this would assert the reset input to the counter and set the counter to ‘0’ after
the clock tick.
FIGURE 3.14: A modulo-α counter implementation using a synchronous counter and a mask
computation
From Lemma 3.6, with d = ⌈log α⌉, we have the following result.
Lemma 3.7 A mod-α counter can be implemented as a circuit with gate cost of O(log^2 α)
and a delay of O(log log α).
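The counter construction (the mask V, the update W ⊕ V, and the α − 1 detection unit) can be traced with a short simulation. This is a behavioral sketch under the stated recurrence V(0) = 1, V(i) = V(i − 1) AND W(i − 1); the function names next_count, detect, and mod_alpha_sequence are hypothetical.

```python
def next_count(w, d, enable=1, reset=0):
    """One clock tick of the mod-2**d counter with enable E and reset R."""
    if not enable:
        return w
    if reset:
        return 0
    v = 0
    carry = 1  # V(0) = 1
    for i in range(d):
        v |= carry << i
        carry &= (w >> i) & 1  # V(i + 1) = V(i) AND W(i)
    return w ^ v  # new W = W xor V


def detect(w, alpha, d):
    """alpha - 1 detection unit: AND of true/complemented counter outputs."""
    target = alpha - 1
    return int(all(((w >> i) & 1) == ((target >> i) & 1) for i in range(d)))


def mod_alpha_sequence(alpha, ticks):
    """Counter outputs over the given number of clock ticks, starting at 0."""
    d = max(1, (alpha - 1).bit_length())  # smallest d >= 1 with alpha <= 2**d
    w, seq = 0, []
    for _ in range(ticks):
        seq.append(w)
        w = next_count(w, d, enable=1, reset=detect(w, alpha, d))
    return seq


print(mod_alpha_sequence(6, 8))  # [0, 1, 2, 3, 4, 5, 0, 1]
```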
3.3 Configurable Decoders
A configurable decoder has the same basic functionality as the general decoder described
in Section 3.1.1. An x to n configurable decoder accepts an x-bit input and outputs up
to 2^x n-bit outputs. As mentioned in Section 3.1.1, unlike fixed decoders the output of a
configurable decoder (the set S′) is not fixed at manufacture. With reconfiguration, the n-bit
outputs can be changed to a different pattern of bits, thus supplying a degree of flexibility
not present in fixed decoders.
The simplest implementation of an x to n configurable decoder is a 2^x × n LUT. As noted
in Section 3.2.4, a 2^x × n LUT takes in an x-bit input and outputs up to 2^x n-bit words,
where the n-bit words are determined by the contents of its memory array. Unfortunately,
this “pure LUT-based” configurable decoder is expensive. By Lemma 3.4, the gate cost of
this LUT is O(2^x (x + n)). If this decoder were implemented on the same scale as a log n to
n 1-hot decoder, then x = log n. This results in a decoder that, while able to produce any n
of the 2^n subsets of Z_n, has a gate cost of Θ(n^2). On the other hand, if the pure LUT-based
configurable decoder were restricted to the same asymptotic gate cost as the 1-hot decoder
(that is, Θ(n log n)), it would only be able to produce Θ(log n) subsets of Z_n (being at most
a log n × n LUT). Although the flexibility of the pure LUT-based configurable decoder is
desirable, its cost does not scale well and an alternative is needed.
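The two cost figures quoted for the pure LUT-based decoder follow by substituting into the bound of Lemma 3.4 with m = n output bits. A short worked version, assuming base-2 logarithms and writing only the upper bounds:

```latex
\begin{align*}
x = \log n:\quad
  & O\bigl(2^{x}(x+n)\bigr) = O\bigl(n(\log n + n)\bigr) = O(n^{2}),\\
2^{x} = \log n:\quad
  & O\bigl(2^{x}(x+n)\bigr) = O\bigl(\log n\,(\log\log n + n)\bigr) = O(n\log n).
\end{align*}
```

In the first case the LUT stores n words, and in the second only log n words, which is the trade-off between flexibility and cost described above.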
FIGURE 3.15: A configurable decoder block diagram
Our solution, which will be explained in depth in subsequent chapters, is a configurable
decoder that uses a LUT with a smaller order of cost, combined with a special type of decoder
called a ‘Mapping Unit’ (Figure 3.15). The mapping units we consider have the same order
of cost as the LUT. This allows the LUT cost to be kept as small as a fixed decoder while
allowing a large number of n-bit subsets to be produced within the same order of gate cost as
fixed decoders. Chapters 4 and 5 will further explain the capabilities of the mapping units,
while Chapter 6 will explore the capabilities of our configurable decoder.
Chapter 4
The Mapping Unit: Theory
Recall that the main problem we address is that of producing subsets of an n-set Z_n. As we
showed in Section 3.3, a pure LUT-based configurable decoder with log n input bits is capable
of producing up to n of the 2^n different subsets of Z_n, but its Θ(n^2) cost does not scale well.
Thus, we seek to create a configurable decoder (as shown in Figure 3.15) that, while still
using a LUT to achieve a degree of flexibility, does so with a smaller cost. We introduce in
this chapter a module called the mapping unit (Figure 4.1) that serves to convert the output
of an inexpensive LUT to the form representative of a subset of an n-set.
FIGURE 4.1: A mapping unit decoder block diagram. MU(z, y, n, α) takes the z-bit input U
and the y-bit input B and produces the n-bit output Q.
This chapter introduces the functionality of the mapping unit and derives some bounds
on its capabilities. In Section 4.1 we provide a general view of the mapping unit, including
a functional description of its operation (Section 4.1.1) and an explanation of the mapping
of the z-bit inputs to the n-bit outputs (Section 4.1.2). In Section 4.2, we explore the
bounds on the capabilities of the mapping unit, namely, the number of independent subsets
producible by the mapping unit (Section 4.2.1) and the total number of subsets it can produce
(Section 4.2.2). Later in Chapter 5 we will describe realizations of the mapping unit.
4.1 A General View of the Mapping Unit
As previously noted, in the larger context of the configurable decoder the mapping unit serves
to convert the output of a 2^x × z LUT to an n-bit output representing a subset of an n-set. The
mapping unit can be viewed as a type of decoder, as it takes in a relatively small number of
bits (z bits) and expands them to a larger number of bits (n bits), where z < n. The mapping
unit accomplishes this expansion by “multicasting” the z bits to n places. As an example,
consider a multicast of four bits a(3)a(2)a(1)a(0) to 8 bits b(7)b(6)b(5)b(4)b(3)b(2)b(1)b(0),
such that b(0) = a(0), b(1) = b(3) = b(5) = b(7) = a(3), b(2) = b(6) = a(2) and b(4) = a(1).
If a = 0111, then b = 01010101 (Figure 4.2(a)). On the other hand, if a = 0011, then
b = 00010001 (Figure 4.2(b)). If we change the mapping of a to b, then again, different
outputs can be obtained. For example, if b(0) = a(0), b(1) = a(1), b(2) = b(3) = a(2) and
b(4) = b(5) = b(6) = b(7) = a(3), then for a = 0111 (resp., 0011), b = 00001111 (resp.,
00000011) (see Figures 4.2(c) and (d)).
FIGURE 4.2: Multicasts of 4 bits to 8 bits, for two different multicast schemes each with
two different values.
We now characterize the multicasts described above in terms of “ordered partitions.”
Recall from Section 3.1.2 that a k-partition π, for any 1 ≤ k ≤ n, of an n-set S is a division
of S into k disjoint nonempty subsets, S_0, S_1, . . . , S_{k−1}. For any 1 ≤ k ≤ n, an ordered
k-partition \vec{π} of an n-set S is a k-partition {S_0, S_1, . . . , S_{k−1}} of S with an order (from 0 to
k − 1) imposed on the blocks. We denote this ordered partition by \vec{π} = ⟨S_0, S_1, . . . , S_{k−1}⟩.
In this notation, ⟨S_0, S_1⟩ ≠ ⟨S_1, S_0⟩.
Consider a multicast of bits a(z − 1), a(z − 2), . . . , a(1), a(0) to bits b(n − 1), b(n −
2), . . . , b(1), b(0). An ordered z-partition ⟨S_0, S_1, . . . , S_{z−1}⟩ of Z_n = {0, 1, . . . , n − 1}
represents this multicast if and only if for each 0 ≤ i < z, for all bit positions j ∈ S_i, bit b(j)
gets its value from a(i).
For example, the multicasts of Figure 4.2(a),(b) and (c),(d) correspond to the ordered
4-partitions \vec{π}_1 = ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩ and \vec{π}_2 = ⟨{7, 6, 5, 4}, {3, 2}, {1}, {0}⟩,
respectively.
4.1.1 Functional Description of the Mapping Unit
Consider a mapping unit that expands a z-bit signal U to the n-bit signal Q (see Figure 4.1);
note that the input B is explained later, while the parameter α is dependent on the
implementation of the mapping unit and is explained in Chapter 5. As noted earlier, a multicast
can be represented as an ordered partition \vec{π} of Z_n = {0, 1, . . . , n − 1}. Therefore, \vec{π} and
an instance u ∈ U of the z-bit input to the mapping unit uniquely specify an n-bit output
q ∈ Q.
For example, the ordered partition \vec{π}_1 = ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩ and u_1 = 0111
of Figure 4.2(a) produce output q_{1,1} = 01010101. If u_1 is replaced by u_2 = 0011, then
the output is q_{1,2} = 00010001 (see Figure 4.2(b)). Similarly, if the ordered partition is
\vec{π}_2 = ⟨{7, 6, 5, 4}, {3, 2}, {1}, {0}⟩, then the outputs corresponding to u_1 and u_2 are q_{2,1} =
00001111 (Figure 4.2(c)) and q_{2,2} = 00000011 (Figure 4.2(d)).
In general, the mapping unit uses several ordered partitions \vec{π} ∈ Y. The y-bit input B
of Figure 4.1 selects one of these ordered partitions; clearly |Y| ≤ 2^y. Since input (set) B of
y-bit strings may be thought to be in one-to-one correspondence with Y, we can describe
the mapping unit MU(z, y, n, α), shown in Figure 4.1, by the following function µ.
µ : Z_{2^z} × Z_{2^y} → Z_{2^n}
Since “sets” U, B, Q are sets of z-bit, y-bit, and n-bit strings, respectively, we can also write
µ : U × B → Q.
In summary, MU(z, y, n, α) accepts as input a z-bit string (the source string) and an
ordered partition \vec{π} (one among 2^y). It produces as output an n-bit string (subset of Z_n).
The source string could assume any value from {0, 1}^z. The set of 2^y ordered partitions is
generally fixed (usually hardwired in the mapping unit or configured into a LUT internal to
the mapping unit).
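The function µ can be illustrated with a small behavioral model. The sketch assumes the convention used in the examples of Figure 4.2, namely that the characters of the source string, read left to right, drive the blocks of the ordered partition in the order listed; the names mapping_unit, PARTITIONS, and mu are hypothetical.

```python
def mapping_unit(u, ordered_partition, n):
    """Multicast source string u onto n output bits.

    u                 -- string over '0' and '1', one character per block
    ordered_partition -- list of blocks (sets of bit positions) covering Z_n
    Returns the output as the string b(n-1) ... b(0).
    """
    if len(u) != len(ordered_partition):
        raise ValueError("need one source bit per block")
    q = [None] * n
    for bit, block in zip(u, ordered_partition):
        for j in block:
            q[j] = bit  # every position in the block receives the same bit
    if None in q:
        raise ValueError("blocks must cover all positions of Z_n")
    return "".join(q[j] for j in range(n - 1, -1, -1))


# The input B selects one of the ordered partitions held by the mapping unit.
PARTITIONS = [
    [{7, 5, 3, 1}, {6, 2}, {4}, {0}],
    [{7, 6, 5, 4}, {3, 2}, {1}, {0}],
]


def mu(u, b):
    return mapping_unit(u, PARTITIONS[b], 8)


print(mu("0111", 0), mu("0011", 0))  # 01010101 00010001
print(mu("0111", 1), mu("0011", 1))  # 00001111 00000011
```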
4.1.2 Constructing Ordered Partitions for a Mapping Unit
Let S be a set of subsets of Z_n that we wish a mapping unit to generate. This section details
a procedure for constructing a set of partitions that (along with a set of source string values)
generates all elements of S. (In the process, we may generate a set S′ ⊇ S of subsets.)
Before we proceed, a few definitions are needed.
A subset S ⊆ Z_n induces a 1- or 2-partition π_S = {S, Z_n − S}. If S = ∅ or S = Z_n, then
π_S is the 1-partition {Z_n}; otherwise, π_S is a 2-partition. Clearly, the subset that induces a
given partition is not unique, as π_S = π_{Z_n−S} = {S, Z_n − S}. When S is represented by its n-bit
characteristic string, the induced partition π_S places bit positions with the same value in the
same block of π_S.
Let S = {S_0, S_1, . . . , S_{k−1}} be a set of subsets of Z_n. For 0 ≤ i < k, let subset S_i induce
partition π_i. Define the partition induced by S to be π_S = π_0π_1 . . . π_{k−1}; the product of
partitions is defined in Section 3.1.2.
We now illustrate these ideas with an example.
Example 4.1 Consider the sets of subsets S^0, S^1, and S^2 of Z_8 shown in Table 4.1, where
S^i = {S^i_j : 0 ≤ j < 4}, for 0 ≤ i < 3. Sets S^0 and S^1 represent two types of reduction, and
S^2 is a set of “arbitrary” subsets of Z_8.
TABLE 4.1: Sets of subsets of Z_8 for Example 4.1.
S^i_j    S^0         S^1         S^2
S^i_0    11111111    11111111    10100010
S^i_1    01010101    00001111    11111101
S^i_2    00010001    00000011    01011010
S^i_3    00000001    00000001    00000111
TABLE 4.2: Partition π_{i,j} for subsets S^i_j of Table 4.1
S^i_j    π_{0,j}                    π_{1,j}                    π_{2,j}
S^i_0    {{7,6,5,4,3,2,1,0}}        {{7,6,5,4,3,2,1,0}}        {{6,4,3,2,0},{7,5,1}}
S^i_1    {{7,5,3,1},{6,4,2,0}}      {{7,6,5,4},{3,2,1,0}}      {{1},{7,6,5,4,3,2,0}}
S^i_2    {{7,6,5,3,2,1},{4,0}}      {{7,6,5,4,3,2},{1,0}}      {{7,5,2,0},{6,4,3,1}}
S^i_3    {{7,6,5,4,3,2,1},{0}}      {{7,6,5,4,3,2,1},{0}}      {{7,6,5,4,3},{2,1,0}}
For 0 ≤ i < 3 and 0 ≤ j < 4, let π_{i,j} be the partition induced by subset S^i_j. Table 4.2
shows π_{i,j}. Let set S^i induce partition π_i = π_{i,0}π_{i,1}π_{i,2}π_{i,3}. Then, we have
π_0 = {{7, 5, 3, 1}, {6, 2}, {4}, {0}}
π_1 = {{7, 6, 5, 4}, {3, 2}, {1}, {0}}
π_2 = {{7, 5}, {6, 4, 3}, {1}, {2, 0}}.
We now come back to a procedure that uses the given set S to generate a set of ordered
partitions and a source string value for a mapping unit to generate S.
1. Number the elements of S in some order so that S = {S_0, S_1, . . . , S_{k−1}}.
2. For each S_i ∈ S, compute its induced partition π_{S_i}.
3. Starting from π_{S_0}, pick the largest integer ℓ such that π_{S_0}π_{S_1} . . . π_{S_{ℓ−1}} has ≤ z blocks.
Let π_0 = π_{S_0}π_{S_1} . . . π_{S_{ℓ−1}}.
4. Starting from π_{S_ℓ}, pick the largest integer m such that π_{S_ℓ}π_{S_{ℓ+1}} . . . π_{S_{ℓ+m−1}} has ≤ z
blocks. Let π_1 = π_{S_ℓ}π_{S_{ℓ+1}} . . . π_{S_{ℓ+m−1}}.
5. Repeat this procedure until all induced partitions π_{S_i} have been included in some π_j.
6. Convert each π_j to an ordered partition \vec{π}_j using some arbitrary ordering of its blocks.
The ordered partitions \vec{π}_0, \vec{π}_1, . . . are the ones needed in the mapping unit.
We illustrate this procedure with the following example.
Example 4.2 Let S = S^0 ∪ S^1 ∪ S^2 of Example 4.1, and let z = 4. Then, S = {S^0_0, S^0_1, S^0_2,
S^0_3, S^1_1, S^1_2, S^2_0, S^2_1, S^2_2, S^2_3} (S^1_0 and S^1_3 repeat S^0_0 and S^0_3 and are not listed again).
The induced partitions corresponding to each S^i_j are in Table 4.2. Let
the order enumerated above be the order in which we consider the partitions. Then using
the above procedure, the partitions π_0, π_1, and π_2 are constructed as shown below.
π_{0,0}π_{0,1} = {{7, 5, 3, 1}, {6, 4, 2, 0}}
π_{0,0}π_{0,1}π_{0,2} = {{7, 5, 3, 1}, {6, 2}, {4, 0}}
π_{0,0}π_{0,1}π_{0,2}π_{0,3} = {{7, 5, 3, 1}, {6, 2}, {4}, {0}} = π_0
π_{1,1}π_{1,2} = {{7, 6, 5, 4}, {3, 2}, {1, 0}} = π_1
π_{2,0}π_{2,1} = {{7, 5}, {1}, {6, 4, 3, 2, 0}}
π_{2,0}π_{2,1}π_{2,2} = {{7, 5}, {2, 0}, {6, 4, 3}, {1}}
π_{2,0}π_{2,1}π_{2,2}π_{2,3} = {{7, 5}, {2, 0}, {6, 4, 3}, {1}} = π_2
Order the partitions as \vec{π}_0 = ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩, \vec{π}_1 = ⟨{7, 6, 5, 4}, {3, 2},
{1, 0}⟩, \vec{π}_2 = ⟨{7, 5}, {2, 0}, {6, 4, 3}, {1}⟩. The mapping unit uses these ordered partitions
with the values of the source strings shown in Table 4.3 to generate each subset in S.
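Steps 2 through 5 of the procedure can be sketched in Python. Instead of forming products pairwise, the sketch groups bit positions by the column of values they take across the subsets, which yields the same blocks as the product of the induced partitions. The function names induced_product and build_partitions are hypothetical, and the sketch assumes z ≥ 2 so that every single induced partition fits.

```python
def induced_product(subsets, n):
    """Product of the partitions induced by the given subsets.

    Each subset is an n-character bit string b(n-1) ... b(0). Two positions
    fall in the same block exactly when they agree in every subset.
    """
    groups = {}
    for j in range(n):
        signature = tuple(s[n - 1 - j] for s in subsets)
        groups.setdefault(signature, set()).add(j)
    return list(groups.values())


def build_partitions(subsets, n, z):
    """Greedy grouping: extend each product while it has at most z blocks."""
    partitions, current = [], []
    for s in subsets:
        if len(induced_product(current + [s], n)) <= z:
            current.append(s)
        else:
            partitions.append(induced_product(current, n))
            current = [s]
    if current:
        partitions.append(induced_product(current, n))
    return partitions


# Subsets of Example 4.2 in the order enumerated there (n = 8, z = 4).
S = ["11111111", "01010101", "00010001", "00000001", "00001111",
     "00000011", "10100010", "11111101", "01011010", "00000111"]
for p in build_partitions(S, 8, 4):
    print(sorted((sorted(b, reverse=True) for b in p), reverse=True))
```

Calling build_partitions on the reversed list gives a different set of partitions, which illustrates that the outcome depends on the order in which the induced partitions are considered.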
We note some interesting points from Examples 4.1 and 4.2. In general, we
call the arbitrarily chosen set S the set of independent subsets and denote |S| by λ.
The set of all subsets that can be generated by the ordered partitions and the source strings
is the total set of subsets, denoted by S′, with |S′| = Λ.
TABLE 4.3: Mapping unit values used to produce the sets in Example 4.1
S^i_j    u ∈ U    \vec{π}_k                              q ∈ Q
S^0_0    1111     ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩      11111111
S^0_1    0111     ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩      01010101
S^0_2    0011     ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩      00010001
S^0_3    0001     ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩      00000001
S^1_0    1111     ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩      11111111
S^1_1    d011     ⟨{7, 6, 5, 4}, {3, 2}, {1, 0}⟩        00001111
S^1_2    d001     ⟨{7, 6, 5, 4}, {3, 2}, {1, 0}⟩        00000011
S^1_3    0001     ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩      00000001
S^2_0    1001     ⟨{7, 5}, {2, 0}, {6, 4, 3}, {1}⟩      10100010
S^2_1    1110     ⟨{7, 5}, {2, 0}, {6, 4, 3}, {1}⟩      11111101
S^2_2    0011     ⟨{7, 5}, {2, 0}, {6, 4, 3}, {1}⟩      01011010
S^2_3    0101     ⟨{7, 5}, {2, 0}, {6, 4, 3}, {1}⟩      00000111
d indicates a don’t care value.
• A subset can be generated in a variety of ways, as the same z-bit source string applied
to different ordered partitions can result in the same value. For example, the subset
S^0_0 = 11111111 could be produced from any partition \vec{π}_k with the source string 1111.
In addition, two different source strings applied to two differently ordered partitions
can result in the same value. For example, consider two orderings of partition π_0:
\vec{π}^1_0 = ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩ and \vec{π}^2_0 = ⟨{0}, {4}, {7, 5, 3, 1}, {6, 2}⟩. Then the source
string 0111 with \vec{π}^1_0 and the source string 1101 with \vec{π}^2_0 will both produce the same
subset, 01010101.
• A subset not in S can be produced. For example, using the z-string 1010 with the
ordered partition \vec{π}_0 produces the subset 10111010.
• Subsets and their induced partitions may be repeated. For example, subsets S^0_3 and
S^1_3 of Example 4.1 are equal. While the procedure ignores repeated subsets and their
induced partitions in generating ordered partitions, partitions corresponding to classes
of algorithms or specific applications may benefit from repeating subsets.
• A partition with fewer than z blocks, such as \vec{π}_1, results in “don’t care” values (d) for
the bits not corresponding to any block in the partition. Thus, the subset S^1_1 with
source string d011 may be produced from the z-string 0011 or 1011.
• In the procedure, a different sequence of considering the induced partitions π_{i,j} can
produce a different set or number of ordered partitions. For example, if the
induced partitions were considered in reverse order, that is, starting with π_{2,3}, then
π_{2,2}, etc., such that π_0 = π_{2,3}π_{2,2}π_{2,1} . . ., the set of partitions would result in \vec{π}_0 =
⟨{7, 5}, {2, 0}, {6, 4, 3}, {1}⟩, \vec{π}_1 = ⟨{7, 6, 5, 4}, {3, 2}, {1}, {0}⟩, and \vec{π}_2 = ⟨{7, 5, 3, 1},
{6, 2}, {4, 0}⟩.
• The conversion of an unordered partition to an ordered partition can be done in as many
as z! ways. Some of these may be more advantageous than others. An ordering that results
in common source strings used to produce the subsets of S^i and S^k (corresponding to
different ordered partitions) can be useful when the mapping unit is used as part of
a larger design. This is because the same z-bit source strings can be used to produce
both S^i and S^k. Table 4.4 demonstrates two ordered partitions for each of S^0 and S^1,
resulting in two sets of source strings for each set. Note that two of the sets of source strings,
one for S^0 and one for S^1, are the same.
4.2 Number of Subsets Producible by MU(z, y, n, α)
In the procedure of Section 4.1.2, it is not clear how many ordered partitions are produced,
except that it is at most |S|. In this section we answer some natural questions that arise in
this context. For this discussion, assume a mapping unit MU (z,y,n,α) and an independent
set S ′ of subsets of Zn.
Question 1: If the 2y ordered partitions ofMU (z,y,n,α) have not been fixed, how large can
the independent set S be?
Question 2: If all 2y ordered partitions of MU (z,y,n,α) have been fixed, how large can the
independent set S be?
TABLE 4.4: Two different orderings for the partitions of sets S0 and S1 in Example 4.1,
resulting in different sets of source strings used to produce the subsets in each set.
Sij   ~π                                  z-bit value needed   Q
S00   〈{7, 5, 3, 1}, {6, 2}, {4}, {0}〉     1111                 11111111
S01                                       0111                 01010101
S02                                       0011                 00010001
S03                                       0001                 00000001
S00   〈{4}, {6, 2}, {7, 5, 3, 1}, {0}〉     1111                 11111111
S01                                       1101                 01010101
S02                                       1001                 00010001
S03                                       0001                 00000001
S10   〈{7, 6, 5, 4}, {3, 2}, {1}, {0}〉     1111                 11111111
S11                                       0111                 00001111
S12                                       0011                 00000011
S13                                       0001                 00000001
S10   〈{3, 2}, {0}, {7, 6, 5, 4}, {1}〉     1111                 11111111
S11                                       1101                 00001111
S12                                       0101                 00000011
S13                                       0100                 00000001
Question 3: If the 2^y ordered partitions of MU (z,y,n,α) have not been fixed, how large can
the total set S ′ be?
We now address these questions in this section.
4.2.1 Number of Independent Subsets
We first consider the case where the ordered partitions have not been fixed.
Lemma 4.1 For any k ≥ 1, let {S0, S1, . . . , Sk−1} be a set of subsets of Zn. For each
0 ≤ i < k, let Si induce a partition πi. Then π0π1 . . . πk−1 has at most 2^k blocks.
Proof: Each πi has at most 2 blocks. Each product divides an existing block into at most
two “sub-blocks”. Therefore, over k − 1 products we have at most 2 · 2^(k−1) = 2^k blocks.
Remark: A more formal proof can be constructed by induction on k.
TABLE 4.5: A set of log z subsets of Z16 for which the product of the induced partitions
has z = 8 blocks.
Si                 πi
0101010101010101   {{15, 13, 11, 9, 7, 5, 3, 1}, {14, 12, 10, 8, 6, 4, 2, 0}}
0011001100110011   {{15, 14, 11, 10, 7, 6, 3, 2}, {13, 12, 9, 8, 5, 4, 1, 0}}
0000111100001111   {{15, 14, 13, 12, 7, 6, 5, 4}, {11, 10, 9, 8, 3, 2, 1, 0}}
Table 4.5 illustrates a set S of subsets, |S| = 3, whose ordered partition meets the upper
bound on the number of blocks given by Lemma 4.1. As shown below, the ordered partition
~π resulting from π0π1π2 has z = 8 blocks.
π0 = {{15, 13, 11, 9, 7, 5, 3, 1}, {14, 12, 10, 8, 6, 4, 2, 0}}
π0π1 = {{15, 11, 7, 3}, {14, 10, 6, 2}, {13, 9, 5, 1}, {12, 8, 4, 0}}
~π = π0π1π2 = {{15, 7}, {14, 6}, {13, 5}, {12, 4}, {11, 3}, {10, 2}, {9, 1}, {8, 0}}
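The partition product of Lemma 4.1 can be reproduced mechanically. The sketch below is illustrative only: the function names induced and product are assumptions, and partitions are modelled as Python sets of frozensets, with the product taken as the set of non-empty pairwise block intersections.

```python
def induced(s):
    """Partition {members, non-members} induced by a subset given as a bit string."""
    n = len(s)
    ones = frozenset(k for k in range(n) if s[n - 1 - k] == "1")
    zeros = frozenset(range(n)) - ones
    return {b for b in (ones, zeros) if b}

def product(p, q):
    """Partition product: all non-empty pairwise intersections of blocks."""
    return {a & b for a in p for b in q if a & b}

# The three subsets of Z16 listed in Table 4.5.
subsets = ["0101010101010101", "0011001100110011", "0000111100001111"]
result = induced(subsets[0])
for s in subsets[1:]:
    result = product(result, induced(s))

for block in sorted(result, key=max, reverse=True):
    print(sorted(block, reverse=True))
```

With k = 3 subsets, the bound 2^k = 8 of Lemma 4.1 is met when every intersection is non-empty, which is the case for these three strings.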
Theorem 4.1 Let S be an independent set of MU (z,y,n,α). Let 2^y ⌊log z⌋ ≤ 2^z. If the
partitions of MU (z,y,n,α) have not been fixed, then |S| = λ ≥ 2^y ⌊log z⌋.
Proof: By Lemma 4.1, a collection of ⌊log z⌋ subsets induces a partition with at most
2^⌊log z⌋ ≤ z blocks. Thus, as many as 2^y ⌊log z⌋ arbitrarily selected subsets can be included in
S, using 2^y partitions, each with ≤ z blocks. Also, since 2^y ⌊log z⌋ ≤ 2^z, there is no constraint
on whether an appropriate source string is available for generation of a given subset.
Now we address Question 2, namely, the number of independent subsets that can be
generated if the partitions are fixed.
Theorem 4.2 Let S ′ be an independent set of subsets of Zn. For any z, y, n such that
z + y ≤ n, and for a mapping unit MU(z,y,n,α) with fixed partitions, |S| = λ = 0.
Proof: Since z + y ≤ n, 2^z 2^y < 2^n. Therefore there is at least one subset belonging to
℘ (Zn) that cannot be generated from the 2^z possible source strings and 2^y partitions that
are inputs to the mapping unit.
Note that the number of independent subsets of Zn produced by a mapping unit does
not include what is possible under reconfiguration; however, the above theorem establishes
the usefulness of mapping units with configurable partitions, explored in Chapter 5. This
leads us to the following definition.
Definition 4.1 A mapping unit MU(z,y,n,α) is universal if and only if it can, under
reconfiguration, produce any set of 2^y log z arbitrarily selected subsets of Zn.
4.2.2 Total Number of Subsets
We now address the question of how many subsets (not necessarily independent) can be
generated by MU (z,y,n,α) (using partitions with ≤ z blocks) for any source string u and
any ordered partition ~pii. In general, one could construct a mapping unit that produces the
same subset, regardless of the input. So instead of addressing the question of the minimum
number of sets that MU (z,y,n,α) can produce, we derive a lower bound on the maximum
number of distinct subsets MU (z,y,n,α) can produce.
Recall from Section 4.1.1 that the output of MU (z,y,n,α) is given by the function µ(u, ~π).
The following lemma describes the output of the mapping unit for any two source strings
and a single ordered partition.
Lemma 4.2 For any ordered z-partition ~π and any pair of distinct source strings u1, u2, the
outputs satisfy µ(u1, ~π) ≠ µ(u2, ~π).
Proof: Since u1 ≠ u2, there exists an i (0 ≤ i < z) such that bit u1(i) ≠ u2(i). Since ~π is a
z-partition, every bit of the source string is used by ~π. If ~π = {B0, B1, . . . , Bz−1} then all
output bits in Bi receive the value u1(i) in µ(u1, ~π), which differs from the value u2(i) they
receive in µ(u2, ~π).
We now extend this idea to a set of Y ordered partitions, where Y ≤ 2^(⌈n/(z−1)⌉−1). Since the
quantity ⌈n/(z−1)⌉ − 1 will be used extensively in this section, we let χ = ⌈n/(z−1)⌉ − 1.
Divide the n-bits of MU (z,y,n,α) into χ + 1 buckets of at most (z − 1) contiguous bits.
If any bucket has fewer than z − 1 bits, then make that the rightmost bucket. Specifically,
for 1 ≤ i ≤ χ, bucket Bi contains indices αi to βi, where
αi = n− (χ− i+ 1)(z − 1) and βi = n− (χ− i)(z − 1)− 1.
Bucket B0 (the rightmost bucket) ranges from bit α0 = 0 to bit β0 = n − χ(z − 1) − 1.
FIGURE 4.3: Division of an n-bit quantity into χ+1 buckets of at most (z − 1) contiguous
bits
Figure 4.3 illustrates this. Thus, for 1 ≤ i ≤ χ, Bi has βi − αi + 1 = z − 1 indices and B0 has
β0 − α0 + 1 = β0 + 1 = n − χ(z − 1) = n − (⌈n/(z−1)⌉ − 1)(z − 1) = n + (z − 1) − ⌈n/(z−1)⌉(z − 1)
indices.
We now specify ~π by assigning each bit of each bucket Bi to a bit of a source string u. Let
m = m(χ−1)m(χ−2) . . . m(1)m(0) be a χ-bit number. Writing m as a (χ + 1) = ⌈n/(z−1)⌉-bit
number we have
m = m(χ)m(χ − 1) . . . m(1)m(0),
where m(χ) = 0. The above binary representation of m induces an ordered partition ~π as
follows.
Recall that for any 0 ≤ i ≤ χ, the bits of bucket Bi are βi, βi − 1, . . . , αi + 1, αi. If
m(i) = 0, then multicast u(z − 1) (the most significant bit of the source string u) to all bits
βi, βi − 1, . . . , αi + 1, αi of Bi. If m(i) = 1, then assign u(z − 2) to βi, u(z − 3) to βi − 1,
u(z − 4) to βi − 2 and so on; if i = 0 and B0 has fewer than z − 1 bits, the last few bits
of u are not used. Figure 4.4 shows the manner in which source string bits are assigned to
bucket indices. It should be clear that two χ-bit numbers m, m′ will induce two different
ordered partitions ~π, ~π′ if and only if m ≠ m′ (as the bit where m and m′ differ will cause
a different multicast in the two cases). Since we have 2^χ distinct values for m, 2^χ distinct
ordered partitions may be created as described above.
FIGURE 4.4: Assignment of source string bits to bucket indices
Lemma 4.3 Let ~π1, ~π2 be any two ordered partitions created from χ-bit integers m1, m2
(as described above). Then for any (not necessarily distinct) source strings u1, u2, where
0 < u1, u2 < 2^z − 1, we have µ(u1, ~π1) ≠ µ(u2, ~π2).
Proof: Since ~π1 ≠ ~π2, we have m1 ≠ m2. Let m1(i) ≠ m2(i) for some 0 ≤ i < χ. Without
loss of generality, let m1(i) = 0 and m2(i) = 1. Figure 4.5 shows how u is mapped to bucket
Bi of ordered partitions ~π1 and ~π2. We now consider two cases.
Case 1: There is some u2(ℓ) (where 0 ≤ ℓ < z−1) that is different from u1(z−1). Without
loss of generality, let u1(z − 1) = 0 and u2(ℓ) = 1. Then bucket Bi of µ(u1, ~π1) has all
0’s whereas the bucket of µ(u2, ~π2) has a 1 in the position corresponding to u2(ℓ).
Case 2: u1(z − 1) = u2(z − 2) = u2(z − 3) = . . . = u2(1) = u2(0) ≠ u2(z − 1). Without
loss of generality, let u1(z − 1) = 0 and u2(z − 1) = 1. Consider bucket Bχ. Since
m1(χ) = m2(χ) = 0, bucket Bχ of µ(u1, ~π1) has all 0’s whereas the same bucket for
µ(u2, ~π2) has all 1’s.
FIGURE 4.5: Mapping of a source string to bucket Bi under two different ordered partitions
~π1, ~π2
In either case, µ(u1, ~π1) ≠ µ(u2, ~π2).
We now put Lemmas 4.2 and 4.3 together to derive the main result for Question (3)
(from page 46).
Theorem 4.3 For integers n, z ≥ 2 and χ = ⌈n/(z−1)⌉ − 1, there exists a mapping unit that
accepts C values from the set {u : 0 < u < 2^z − 1} as source strings and one of Y ≤ 2^χ
ordered partitions that produces CY distinct subsets.
Proof: Construct Y partitions as shown earlier from a set of χ-bit numbers. Consider
source string(s) u1, u2 and ordered partitions ~π1, ~π2. If u1 = u2 and ~π1 = ~π2 then clearly
µ(u1, ~π1) = µ(u2, ~π2).
If u1 ≠ u2 then by Lemmas 4.2 and 4.3 µ(u1, ~π1) ≠ µ(u2, ~π2). If u1 = u2, but ~π1 ≠ ~π2,
then again by Lemma 4.3 µ(u1, ~π1) ≠ µ(u2, ~π2).
Thus, under the conditions laid out in the theorem, if the ordered pairs 〈u1, ~π1〉 and
〈u2, ~π2〉 (or inputs to the mapping unit) are distinct, then so are the outputs of the mapping
unit. So the number of distinct outputs equals the number of distinct inputs, which is CY .
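The bucket construction can be exercised on a small instance. The following sketch is an illustration under stated assumptions, not the thesis's implementation: the names bucket_sources and mu are hypothetical, n is chosen as a multiple of z − 1 so that bucket B0 is full, and only non-zero values of m are used so that every source bit drives at least one output bit (for m = 0 every bucket receives only u(z − 1), so the resulting partition has a single block and is not a z-partition in the sense of Lemma 4.2).

```python
def bucket_sources(m, n, z):
    """src[k] = index j of the source bit u(j) that drives output bit k."""
    chi = -(-n // (z - 1)) - 1              # ceil(n / (z - 1)) - 1
    src = [0] * n
    for i in range(chi + 1):
        if i == 0:
            lo, hi = 0, n - chi * (z - 1) - 1
        else:
            lo = n - (chi - i + 1) * (z - 1)
            hi = n - (chi - i) * (z - 1) - 1
        m_i = (m >> i) & 1 if i < chi else 0    # m(chi) is fixed at 0
        for k in range(lo, hi + 1):
            # m(i) = 0: multicast u(z-1); m(i) = 1: u(z-2) to hi, u(z-3) to hi-1, ...
            src[k] = z - 1 if m_i == 0 else (z - 2) - (hi - k)
    return src

def mu(u, src):
    """Output of the mapping unit as an integer; bit k is copied from u(src[k])."""
    q = 0
    for k, j in enumerate(src):
        q |= ((u >> j) & 1) << k
    return q

n, z = 9, 4                                 # three full buckets of z - 1 = 3 bits
chi = -(-n // (z - 1)) - 1
sources = range(1, 2**z - 1)                # 0 < u < 2^z - 1
patterns = range(1, 2**chi)                 # non-zero m only (assumption of this sketch)
outputs = {mu(u, bucket_sources(m, n, z)) for m in patterns for u in sources}
print(len(outputs), len(sources) * len(patterns))
```

The two printed numbers are the count of distinct outputs and the count of distinct 〈u, m〉 inputs; under the restrictions above they coincide, as the argument of Lemmas 4.2 and 4.3 predicts.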
Remark: In general, C can be as large as 2^z − 2 and Y can be as large as 2^y provided
y < ⌈n/(z − 1)⌉. So in this case, 2^y(2^z − 2) subsets can be produced.
The above theorem shows the existence of a set of Y ordered partitions for which a large
number of subsets can be produced. Actually, this is a “class” of sets of Y ordered partitions.
Clearly we need not set m(χ) to 0; any bit of m can be fixed at 0 or 1. Additionally,
the buckets need not be as stated, as any fixed permutation of the n bits into buckets of
“equal size” would be equivalent. The fixing of one bucket (Bχ in our construction) was
needed to avoid partitions based on integers m and m′, where the binary representations
of m and m′ are complements of each other. Including both m, m′ will make the proof of
Lemma 4.3 incomplete. However, the same effect of avoiding “complementary” partitions
can be obtained by restricting source strings to be non-complementary. Other more fine-
tuned observations can be made for specific cases. Thus, while the set of CY subsets that
can be created as described is somewhat more restricted than those in Theorem 4.1 (where
the subsets are independent), the restriction is not nearly as severe as Theorem 4.3 seems to
imply.
In the next chapter, realizations of the mapping unit are presented.
Chapter 5
The Mapping Unit: Realizations
In the previous chapter, a mapping unit MU (z,y,n,α) (Figure 5.1) was described as a decoder
that accepts as input a source string of z-bits (given by a u ∈ U) and an ordered partition
~pi of an n-set with at most z blocks (selected by a b ∈ B). Using the operation µ (described
in Section 4.1.1) the mapping unit MU (z,y,n,α) produces an n-bit string. In this chapter,
we present several realizations of the mapping unit and detail their operation.
FIGURE 5.1: Block diagram of a mapping unit MU (z,y,n,α), with z-bit input U, y-bit input B,
and n-bit output Q
We first provide a classification of the mapping units in this chapter (see Figure 5.2). A
mapping unit MU (z,y,n,α) can be integral (by default) or bit-slice. An integral mapping
unit generates all n output bits simultaneously and (for reasons explained below) has α = 1.
A bit-slice mapping unit, on the other hand, generates the n output bits in α rounds; i.e.,
n/α bits at a time. One could view the integral mapping unit as a bit-slice mapping unit
with α = 1. The default for a mapping unit is the integral attribute. Another way to
categorize mapping units (both integral and bit-slice) is in terms of whether they are fixed
or reconfigurable (that is, based on whether they can be configured off-line to alter their
behavior). Reconfigurable mapping units can be general (default) or universal. In informal
terms, a universal mapping unit can produce any subset. It was established in Theorem 4.2
that fixed mapping units cannot be universal. Later in this chapter we show that there
exists a universal reconfigurable mapping unit. However, it is not known whether or not all
reconfigurable mapping units are universal. The “general” attribute should be interpreted
as “not known if universal.”
FIGURE 5.2: Classification of mapping unit realizations (integral or bit-slice; each fixed or
reconfigurable; reconfigurable units general or universal)
We begin our discussion of mapping unit realizations with the simplest class, fixed map-
ping units, explored in Section 5.1. We then describe a more flexible class, reconfigurable
mapping units, in Section 5.2. Finally we conclude this chapter with bit-slice mapping units
that use some of the results of Sections 5.1 and 5.2. The various mapping unit implementa-
tions will be used in Chapter 6 in the construction of a configurable decoder.
5.1 Fixed Mapping Units
The basic strategy of the fixed mapping unit (FMU) is to hardwire connections according
to each ordered partition (multicast), superimpose these connections through a set of
multiplexers, and use the y-bit signal b to select the multiplexer output. For example, let
z = 4, y = 1, and n = 8. Then there are 2^y = 2 ordered partitions mapping the 4 source
string bits to the 8 output bits. Let the mappings be as shown in Figure 4.2(a),(b) and (c),(d)
(page 40), which produce the sets of subsets S0 and S1 from Table 4.1 (see Example 4.1,
page 43). The resulting FMU is shown in Figure 5.3. Notice that if input signal B = 0, then
U(0) is connected to Q(0), U(1) is connected to Q(4), U(2) to Q(2) and Q(6), and U(3) to
Q(1), Q(3), Q(5), and Q(7). This matches the connections shown in Figure 4.2(a) and (b).
Similarly, one can verify that when B = 1, the resulting connections match those of
Figure 4.2(c) and (d).
FIGURE 5.3: A fixed mapping unit MU (4,2,8,1) that produces S0 and S1 in Table 4.1
The general structure of an FMU is shown in Figure 5.4. How signal U is fanned out
to the various multiplexers depends on the 2^y ordered partitions used. In general, each
multiplexer receives 2^y bits, so the z bits of U are collectively fanned out to n2^y places. We
begin the derivation of the cost of a fixed mapping unit with the following theorem.
FIGURE 5.4: General structure of a fixed mapping unit; signals B0, B1, . . . , Bn−1 are
discussed later
Theorem 5.1 A fixed mapping unit MU (z,y,n,α) can be realized as a circuit with a gate
cost of O(ny2^y) and a delay of O(y + log n).
Proof: The cost of the FMU is the summation of the costs of its internal building blocks.
From Figure 5.4, the building blocks consist of n multiplexers and the fan-out of the signals
U and B. By Lemma 3.3, the n multiplexers, each with 2^y inputs, can be realized as circuits
with an overall gate cost of O(ny2^y) and a delay of O(y).
The fan-out of signal B has degree n and width y. By Lemma 3.1, it has a gate cost of
O(ny) and a delay of O(log n). As observed earlier, the z-bit signal U is fanned
out to n2^y multiplexer inputs. If bit i (0 ≤ i < z) of U is fanned out to ni places, then its
delay is log ni = O(log n) and its cost is O(ni). The total delay is O(y + log n) and the total
cost is O(n0 + n1 + . . . + nz−1) = O(n2^y).
The overall delay and cost of the FMU is thus O(y + log n + y + log n) = O(y + log n)
and O(ny2^y + ny + n2^y) = O(ny2^y).
In general, there is no relationship between the values z and y. Figure 5.3 illustrates a
case where z > 2^y. Figure 5.5 illustrates a case where z = 2^y. Note that if z = 2^y, then
the number of inputs of each MUX is z (as shown in Figure 5.5), implying a gate cost of
O(z log z) for each of the n multiplexers.
As an example of these fixed mapping units, consider the sets shown in Table 5.1, where
sets S0, S1, and S3 are the sets of subsets from Example 4.1, while set S2 is a set of subsets
whose ordered partitions satisfy the constraints imposed by the construction for Theorem 4.3.
We have used an intelligent ordering (see Table 4.4, page 47) of the partitions of S0 and S1,
as a result of which the z-bit source strings of U required to produce the subsets of S0 and
S1 are the same. This reduces the number of rows needed to store the values in a LUT
preceding the mapping unit (see Figure 3.15 and Chapter 6) in the configurable decoder.
Note that since S2 contains three blocks in its partition, the most significant bit of the z-bit
source strings that produce the subsets of S2 has a “don’t care” value d.
Figure 5.3 illustrates an implementation of the FMU that can produce the sets S0 and
S1 of Table 5.1, as 2^y = 2. The FMU of Figure 5.5 can produce all subsets in Table 5.1,
as z = 2^y = 4. Note that in each implementation, the first input to a MUX corresponds to
the ordered partition to produce S0, the second input to a MUX corresponds to the ordered
partition to produce S1, etc. Thus, to produce S32 in the FMU, input signal U would have a
value of 0101 and input signal B would have a value of 11.
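The behaviour of the FMU can be mimicked in software. The sketch below is a behavioural model only, under assumptions made for illustration: the names PARTITIONS, wiring, and fmu are hypothetical, the four ordered partitions are those listed in Table 5.1, the unused most significant block of S2 is represented by an empty set, and a don't-care bit is passed as 0.

```python
N, Z = 8, 4

# Ordered partitions of Table 5.1; the leftmost block is driven by U(3).
PARTITIONS = [
    [{7, 5, 3, 1}, {6, 2}, {4}, {0}],      # B = 00, produces S0
    [{7, 6, 5, 4}, {3, 2}, {1}, {0}],      # B = 01, produces S1
    [set(), {7, 6, 1, 0}, {4, 2}, {5, 3}], # B = 10, produces S2 (U(3) unused)
    [{7, 5}, {6, 4, 3}, {2, 0}, {1}],      # B = 11, produces S3
]

def wiring(blocks):
    """src[k] = index of the source bit hardwired to one input of MUX k."""
    src = [None] * N
    for pos, block in enumerate(blocks):
        for k in block:
            src[k] = Z - 1 - pos
    return src

# MUX_INPUTS[b][k]: source bit seen by MUX k on its input number b.
MUX_INPUTS = [wiring(p) for p in PARTITIONS]

def fmu(u, b):
    """u: source string, MSB first; b: select value shared by all n MUXs."""
    bits = u[::-1]                          # bits[j] == U(j)
    return "".join(bits[MUX_INPUTS[b][k]] for k in range(N - 1, -1, -1))

print(fmu("0101", 0b11))                    # the S32 example from the text
```

Because a single select value b drives every multiplexer, the model can only reproduce the four hardwired partitions, which is the limitation the fixed unit is named for.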
FIGURE 5.5: A fixed mapping unit MU (4,4,8,1) that produces all subsets in Table 5.1
TABLE 5.1: Sets of n-subsets (n = 8, z = 4) used for fixed mapping units in Figures 5.3
and 5.5
Sij   q ∈ Q         ~πi ∈ Y                             u ∈ U
S00   11111111      〈{7, 5, 3, 1}, {6, 2}, {4}, {0}〉     1111
S01   01010101                                          0111
S02   00010001                                          0011
S03   00000001                                          0001
S10   11111111      〈{7, 6, 5, 4}, {3, 2}, {1}, {0}〉     1111
S11   00001111                                          0111
S12   00000011                                          0011
S13   00000001                                          0001
S20   00 01 01 00   〈{7, 6, 1, 0}, {4, 2}, {5, 3}〉       d010
S21   00 10 10 00                                       d001
S22   00 11 11 00                                       d011
S23   11 00 00 11                                       d100
S24   11 01 01 11                                       d110
S25   11 10 10 11                                       d101
S30   10100010      〈{7, 5}, {6, 4, 3}, {2, 0}, {1}〉     1001
S31   11111101                                          1110
S32   01011010                                          0101
S33   01011101                                          0110
d indicates a don’t care value.
5.2 Reconfigurable Mapping Units
By Theorem 4.2 (page 48), when the ordered partitions of a mapping unit are fixed, certain
subsets cannot be produced. Here, we seek to provide a means to change the ordered
partitions off-line in a “reconfigurable mapping unit.”
A reconfigurable mapping unit (RMU) (Figure 5.6) allows the set Y of ordered partitions
to be changed off-line. While Y may not be totally arbitrary, a degree of flexibility is allowed
that is not seen in the fixed mapping units of Section 5.1.
FIGURE 5.6: A reconfigurable mapping unit MU (z,y,n,α)
The flexibility of the RMU comes from a 2^y × ny LUT (that is, a LUT with 2^y rows
and a word size of ny) called a “configuration LUT.” The output of the configuration LUT
generates the FMU signal shown as B0, B1, . . . , Bn−1 in Figure 5.4. The main advantage of
the RMU is that it can control the signals B0, B1, . . . , Bn−1 at will. In contrast, Bi = B, for
each 0 ≤ i < n in the FMU. We first derive the cost and delay of an RMU.
Theorem 5.2 A reconfigurable mapping unit MU (z,y,n,α) can be realized as a circuit with
a gate cost of O(ny2^y) and a delay of O(y + log n).
Proof: By Theorem 5.1, the FMU has a gate cost of O(ny2^y) and a delay of O(y + log n).
This gate cost would be unchanged even if the fan-out of B is ignored. By Lemma 3.4, a
2^y × ny LUT has a gate cost of O(2^y(y + ny)) = O(ny2^y) and a delay of O(y + log (ny)) =
O(y + log n). The overall gate cost of the reconfigurable mapping unit is thus O(ny2^y) while
the overall delay is O(y + log n).
As an example of the functionality of an RMU, consider the FMU with z = 2^y of Figure 5.5,
which implemented all four sets of subsets in Table 5.1. If an RMU were used to
implement the same sets of subsets using the same wiring of the signal U to the n multiplexers,
then Table 5.2 shows the contents of the configuration LUT of this RMU. Note that the LUT
contents specify an ordered partition corresponding to a set of subsets, and not the subset
itself. For example, when b = 00 the LUT word is 00 00 00 00 00 00 00 00 corresponding to
the ordered partition ~π0 for set S0 (see Tables 5.1 and 5.2). Then with u = 0111, we have
µ(u, ~π0) = 01010101. Similarly, with u = 0011, we have µ(u, ~π0) = 00010001. Thus, in this
illustration b = 00 corresponds only to the ordered partition ~π0 for S0.
TABLE 5.2: Configuration LUT words to produce the subsets from Table 5.1
Address b ∈ B   n log z-bit word in LUT    Set Si
00              00 00 00 00 00 00 00 00    S0
01              01 01 01 01 01 01 01 01    S1
10              10 10 10 10 10 10 10 10    S2
11              11 11 11 11 11 11 11 11    S3
There are two important properties of the reconfigurable mapping unit that can be seen
from this example. The first is that from a perspective outside of the mapping unit, nothing
changes between a fixed mapping unit and a reconfigurable mapping unit; that is, to produce
a desired subset Sij, the same values are needed for signals U and B in a reconfigurable
mapping unit as in a fixed mapping unit. The second is that each “grouping” of
log z bits (each corresponding to a particular MUX) in the n log z-bit words has the same
value in an FMU; this does not have to be the case in an RMU. For example, a word in the
LUT illustrated in Table 5.2 could have the value 00 01 10 11 00 01 10 11; this would imply
that bit 7 of the 8-bit output would be derived from ~π0, bit 6 would be derived from ~π1, etc.
Using the ordered partitions presented in Table 5.1, a word in the LUT with the value 00
01 10 11 00 01 10 11 would result in the partition ~π = 〈{7, 6, 3}, {4, 2, 1}, {0}, {5}〉 (bit 1 is
derived from ~π2, where it lies in the block {7, 6, 1, 0} driven by u(2)). Not all
sets of subsets can be generated by the RMU however, as fixing the multicasts of the bits of
U to the n MUXs may preclude certain subset considerations.
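The effect of a mixed configuration word can be traced with a few lines of code. This is a sketch under assumptions: the names PATTERNS, source_of, and effective_partition are hypothetical, the hardwired patterns are taken from Table 5.1 (with an empty block for the unused bit of S2), and the leftmost group of a word is taken to control MUX 7, as in the text.

```python
N, Z = 8, 4

# Hardwired ordered partitions (Table 5.1); leftmost block is driven by u(3).
PATTERNS = [
    [{7, 5, 3, 1}, {6, 2}, {4}, {0}],
    [{7, 6, 5, 4}, {3, 2}, {1}, {0}],
    [set(), {7, 6, 1, 0}, {4, 2}, {5, 3}],
    [{7, 5}, {6, 4, 3}, {2, 0}, {1}],
]

def source_of(pattern, k):
    """Index j of the source bit u(j) that drives output bit k under a pattern."""
    for pos, block in enumerate(pattern):
        if k in block:
            return Z - 1 - pos
    raise ValueError(f"output bit {k} is not covered by the pattern")

def effective_partition(word):
    """word[0] is the control of MUX N-1, word[-1] the control of MUX 0."""
    blocks = [set() for _ in range(Z)]
    for pos, select in enumerate(word):
        k = N - 1 - pos
        j = source_of(PATTERNS[select], k)
        blocks[Z - 1 - j].add(k)
    return blocks

# A uniform word reproduces a hardwired pattern; a mixed word combines them.
print(effective_partition([1] * 8))
print(effective_partition([0, 1, 2, 3, 0, 1, 2, 3]))
```

The first call corresponds to the word 01 01 01 01 01 01 01 01 of Table 5.2; the second corresponds to the mixed word 00 01 10 11 00 01 10 11 discussed in the text.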
A Universal Reconfigurable Mapping Unit: One particular case of the RMU requires
further elaboration. When z = 2y, we may broadcast U to all multiplexers; that is, with
suitable reconfiguration of the configuration LUT, each of the n-bits of the output signal Q
can be mapped to any of the bits of the source string signal U . This RMU is a universal
mapping unit (see Definition 4.1 on page 49).
Theorem 5.3 A universal reconfigurable mapping unit MU (2^y,y,n,α) can be realized as a
circuit with gate cost O(ny2^y), a delay of O(y + log n), and with suitable reconfiguration of
its configuration LUT, can produce any set S ∈ ℘ (Zn) of λ = y2^y independent subsets of
Zn.
Proof: The cost of the mapping unit is given by Theorem 5.2. Since all source string bits
U(i) are hardwired to all output places (that is, the n MUXs corresponding to the n-bit
output Q), every output Q(j) can be set to any input bit U(i) by a y-bit grouping in the
ny-bit word of the configuration LUT. This implies that every output bit can be placed in
any block in an ordered partition ~πk. Since up to 2^y arbitrary partitions can be represented
in this way, by Theorem 4.1, a total of λ = y2^y = 2^y log z independent subsets can be
produced by a single set of values in the configuration LUT.
Remark: As noted in Theorem 4.2, a fixed mapping unit that has its partitions hardwired
cannot produce any independent subsets. However, since any partition can be realized in
the universal reconfigurable mapping unit through reconfiguration of its LUT, then any set
of independent subsets can be realized in a single instance of the values in its LUT. As noted
in Theorem 4.1, 2^y ⌊log z⌋ is the best possible number of independent subsets.
Reconfiguration of a Reconfigurable Mapping Unit: While it is clear that the
universal reconfigurable mapping unit can represent any set of partitions through reconfig-
uration, it is not clear if this is true for any reconfigurable mapping unit. We now address
the question of which sets of subsets an RMU can generate. As we have not been able to
construct all aspects of a proof, we present some of our observations as a conjecture. Before
we proceed, we pin down some terms.
Recall that an RMU hardwires bits of a source string u to the MUX inputs. For 0 ≤
i < 2^y, let the ith ordered partition pattern be the ordered partition resulting from setting all
MUX controls to i. Denote the ith ordered partition pattern by ~σi. An RMU has 2^y fixed
ordered partition patterns (as does the FMU). Unlike the FMU, however, the RMU can
address each MUX individually, thereby using parts of different ordered partition patterns
simultaneously (as demonstrated previously). Nevertheless, the existence of these hardwired
patterns imposes certain restrictions on the type of sets of subsets that can be produced by
the RMU.
As an example, consider the partitions hardwired in the mapping unit according to
Figure 5.5. If the contents of the configuration LUT are as specified by Table 5.2, then the
partition patterns for the RMU are given in Table 5.3 (which are the same as if the mapping
unit was a fixed mapping unit). However, as noted earlier, a word in the LUT with the value
00 01 10 11 00 01 10 11 would result in the partition ~π = 〈{7, 6, 3}, {4, 2, 1}, {0}, {5}〉.
TABLE 5.3: Ordered partition patterns for an RMU resulting from the configuration LUT
words of Table 5.2 and the hardwiring shown in Figure 5.5.
Address b ∈ B = i   n log z-bit word in LUT    Ordered partition pattern ~σi
00                  00 00 00 00 00 00 00 00    〈{7, 5, 3, 1}, {6, 2}, {4}, {0}〉
01                  01 01 01 01 01 01 01 01    〈{7, 6, 5, 4}, {3, 2}, {1}, {0}〉
10                  10 10 10 10 10 10 10 10    〈{7, 6, 1, 0}, {4, 2}, {5, 3}〉
11                  11 11 11 11 11 11 11 11    〈{7, 5}, {6, 4, 3}, {2, 0}, {1}〉
Let {~σi : 0 ≤ i < 2^y} be the set of ordered partition patterns of RMU MU (z,y,n,α). For
each 0 ≤ i < 2^y, let ~σi = 〈T^i_j : 0 ≤ j < z〉. For example, consider the configuration LUT
word 01 01 01 01 01 01 01 01 corresponding to i = 1. Then the resulting ordered partition
pattern ~σ1 has blocks T^1_0 = {0}, T^1_1 = {1}, T^1_2 = {3, 2}, and T^1_3 = {7, 6, 5, 4}. Likewise, if the
configuration LUT word corresponding to i = 2 were 00 01 10 11 00 01 10 11, then the ordered
partition pattern ~σ2 has blocks T^2_0 = {5}, T^2_1 = {0}, T^2_2 = {4, 2, 1}, and T^2_3 = {7, 6, 3}. Note
that if some bit u(k) of the source string U is not used in ~σi, then T^i_k = ∅ (for example,
the bit denoted by d in Table 5.1).
For each 0 ≤ j < z, define set
Mj = T^0_j ∪ T^1_j ∪ . . . ∪ T^(2^y−1)_j. (5.1)
Set Mj is the set of all bit positions of the output to which source string bit u(j) can
contribute.
As an illustration of the construction of the sets Mj , consider the ordered partition
patterns in Table 5.3. Taking the union of T^i_j over 0 ≤ i < 2^y for each j gives
M0 = {0} ∪ {0} ∪ {5, 3} ∪ {1} = {5, 3, 1, 0}
M1 = {4} ∪ {1} ∪ {4, 2} ∪ {2, 0} = {4, 2, 1, 0}
M2 = {6, 2} ∪ {3, 2} ∪ {7, 6, 1, 0} ∪ {6, 4, 3} = {7, 6, 4, 3, 2, 1, 0}
M3 = {7, 5, 3, 1} ∪ {7, 6, 5, 4} ∪ ∅ ∪ {7, 5} = {7, 6, 5, 4, 3, 1}.
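The sets Mj of Equation 5.1 follow directly from the hardwired patterns. The sketch below assumes the list-of-sets representation used for illustration elsewhere (leftmost block driven by u(z − 1), an empty set for an unused source bit); the name reach_sets is hypothetical.

```python
def reach_sets(patterns, z):
    """M[j] = union over all patterns of the block driven by source bit u(j)."""
    M = [set() for _ in range(z)]
    for blocks in patterns:
        for pos, block in enumerate(blocks):
            M[z - 1 - pos] |= block
    return M

# Ordered partition patterns of Table 5.3; leftmost block is driven by u(3).
PATTERNS = [
    [{7, 5, 3, 1}, {6, 2}, {4}, {0}],
    [{7, 6, 5, 4}, {3, 2}, {1}, {0}],
    [set(), {7, 6, 1, 0}, {4, 2}, {5, 3}],
    [{7, 5}, {6, 4, 3}, {2, 0}, {1}],
]

M = reach_sets(PATTERNS, 4)
for j, Mj in enumerate(M):
    print(f"M{j} =", sorted(Mj, reverse=True))
```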
Recall that the configuration LUT has ny-bit words, each consisting of n y-bit controls,
one per MUX. We now correlate LUT values with Mj . Let k ∈ Mj . This implies that
source string bit u(j) goes to MUX k. Let this bit go to input α(ℓ, k) of MUX k (it may
go to multiple inputs of MUX k). Then if a LUT word has the y-bit control value α(ℓ, k)
corresponding to MUX k, then using this word guarantees that output bit q(k) gets its value
from source string bit u(j) according to the hardwired partition ~πℓ.
Let S be a set of subsets of Zn and let S induce the partition piS = {B0, B1, . . . , Bz−1};
see Section 4.1.2 for the definition of an induced partition. We assume that S has z′ ≤ z
blocks.
Theorem 5.4 Consider a mapping unit MU(z,y,n,α) with set {~σi : 0 ≤ i < 2y} of ordered
partition patterns and sets Mj defined by Equation 5.1. A set S with unordered z-partition
piS of subsets of Zn can be realized on the mapping unit if there exists an injection f :
{0, 1, . . . , z − 1} → {Mj : 0 ≤ j < z} such that each Bi ∈ piS satisfies Bi ⊆Mf(i).
Proof: For each Bi ∈ πS , all elements of Bi are either all present or all absent in any subset in
S. Therefore all bit positions represented as elements belonging to Bi must have the same
value, and must come from a single source string bit in U (as the partition πS already has the
maximum number of allowed blocks z). Assume that there exists an Mj such that Bi ⊆ Mj .
Recall that for each element a ∈ Mj , there exists a partition that specifies a connection from
U(j) to Q(a), that is, MUX a. Thus, if Bi ⊆ Mj , there exist connections (specified by
ordered partitions hardwired in the mapping unit) that connect j to all elements k ∈ Bi,
that is, the outputs Q(k). This implies that if for all blocks Bi, there exists an order such
that each Bi ⊆ Mf(i), that is, each Bi is a subset of a different set Mf(i), then there is a
hardwired connection in the mapping unit from input bit U(f(i)) to all elements in Bi, for all i.
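The sufficient condition of Theorem 5.4 can be tested by exhaustive search for small z. The following sketch is illustrative: the function name find_injection is an assumption, the sets Mj are those computed from Table 5.3, and the search simply tries every assignment of distinct indices to blocks. A result of None only means the condition of the theorem is not met; whether such a set is unrealizable is the subject of the conjecture, not something this check establishes.

```python
from itertools import permutations

def find_injection(blocks, M):
    """Return f with blocks[i] <= M[f[i]] and all f[i] distinct, or None."""
    for f in permutations(range(len(M)), len(blocks)):
        if all(block <= M[j] for block, j in zip(blocks, f)):
            return f
    return None

# Sets M0..M3 obtained from the ordered partition patterns of Table 5.3.
M = [{5, 3, 1, 0}, {4, 2, 1, 0}, {7, 6, 4, 3, 2, 1, 0}, {7, 6, 5, 4, 3, 1}]

# The partition of S0 satisfies the condition of the theorem.
print(find_injection([{7, 5, 3, 1}, {6, 2}, {4}, {0}], M))

# A hypothetical partition containing the block {5, 2}, which lies in no Mj.
print(find_injection([{5, 2}, {7, 6}, {4, 3}, {1, 0}], M))
```

A brute-force search over permutations is adequate here because z is small; for larger z the same question is a bipartite matching problem between blocks and sets Mj.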
Conjecture 5.1 We also conjecture that the converse is true. If there exists a block Bi of
z-partition piS that is not a subset of any Mj, then all elements belonging to Bi must be
derived from the same source string bit of U . However, the fact that there is no set Mj
containing all elements of Bi implies that there is no hardwired connection in the mapping
unit (for all ordered partitions hardwired in the mapping unit) that maps a single input bit
to all elements of Bi. Thus, the set of subsets S with the z-partition piS cannot be produced.
Remark: Note that if the partition does not have z blocks, the conjecture assuredly does
not hold true; for example, a single subset has a 1- or 2-partition (which is not necessarily a
subset of any set Mj) but is producible under many partitions. Essentially, this is because
although all bits in a block of a partition with fewer than z blocks must be the same, more
than one source string bit can map the values of a single block in such a case. The above
formulation depends on the assumption that all elements in a block Bi must come from a single
source bit. Because of this, it is difficult to characterize any set of subsets with a partition
consisting of fewer than z blocks with the above formulation.
5.3 Bit-Slice Mapping Units
In this section, we consider a bit-slice mapping unit MU (z,y,n,α), that is, a mapping unit
with α > 1 but with α polylogarithmically bounded in n. A bit-slice mapping unit generates
just part of the output subset (represented by an n-bit string) at a time. It constructs a
subset over α iterations, generating n/α bits in each iteration. This allows the mapping unit to
exploit repeated patterns, such as those demonstrated in Table 5.4, representing two forms
of reduction. Notice that to generate 8 strings, each 16-bits, only 6 strings, each 4-bits, need
to be generated. For example, the subset S = 0001000100010001 can be constructed over
4 iterations using the bit pattern 0001. Overall, this allows the bit-slice mapping unit to
decrease the required gate cost of its internal components in situations where an increased
delay is tolerable.
TABLE 5.4: Subsets with repeated patterns for n = 16, α = 4
Subset S           Repeated Patterns
1111111111111111   1111
0001000100010001   0001
0000000100000001   0000, 0001
0000000000000001   0000, 0001
0000000011111111   0000, 1111
0000000000001111   0000, 1111
0000000000000011   0000, 0011
0000000000000001   0000, 0001
67
A possible implementation of $MU(z,y,n,\alpha)$ is shown in Figure 5.7. A shift register acts as
a parallel-to-serial converter: it stores the $z$-bit source string and outputs $z/\alpha$ bits every
cycle to the internal mapping unit $MU(z/\alpha, y, n/\alpha, 1)$. The $n/\alpha$-bit output of the internal
mapping unit is stored in another shift register, which assembles the $\alpha$ strings of $n/\alpha$ bits
into one $n$-bit string. A mod-$\alpha$ counter orchestrates this conversion by triggering a write-in
operation on the input shift register and a write-out operation on the output shift register
every $\alpha$ cycles. This allows a new source string to be input into the bit-slice mapping unit,
and an $n$-bit output $q$ to be written out, every $\alpha$ cycles.

FIGURE 5.7: Bit-slice mapping unit implementation. [Inputs U (z bits), B (y bits), en, and clk;
an input shift register with write-in; the internal mapping unit MU(z/α, y, n/α, 1); an output
shift register with write-out; a mod-α counter; and the n-bit output Q.]
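The data flow of this implementation can be sketched in a few lines of Python. The sketch assumes that the mapping operation copies the i-th source bit to every bit position in the i-th block of the ordered partition (the convention that the tables of Chapter 6 follow), that the first slice shifted out is the most significant one, and that α divides both z and n. The value z = 8 in the usage note is illustrative only, and the function names are not from the thesis.

```python
def mu(u, partition, n):
    """Copy bit i of source string u to every position in block i.

    Positions are numbered n-1 (leftmost character) down to 0 (rightmost).
    """
    if len(u) != len(partition):
        raise ValueError("need exactly one source bit per block")
    out = ["0"] * n
    for bit, block in zip(u, partition):
        for pos in block:
            out[n - 1 - pos] = bit
    return "".join(out)


def bit_slice_mu(source, partition, n, alpha):
    """Model alpha clock cycles of the bit-slice unit for one source string."""
    z = len(source)
    if z % alpha or n % alpha:
        raise ValueError("this sketch assumes alpha divides both z and n")
    zw, nw = z // alpha, n // alpha
    output_register = []
    for cycle in range(alpha):
        piece = source[cycle * zw:(cycle + 1) * zw]        # z/alpha bits leave the input register
        output_register.append(mu(piece, partition, nw))   # internal MU(z/alpha, y, n/alpha, 1)
    return "".join(output_register)                        # written out after alpha cycles
```

For example, `bit_slice_mu("01010101", [{3, 2, 1}, {0}], 16, 4)` yields 0001000100010001, the subset built from four copies of the pattern 0001.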
Because the bit-slice mapping unit is a sequential circuit, we modify the definition of
delay from Section 3.1.1. For sequential circuits, we take the clock delay of the
circuit to be the longer of (a) the longest path between any flip-flop output and any flip-flop
input and (b) the longest path between any circuit input and output. Using this notion of
delay, we have the following result.
Theorem 5.5 A bit-slice mapping unit $MU(z,y,n,\alpha)$, where $z \neq 2^y$, can be realized as a
circuit with a gate cost of $O\left(\log^2\alpha + n\left(1 + \frac{y2^y}{\alpha}\right)\right)$ and a delay of
$O(\alpha(\log\log\alpha + \log n + y))$.
Proof: The input and output shift registers have gate costs of $O(z)$ and $O(n)$, respectively,
and constant delays (Lemma 3.5). The mod-$\alpha$ counter has a gate cost of $O(\log^2\alpha)$ and a
delay of $O(\log\log\alpha)$ (Lemma 3.7). The output of the mod-$\alpha$ counter is tested for the value
$\alpha$ (with $O(\log\log\alpha)$ delay and $O(\log\alpha)$ cost) to generate the bits that trigger the shift
registers. Because this output is fanned out to all bits in both shift registers, it has a fan-out
of $O(z + n) = O(n)$ and a delay of $O(\log z + \log n) = O(\log n)$ (Lemma 3.1).
Adding the cost and delay of the internal mapping unit, the total gate cost is
$$O\left(\log^2\alpha + \log\alpha + n + \frac{ny2^y}{\alpha}\right) = O\left(\log^2\alpha + n\left(1 + \frac{y2^y}{\alpha}\right)\right)$$
and the total delay is
$$O\left(\alpha\left(\log\log\alpha + \log n + y + \log\frac{n}{\alpha}\right)\right) = O(\alpha(\log\log\alpha + \log n + y)).$$
Remark: For the number of subsets produced by a bit-slice mapping unit, note that the
allowed cost of the mapping unit is decreased by a factor of $\alpha$, and the number of source
string bits is decreased by roughly a factor of $\alpha$. Hence, the number of independent subsets
produced by the bit-slice mapping unit is $\Theta\left(\frac{2^y}{\alpha}\log\frac{z}{\alpha}\right)$.
A factor that needs attention is how partitions play out in the bit-slice
mapping unit. For example, the subsets of Table 5.4 produced by a fixed mapping unit
$MU(z,y,n,\alpha)$ with $z = 5$, $2^y = 2$ require two ordered partitions
($\vec{\pi}_1 = \langle\{15, 14, 13, 11, 10, 9, 7, 6, 5, 3, 2, 1\}, \{12, 4\}, \{8\}, \{0\}\rangle$ and
$\vec{\pi}_2 = \langle\{15, 14, 13, 12, 11, 10, 9, 8\}, \{7, 6, 5, 4\}, \{3, 2\}, \{1\}, \{0\}\rangle$)
and four 5-bit source strings (11111, 00111, 00011, 00001) to produce the $n = 16$-bit outputs.
In a bit-slice mapping unit with $\lceil z/\alpha\rceil = 2$ and $\lceil n/\alpha\rceil = 4$, only two ordered
partitions ($\vec{\pi}'_1 = \langle\{3, 2\}, \{1, 0\}\rangle$, $\vec{\pi}'_2 = \langle\{3, 2, 1\}, \{0\}\rangle$) and three 2-bit source
strings (00, 01, and 11) are needed to produce the $n/\alpha$-bit repeated patterns 0011, 0001, 0000,
and 1111. For these particular subsets of $Z_n$, the bit-slice mapping unit shows good savings.
Consider the same example, but with the additional subset 0101010101010101. For $z = 5$,
$2^y = 2$, two ordered partitions are needed,
$\vec{\pi}_1 = \langle\{15, 13, 11, 9, 7, 5, 3, 1\}, \{14, 10, 6, 2\}, \{12, 4\}, \{8\}, \{0\}\rangle$ and
$\vec{\pi}_2 = \langle\{15, 14, 13, 12, 11, 10, 9, 8\}, \{7, 6, 5, 4\}, \{3, 2\}, \{1\}, \{0\}\rangle$, along with four 5-bit
source strings (11111, 01111, 00111, 00001) to produce the 16-bit outputs. However, the bit-slice
mapping unit of this implementation now has to produce the 4-bit pattern 0101 in addition
to those previously required (in order to produce the subset 0101010101010101). Hence, a
third partition $\vec{\pi}'_3 = \langle\{3, 1\}, \{2, 0\}\rangle$ is needed to produce all the 4-bit patterns. This implies
that the number of inputs needed at each multiplexer in the bit-slice mapping unit is three.
Since $2^y$ does not change between the mapping unit implementation and the bit-slice mapping
unit implementation, this results in a gate cost decrease by a factor slightly less than $\alpha$. Thus,
in determining whether or not a bit-slice mapping unit is suitable for a design, a variety of
considerations must be taken into account.
Overall, the following theorem captures the performance of MU (z,y,n,α).
Theorem 5.6 For any $\alpha \geq 1$, a mapping unit $MU(z,y,n,\alpha)$ has the following performance
parameters:
a) delay of $O(\alpha(\log y + \log n))$,
b) gate cost of $O\left(n\left(1 + \frac{y2^y}{\alpha}\right)\right)$,
c) number of independent subsets $\lambda = \frac{2^y}{\alpha}\left\lfloor\log\frac{z}{\alpha}\right\rfloor$, and
d) total number of producible subsets $\Lambda = 2^y(2^z - 2)$, provided $y < \left\lceil\frac{n}{z-1}\right\rceil$.
This chapter has provided a general view of the mapping unit decoder, in terms of its cost
and capabilities, and illustrated several means of realizing its operation. The next chapter
incorporates this structure as part of a larger design in the configurable decoder.
Chapter 6
A Configurable Decoder
In general, a decoder is a module that maps elements of $\{0,1\}^x$ to $\{0,1\}^n$, where $x \ll n$. In
a configurable decoder, this mapping can be altered. In this thesis we consider two types of
configurable decoders: (1) pure LUT-based configurable decoders (described in Section 3.3)
and (2) mapping unit-based configurable decoders (to which this chapter is devoted). As
noted in Figure 5.2, a mapping unit comes in different forms. Likewise, a mapping unit-
based configurable decoder can be integral or bit-slice, fixed or reconfigurable and general
or universal.
As noted in Section 3.3, the simplest configurable decoder is a LUT; however, it is
expensive and does not scale well. The main idea underlying our solutions is to use a LUT
with a “narrow” output (that provides a significant amount of flexibility, considering its
low cost) and a mapping unit (Chapters 4 and 5) that expands this narrow output into a
wide n-bit output representing a subset of $Z_n$. Figure 6.1 shows a block diagram of the
configurable decoder. To put the figure in perspective, generally, $x \ll z \ll n$. So, unlike the
pure LUT-based solution, our solution expands the x-bit input in stages to construct the
n-bit output.
FIGURE 6.1: Block diagram of a configurable decoder CD(x,z,y,n,α). [The x-bit input A
addresses the LUT; the z-bit LUT output U and the y-bit input B enter MU(z,y,n,α), which
produces the n-bit output Q.]
As discussed in Chapters 4 and 5, the mapping unit $MU(z,y,n,\alpha)$ accepts as input a
z-bit string $u \in U$ and an ordered z-partition $\vec{\pi}$ (selected by a y-bit signal B). The
$MU(z,y,n,\alpha)$ then uses the operation $\mu(u, \vec{\pi})$ to produce an n-bit string representative of
a subset of $Z_n$. In this chapter we integrate $MU(z,y,n,\alpha)$ with a $2^x \times z$ LUT to create the
configurable decoder, $CD(x,z,y,n,\alpha)$ (shown in Figure 6.1).
At this point, a fair question to ask is “what does the LUT contribute?” As noted in
the previous chapters, the flexibility of the configurable decoder hinges on the LUT and the
value of z (number of independent subsets). While z larger than a polynomial in n does
not yield significant benefits, a small z (such as z = log n) severely limits the subsets that
can be generated by the mapping unit. Without the LUT, z has to be this small to address
the pin limitation problem. Thus the role of the LUT is to start from a small number of
input bits and expand it to z-bits, trading the value of z off with the number of locations in
the LUT. This provides ample room for constructing the configurable decoder to particular
specifications.
This chapter explores the properties and costs of our configurable decoder. In Section 6.1,
we illustrate the mapping units of Chapter 4 in the context of our configurable decoder. In
Section 6.2, we derive the basic parameters applicable to all our configurable decoders.
Finally, in Section 6.3, we cast these parameters in the context of a fixed gate cost and
compare the configurable decoder’s theoretical performance with that of a pure LUT-based
configurable decoder.
6.1 Illustrative Examples
Recall from Chapters 4 and 5 that the mapping unit uses a set Y of ordered partitions of
$Z_n$ to expand the z-bit source strings to an n-bit subset of $Z_n$. We begin by
providing an example of this operation in the context of a configurable decoder. The first
two examples demonstrate a configurable decoder with an integral mapping unit, where we
consider S = S0 ∪ S1, where the sets S0 and S1 are from Table 4.1, page 43. Table 6.1 shows
these sets with their unordered partitions for $n = 8$, $2^y = 2$, and $z = 4$.
Example 6.1: Using sets S0 and S1 from Table 6.1, order the partitions such that
$\vec{\pi}_0 = \langle\{0\}, \{7, 5, 3, 1\}, \{6, 2\}, \{4\}\rangle$ and $\vec{\pi}_1 = \langle\{7, 6, 5, 4\}, \{3, 2\}, \{1\}, \{0\}\rangle$. Then, for set S0,
the source strings 1111, 1011, 1001, and 1000 would be needed to produce the subsets $S^0_0$,
$S^0_1$, $S^0_2$, and $S^0_3$, respectively. Likewise, for set S1, the source strings 1111, 0111, 0011,
and 0001 would be needed to produce the subsets $S^1_0$, $S^1_1$, $S^1_2$, and $S^1_3$, respectively. This
implies that a LUT with a size of at least 7 × 4 would be needed to contain all the source
strings. Assume that we use a LUT of size 8 × 4 to store the source strings, as we want to
produce the subsets of S0 ∪ S1 in the order shown in Table 6.1 for the purpose of our algorithm
(in this case, two types of reduction). Assign the source string 1111 to the first row in the
LUT, the source string 1011 to the second row in the LUT, and so on. Table 6.2 shows the
values needed for the inputs of the configurable decoder to produce the desired subsets of S0
and S1. Here, a total of 4 bits (three bits of a and one bit of b) are needed to produce all
subsets. Note that these are not the only subsets producible by the decoder. If the source
strings for set S0 were used with the partition $\vec{\pi}_1$ and vice versa, different subsets would be
produced (see Table 6.3).

TABLE 6.1: Sets S0 and S1 with corresponding partitions

S^i_j    q ∈ Q       π_{S^i}
S^0_0    11111111    {{0}, {7, 5, 3, 1}, {6, 2}, {4}}
S^0_1    01010101
S^0_2    00010001
S^0_3    00000001
S^1_0    11111111    {{7, 6, 5, 4}, {3, 2}, {1}, {0}}
S^1_1    00001111
S^1_2    00000011
S^1_3    00000001
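Example 6.1 can be replayed with a short Python sketch. It assumes that µ copies the i-th source bit (reading the source string left to right) to every position in the i-th block of the ordered partition, a convention inferred from Tables 6.2 and 6.3; the names `mu`, `LUT`, `PARTITIONS`, and `configurable_decoder` are illustrative.

```python
def mu(u, partition, n):
    """Copy bit i of source string u to every position in block i.

    Positions are numbered n-1 (leftmost character) down to 0 (rightmost).
    """
    if len(u) != len(partition):
        raise ValueError("need exactly one source bit per block")
    out = ["0"] * n
    for bit, block in zip(u, partition):
        for pos in block:
            out[n - 1 - pos] = bit
    return "".join(out)


# 8 x 4 LUT of Example 6.1: rows 0-3 hold the source strings of S0, rows 4-7 those of S1.
LUT = ["1111", "1011", "1001", "1000", "1111", "0111", "0011", "0001"]

PARTITIONS = [
    [{0}, {7, 5, 3, 1}, {6, 2}, {4}],   # ordered partition pi_0 of Example 6.1
    [{7, 6, 5, 4}, {3, 2}, {1}, {0}],   # ordered partition pi_1 of Example 6.1
]


def configurable_decoder(a, b, n=8):
    """x-bit address a selects a LUT row; y-bit selector b selects the partition."""
    return mu(LUT[a], PARTITIONS[b], n)


if __name__ == "__main__":
    for a in range(8):
        print(format(a, "03b"), [configurable_decoder(a, b) for b in (0, 1)])
```

Rows 000 to 011 with b = 0 and rows 100 to 111 with b = 1 give the outputs of Table 6.2; the remaining address and selector combinations give the "extra" subsets of Table 6.3.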
TABLE 6.2: Input values needed for the configurable decoder to produce the subsets of S0
and S1 in Table 6.1

a ∈ A   Source string   b ∈ B   π_i                                  q ∈ Q
000     1111            0       ⟨{0}, {7, 5, 3, 1}, {6, 2}, {4}⟩     11111111
001     1011            0                                            01010101
010     1001            0                                            00010001
011     1000            0                                            00000001
100     1111            1       ⟨{7, 6, 5, 4}, {3, 2}, {1}, {0}⟩     11111111
101     0111            1                                            00001111
110     0011            1                                            00000011
111     0001            1                                            00000001

In the previous example, since all rows in the LUT were used to produce the subsets of
S0 and S1, the “extra” subsets generated by the configurable decoder were fixed. The next
example explores a “better” ordering of the partitions of S0 and S1 that provides additional
options.
Example 6.2: Again using sets S0 and S1 from Table 6.1, order the partitions such that
$\vec{\pi}_0 = \langle\{7, 5, 3, 1\}, \{6, 2\}, \{4\}, \{0\}\rangle$ and $\vec{\pi}_1 = \langle\{7, 6, 5, 4\}, \{3, 2\}, \{1\}, \{0\}\rangle$. Then, for both
sets S0 and S1, the source strings 1111, 0111, 0011, and 0001 produce the subsets: 1111
produces $S^0_0$ and $S^1_0$, 0111 produces $S^0_1$ and $S^1_1$, 0011 produces $S^0_2$ and $S^1_2$, and 0001 produces
$S^0_3$ and $S^1_3$. This implies that a LUT with a size of 4 × 4 suffices to produce all subsets.
There are two cases to consider, each with its own advantages. If a 4 × 4 LUT is
used to hold the four needed source strings, then a savings in gate cost results over the
configurable decoder in Example 6.1 (as the LUT is reduced from an 8 × 4 LUT to a 4 × 4
LUT). No “extra” subsets can be generated, however, as all combinations of source strings
in the LUT and ordered partitions in the mapping unit are needed to produce the subsets
of S0 and S1. In the second case, the size of the LUT remains 8 × 4; however, only four
rows are needed to hold the source strings for S0 and S1. Thus, four additional rows exist
in the LUT which could be used to produce any four of the subset pairs from Table 6.4.
Note that selecting source string 1010, for example, means that both subsets 10111010 (from
$\mu(1010, \vec{\pi}_0)$) and 11110010 (from $\mu(1010, \vec{\pi}_1)$) could be generated. The implication is
that the ordering of the partitions can determine not only the size of the LUT in the
configurable decoder (and thus also the values of parameters), but also the subsets that can
be produced.
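The subset pairs available to the spare LUT rows can be enumerated directly. The sketch below applies both ordered partitions of Example 6.2 to every 4-bit source string, again assuming that µ copies the i-th source bit to the i-th block; the function and constant names are illustrative.

```python
from itertools import product


def mu(u, partition, n):
    """Copy bit i of source string u to every position in block i (bit n-1 is leftmost)."""
    if len(u) != len(partition):
        raise ValueError("need exactly one source bit per block")
    out = ["0"] * n
    for bit, block in zip(u, partition):
        for pos in block:
            out[n - 1 - pos] = bit
    return "".join(out)


PI_0 = [{7, 5, 3, 1}, {6, 2}, {4}, {0}]   # ordered partition pi_0 of Example 6.2
PI_1 = [{7, 6, 5, 4}, {3, 2}, {1}, {0}]   # ordered partition pi_1 of Example 6.2


def subset_pairs(n=8, z=4):
    """Map every z-bit source string to the pair (mu(u, pi_0), mu(u, pi_1))."""
    table = {}
    for bits in product("01", repeat=z):
        u = "".join(bits)
        table[u] = (mu(u, PI_0, n), mu(u, PI_1, n))
    return table


if __name__ == "__main__":
    pairs = subset_pairs()
    for u, (q0, q1) in pairs.items():
        print(u, q0, q1)
    print("distinct subsets:", len({q for pair in pairs.values() for q in pair}))
```

The printed rows correspond to Table 6.4. Loading any four of them into the spare rows of the 8 × 4 LUT selects which additional subset pairs the decoder can produce.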
TABLE 6.3: Subsets produced by combining the source strings of S0 (resp., S1) with
partition π_1 (resp., π_0)

S^i   Source string (u)   π_i                                  µ(u, π_i)
S0    1111                ⟨{7, 6, 5, 4}, {3, 2}, {1}, {0}⟩     11111111
      1011                                                     11110011
      1001                                                     11110001
      1000                                                     11110000
S1    1111                ⟨{0}, {7, 5, 3, 1}, {6, 2}, {4}⟩     11111111
      0111                                                     11111110
      0011                                                     01010100
      0001                                                     00010000
The next example illustrates a configurable decoder with a bit-slice mapping unit.
Example 6.3: Consider the sets S0 and S1 shown in Table 6.5, where $z = 5$ and $2^y = 2$.
Note that the ordered partitions for sets S0 and S1 are
$\vec{\pi}_0 = \langle\{15, 13, 11, 9, 7, 5, 3, 1\}, \{14, 10, 6, 2\}, \{12, 4\}, \{8\}, \{0\}\rangle$ and
$\vec{\pi}_1 = \langle\{15, 14, 13, 12, 11, 10, 9, 8\}, \{7, 6, 5, 4\}, \{3, 2\}, \{1\}, \{0\}\rangle$, respectively.
Then a $CD(x,z,y,n,\alpha)$ with a fixed mapping unit would require 16 multiplexers with
2 inputs each and a 5 × 5 LUT to hold the values of the source strings (note that this is due
to the intelligent ordering; in general the LUT could be as large as 10 × 5). Assume that
$\alpha = \log n = 4$. Then in each iteration of a $CD(x,z,y,n,\alpha)$, the decoder must produce the
$n/\alpha$-bit strings from the $\lceil z/\alpha\rceil$-bit strings shown in Table 6.6.
For these $n/\alpha$-bit strings, three partitions are needed: $\vec{\pi}_{bs_0} = \langle\{3, 2\}, \{1, 0\}\rangle$,
$\vec{\pi}_{bs_1} = \langle\{3, 1\}, \{2, 0\}\rangle$, and $\vec{\pi}_{bs_2} = \langle\{3, 2, 1\}, \{0\}\rangle$. Since the original fixed mapping unit had
values of $z = 5$ and $2^y = 2$, the number of inputs to each multiplexer in the internal mapping
unit of the bit-slice mapping unit would increase by one (from 2 to 3). However, the number
of multiplexers would decrease from $n = 16$ to $n/\alpha = 4$. This would imply a reduction in cost
by a factor of $\frac{16 \times 2}{4 \times 3} \approx 2.67$.
TABLE 6.4: Possible subsets producible from µ(u_j, π_0) and µ(u_j, π_1), where
π_0 = ⟨{7, 5, 3, 1}, {6, 2}, {4}, {0}⟩ and π_1 = ⟨{7, 6, 5, 4}, {3, 2}, {1}, {0}⟩

u_j ∈ U   µ(u_j, π_0)   µ(u_j, π_1)
0000 00000000 00000000
0001 00000001 00000001
0010 00010000 00000010
0011 00010001 00000011
0100 01000100 00001100
0101 01000101 00001101
0110 01010100 00001110
0111 01010101 00001111
1000 10101010 11110000
1001 10101011 11110001
1010 10111010 11110010
1011 10111011 11110011
1100 11101110 11111100
1101 11101111 11111101
1110 11111110 11111110
1111 11111111 11111111
Regardless, the LUT must still supply a z-bit word to the bit-slice mapping unit (which
in this case may increase to a 6-bit word based on the rounding of $\lceil z/\alpha\rceil$). Thus, the
implementation depends on the allowable costs, the number of z-bit source strings and the
corresponding size of the LUT, and the subsets that must be produced.
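The bit-slice figures of Example 6.3 can be checked with a small Python sketch. It assumes the same block-to-bit convention for µ as the earlier tables and counts multiplexer cost as (number of multiplexers) × (inputs per multiplexer), which is the ratio the example uses; all names are illustrative.

```python
def mu(u, partition, n):
    """Copy bit i of source string u to every position in block i (bit n-1 is leftmost)."""
    if len(u) != len(partition):
        raise ValueError("need exactly one source bit per block")
    out = ["0"] * n
    for bit, block in zip(u, partition):
        for pos in block:
            out[n - 1 - pos] = bit
    return "".join(out)


# The three bit-slice partitions and the 2-bit source strings named in Example 6.3.
BS_PARTITIONS = [[{3, 2}, {1, 0}], [{3, 1}, {2, 0}], [{3, 2, 1}, {0}]]
SOURCES = ["00", "01", "11"]
REQUIRED = {"0000", "0001", "0011", "0101", "1111"}   # 4-bit strings listed in Table 6.6


def producible(partitions, sources, n):
    """All n-bit strings reachable from the given partitions and source strings."""
    return {mu(u, p, n) for p in partitions for u in sources}


def mux_cost_ratio(n, inputs_fixed, alpha, inputs_bit_slice):
    """Ratio of multiplexer input counts: fixed unit over bit-slice unit."""
    return (n * inputs_fixed) / ((n // alpha) * inputs_bit_slice)


if __name__ == "__main__":
    print(sorted(producible(BS_PARTITIONS, SOURCES, 4)))
    print(round(mux_cost_ratio(16, 2, 4, 3), 2))
```

With these inputs the three partitions produce exactly the five required 4-bit strings, and the cost ratio evaluates to about 2.67.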
With these examples providing the proper context of the mapping unit with regards to the
preceding LUT, we now proceed to the performance of a configurable decoder CD(x,z,y,n,α).
TABLE 6.5: Sets S0 and S1 of Z16 for Example 6.3
Sij q ∈ Q z ∈ U
S00 1111111111111111 11111
S01 0101010101010101 01111
S02 0001000100010001 00111
S03 0000000100000001 00011
S04 0000000000000001 00001
S10 1111111111111111 11111
S11 0000000011111111 01111
S12 0000000000001111 00111
S13 0000000000000011 00011
S14 0000000000000001 00001
6.2 Performance of CD(x,z,y,n,α)
In this section we develop general expressions for the delay, gate cost, and subsets that can
be produced by a configurable decoder CD(x,z,y,n,α).
Delay: The delay of $CD(x,z,y,n,\alpha)$ is the sum of the delays due to a $2^x \times z$ LUT
and a $MU(z,y,n,\alpha)$. Therefore we have the following result.
Theorem 6.1 For any $\alpha \geq 1$, a configurable decoder $CD(x,z,y,n,\alpha)$ has a delay of
$O(x + \log z + \alpha(y + \log n))$.
Proof: By Lemma 3.4, a $2^x \times z$ LUT has a delay of $O(x + \log z)$. By Theorem 5.6, the
delay of a mapping unit $MU(z,y,n,\alpha)$ is $O(\alpha(y + \log n))$. Overall, this results in a delay of
$O(x + \log z) + O(\alpha(y + \log n)) = O(x + \log z + \alpha(y + \log n))$.
Remark: In general, $y = O(\log n)$, $x = O(\log n)$, and $z$ is polynomial in $n$. Therefore, the
delay is usually $O(\alpha \log n)$.
TABLE 6.6: (n/α)-bit strings produced from ⌈z/α⌉-bit input strings in CD(x,z,y,n,α)

S^i_j    ⌈z/α⌉-bit input string    (n/α)-bit string produced
S^0_0    11                        1111
S^0_1    01                        0101
S^0_2    01                        0001
S^0_3    00, 01                    0000, 0001
S^0_4    00, 01                    0000, 0001
S^1_0    11                        1111
S^1_1    00, 11                    0000, 1111
S^1_2    00, 11                    0000, 1111
S^1_3    00, 01                    0000, 0011
S^1_4    00, 01                    0000, 0001
Gate Cost: As with delay, the gate cost of $CD(x,z,y,n,\alpha)$ is the sum of the gate costs
of a $2^x \times z$ LUT and a $MU(z,y,n,\alpha)$. We now have the following result.
Theorem 6.2 For any $\alpha \geq 1$, a configurable decoder $CD(x,z,y,n,\alpha)$ has a gate cost of
$O\left(2^x(x + z) + n\left(1 + \frac{y2^y}{\alpha}\right)\right)$.
Proof: By Lemma 3.4, a $2^x \times z$ LUT has a gate cost of $O(2^x(x + z))$. By Theorem 5.6, the gate
cost of a mapping unit $MU(z,y,n,\alpha)$ is $O\left(n\left(1 + \frac{y2^y}{\alpha}\right)\right)$. Overall, this results in a gate cost
of $O(2^x(x + z)) + O\left(n\left(1 + \frac{y2^y}{\alpha}\right)\right) = O\left(2^x(x + z) + n\left(1 + \frac{y2^y}{\alpha}\right)\right)$.
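To get a feel for how the two terms of Theorems 6.1 and 6.2 trade off, the expressions inside the O-notation can be evaluated numerically. The sketch below sets every hidden constant to 1 and takes logarithms to base 2, both of which are assumptions of this sketch, so the numbers indicate growth only and are not gate counts or delays. The pure-LUT function applies the same Lemma 3.4 expression to a LUT whose word width is n.

```python
import math


def cd_delay_term(x, z, y, n, alpha):
    """Expression inside the O() of Theorem 6.1, constants taken as 1."""
    return x + math.log2(z) + alpha * (y + math.log2(n))


def cd_gate_cost_term(x, z, y, n, alpha):
    """Expression inside the O() of Theorem 6.2, constants taken as 1."""
    lut = 2**x * (x + z)
    mapping_unit = n * (1 + y * 2**y / alpha)
    return lut + mapping_unit


def pure_lut_gate_cost_term(x, n):
    """The LUT expression 2^x (x + width) with the width set to n."""
    return 2**x * (x + n)


if __name__ == "__main__":
    x, z, y, n, alpha = 10, 32, 3, 1024, 1
    print(cd_delay_term(x, z, y, n, alpha))
    print(cd_gate_cost_term(x, z, y, n, alpha))
    print(pure_lut_gate_cost_term(x, n))
```

For these illustrative parameters the LUT term of the configurable decoder is far smaller than that of a pure LUT with the same number of rows, because the LUT word is z bits wide rather than n bits wide.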
Producible Subsets: Recall from Chapter 4 that the subsets produced by a decoder can
be broadly divided into two classifications: independent subsets (that is, the set $S$) and
subsets produced by the decoder that are a result of choices made in the configuration of
the decoder (that is, the set $S'$). We extend the results of Chapter 4 here, beginning with
the set $S$ of independent subsets.
Theorem 6.3 A configurable decoder $CD(x,z,y,n,\alpha)$ can produce at least
$\lambda = \min\left\{2^x, \frac{2^y}{\alpha}\left\lfloor\log\frac{z}{\alpha}\right\rfloor\right\}$ independent subsets.
Proof: By Theorem 5.6, a mapping unit $MU(z,y,n,\alpha)$ can produce $\frac{2^y}{\alpha}\left\lfloor\log\frac{z}{\alpha}\right\rfloor$ independent
subsets of $Z_n$. Since each source string can be unique, each of the source strings uses one
of the $2^x$ rows in the LUT preceding the mapping unit. Thus, the number of independent
subsets produced by $CD(x,z,y,n,\alpha)$ is at least $\lambda = \min\left\{2^x, \frac{2^y}{\alpha}\left\lfloor\log\frac{z}{\alpha}\right\rfloor\right\}$.
We now extend the results for the maximum number of subsets producible by a config-
urable decoder CD(x,z,y,n,α).
Theorem 6.4 For $2^x \leq 2^z - 2$ and $y \leq \left\lceil\frac{n}{z-1}\right\rceil - 1$, a configurable decoder $CD(x,z,y,n,\alpha)$
exists that can produce $\Lambda = 2^{x+y}$ distinct subsets of $Z_n$.
Proof: By Theorem 4.3, a $MU(z,y,n,\alpha)$ using Lemmas 4.2 and 4.3 can produce $CY$
subsets, where $Y = 2^y \leq 2^{\lceil n/(z-1)\rceil - 1}$, and $C$ is the number of values, out of the $2^z - 2$
values of $U$, that can result in distinct subsets of $Z_n$. As the LUT can produce a subset of
the $2^z$ values of $U$, then if $2^x = C \leq 2^z - 2$, a configurable decoder consisting of a $2^x \times z$ LUT
and the same $MU(z,y,n,\alpha)$ can produce $CY = 2^{x+y}$ distinct subsets of $Z_n$.
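The counts of Theorems 6.3 and 6.4 are easy to evaluate for concrete parameters. The following sketch assumes base-2 logarithms and simply evaluates the two formulas; it returns `None` when the side conditions of Theorem 6.4 do not hold. The function names are illustrative.

```python
import math


def independent_subsets(x, z, y, alpha=1):
    """Lower bound of Theorem 6.3: min{2^x, (2^y / alpha) * floor(log(z / alpha))}."""
    per_mapping_unit = (2**y / alpha) * math.floor(math.log2(z / alpha))
    return min(2**x, per_mapping_unit)


def total_subsets(x, z, y, n):
    """Lambda = 2^(x+y) of Theorem 6.4, or None if its conditions are not met."""
    ceil_term = -(-n // (z - 1))          # integer ceiling of n / (z - 1)
    if 2**x <= 2**z - 2 and y <= ceil_term - 1:
        return 2**(x + y)
    return None


if __name__ == "__main__":
    # Parameters of the integral examples in Section 6.1: x = 3, z = 4, 2^y = 2, n = 8.
    print(independent_subsets(3, 4, 1))
    print(total_subsets(3, 4, 1, 8))
```

Note that Theorem 6.4 is an existence result: a particular choice of LUT contents and partitions, such as the one in Example 6.1, may produce fewer distinct subsets than $2^{x+y}$.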
These results are now used to establish that, for the same gate cost, our configurable decoder
asymptotically matches or outperforms a pure LUT-based configurable decoder.
6.3 Gate-Cost Constrained Configurable Decoders
In this section we consider a configurable decoder $CD(x,z,y,n,\alpha)$ whose gate cost is $G \geq n$.
We constrain the delay to be $O(\alpha \log n)$ and $G$ to be polynomial in $n$. Also, recall from
Chapter 5 that $z \ll n$ and $\alpha$ is polylogarithmically bounded in $n$, that is, $\alpha = O(\log^k n)$ for
constant $k > 0$. We first derive conditions on $x$, $z$, and $y$ needed to preserve the gate cost
of $G$. Before we proceed, we note the maximum number of independent subsets for a pure
LUT-based configurable decoder.
Lemma 6.1 A pure LUT-based configurable decoder with gate cost $G$ can produce at most
$\Theta\left(\frac{G}{n}\right)$ independent subsets.
Proof: By Lemma 3.4, a $2^{z_L} \times m$ LUT has a gate cost of $O(2^{z_L}(z_L + m))$, where $z_L$ is
the number of input bits, $2^{z_L}$ is the number of rows in the LUT, and $m$ is the number of
output bits (that is, the length of the word in the LUT). Since $m = n$, each independent
subset requires one row in the LUT. This results in $O(2^{z_L}(z_L + n))$ cost. For a cost of $G$, we
have $2^{z_L} = \Theta\left(\frac{G}{n}\right)$. This implies that the maximum number of rows (and thus, independent
subsets) in the LUT is $\Theta\left(\frac{G}{n}\right)$.
Remark: For a pure LUT-based configurable decoder, the number of independent subsets
λ is also the total number of subsets producible, Λ.
From Theorem 6.1, a delay of $O(\alpha \log n)$ implies $x + \log z + \alpha(y + \log n) = O(\alpha \log n)$, or
$$\frac{x}{\alpha} + \frac{\log z}{\alpha} + y = O(\log n). \qquad (6.1)$$
Since $z \leq n$, $\log z = O(\log n)$ is guaranteed. The constraint that $\frac{x}{\alpha} + y = O(\log n)$ implies
that $x + y$ is polylog in $n$ (as $\alpha$ is polylog in $n$). This is consistent with the fact that the
number of pins entering the configurable decoder must be small.
From Theorem 6.2, a gate cost of $G$ implies that
$$2^x(x + z) = O(G) \qquad (6.2)$$
and
$$n\left(1 + \frac{y2^y}{\alpha}\right) = O(G). \qquad (6.3)$$
From Equation 6.2 we have
$$2^x = O\left(\frac{G}{\log G}\right) = O\left(\frac{G}{\log n}\right) \qquad (6.4)$$
and
$$z = O\left(\frac{G}{2^x}\right). \qquad (6.5)$$
From Equation 6.3, we have
$$2^y = O\left(\frac{G\alpha/n}{\log(G\alpha/n)}\right). \qquad (6.6)$$
Since the number of independent subsets is $\Theta(2^y \log z)$ and the cost of the LUT increases
with $z$, we need a large value of $z$ (to get a larger number of independent subsets), but not
so large that the LUT becomes too expensive. Select
$$z = \Theta(n^\epsilon) \qquad (6.7)$$
for some small constant $\epsilon > 0$, so that $\log z$ is still $\Theta(\log n)$ but the contribution of $z$ to the
LUT cost is $\Theta(n^\epsilon)$. Since $x = O(\log^k n)$, for constant $k$, we have $x + z = \Theta(z)$. So, from
Equation 6.5 select
$$2^x = \Theta\left(\frac{G}{z}\right) = \Theta\left(\frac{G}{n^\epsilon}\right). \qquad (6.8)$$
Clearly this will result in $\Theta(\log n)$ delay and $\Theta(G)$ gate cost.
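The number of independent subsets produced is $\Theta\left(\min\left\{2^x, \frac{2^y}{\alpha}\log\frac{z}{\alpha}\right\}\right)$ (see Theorem 6.3),
which is
$$\Theta\left(\min\left\{\frac{G}{n^\epsilon}, \frac{G \log(n^\epsilon/\alpha)}{n \log(G\alpha/n)}\right\}\right).$$
Observe that $\alpha = o(n^\delta)$ for every constant $\delta > 0$. Therefore,
$$\Theta\left(\min\left(\frac{G}{n^\epsilon}, \frac{G \log(n^\epsilon/\alpha)}{n \log(G\alpha/n)}\right)\right) = \Theta\left(\min\left(\frac{G}{n^\epsilon}, \frac{G \log n}{n \log(G/n)}\right)\right).$$
Note that for asymptotically large $n$, $\log\left(\frac{n^\epsilon}{\alpha}\right) = \Theta(\log n)$ and $\log\left(\frac{G\alpha}{n}\right) = \Omega(1)$. So,
$$\frac{G \log(n^\epsilon/\alpha)}{n \log(G\alpha/n)} = O\left(\frac{G \log n}{n}\right) = O\left(\frac{G}{n^\epsilon}\right).$$
So for asymptotically large $n$, the number of independent subsets is
$$\Theta\left(\frac{G \log n}{n \log(G/n)}\right) = \Theta\left(\frac{G \log n}{n \log(G\alpha/n)}\right).$$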
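If $\frac{G}{n} = \Theta(\log^\sigma n)$ for constant $\sigma > 0$, then $\log\left(\frac{G}{n}\right) = \Theta(\log \log n)$. Here, the number
of independent subsets is $\Theta\left(\frac{G \log n}{n \log \log n}\right)$ (as $\alpha$ is polylog in $n$), while the maximum number
of dependent subsets can be as large as
$$\Theta(2^{x+y}) = \Theta\left(\frac{G}{n^\epsilon} \cdot \frac{G\alpha}{n} \cdot \frac{1}{\log \log n}\right) = \Theta\left(\frac{n^{1-\epsilon} \log^{2\sigma} n}{\log \log n}\right).$$
On the other hand, if $G = \Theta(n^\sigma)$ for any $\sigma > 0$, then the number of independent subsets is
$\Theta\left(\frac{G}{n}\right)$, while the number of dependent subsets can be as large as
$\Theta(2^{x+y}) = \Theta\left(\frac{n^\delta}{\log^\psi n}\right)$, for constants $\delta, \psi > 0$.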
From the above discussion and Lemma 6.1, we have the following result that establishes
the advantages of our configurable decoder compared to the pure LUT-based solution.
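Lemma 6.2 A configurable decoder $CD(x,z,y,n,\alpha)$ with polylogarithmically bounded $\alpha$ and
polynomially bounded gate cost $G \geq n$ produces at least $\lambda$ independent subsets, where
$$\lambda = \begin{cases} \dfrac{G \log n}{n \log \log n}, & \text{if } \frac{G}{n} \text{ is polylogarithmically bounded in } n \\[2ex] \dfrac{G}{n}, & \text{otherwise,} \end{cases}$$
and it is capable of producing a total number of $\Lambda$ subsets, where
$$\Lambda = \begin{cases} \dfrac{G}{n}\left(\dfrac{n^\epsilon \log^\sigma n}{\log \log n}\right), & \text{if } \frac{G}{n} \text{ is polylogarithmically bounded in } n \\[2ex] \dfrac{G}{n}\left(\dfrac{n^\epsilon}{\log^\sigma n}\right), & \text{otherwise,} \end{cases}$$
where $\epsilon, \sigma > 0$ are constants.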
Remark: Since the total number of dependent subsets depends on the value of 2x, a different
choice in the values of z may allow 2x to be slightly larger, thereby also increasing the number
of total subsets producible by a configurable decoder. However, this would also decrease the
number of independent subsets; therefore, we do not consider it here.
From Lemma 6.2, we have the following.
Theorem 6.5 Let $P$ be a pure LUT-based configurable decoder and let $C$ be the proposed
configurable decoder, each producing subsets of $Z_n$. If both decoders have a gate cost of
$\Theta(G)$, where $G \geq n$, then
a) if $G = \Theta(n \log^\sigma n)$, then $C$ produces a factor of $\Theta\left(\frac{\log n}{\log \log n}\right)$ more independent subsets
than $P$ and is capable of producing a factor of $\Theta\left(\frac{n^\epsilon \log^\sigma n}{\log \log n}\right)$ more dependent subsets, for any
constant $0 \leq \epsilon < 1$;
b) if $G = n^{1+\sigma}$, then $C$ produces the same order of independent subsets as $P$ and is
capable of producing up to $\Theta\left(\frac{G}{n}\left(\frac{n^\epsilon}{\log^\sigma n}\right)\right)$ dependent subsets, for any constant $0 \leq \epsilon < 1$.
This chapter has shown that the proposed configurable decoder has substantial advantages
over both fixed and pure LUT-based configurable decoders.
Chapter 7
Implementations of Useful Subsets
Many applications and algorithms display standard patterns of resource use. For example,
consider a binary tree reduction, shown in Figure 7.1 [11]. The number of resources is
reduced by a factor of two in each level of the tree; Figure 7.1(a) and (b) illustrate this
for two particular reductions. The bit patterns representing these reductions are also shown,
where a bit has a value of ‘1’ if it survives the reduction at a particular level in the tree
and a value of ‘0’ if it does not.

FIGURE 7.1: Two binary tree reductions of n = 8 elements. [(a) Corresponding n-bit patterns:
00000001, 00010001, 01010101, 11111111. (b) Corresponding n-bit patterns: 00000001, 00000011,
00001111, 11111111.]
Communication patterns can also induce subsets. For example, if a node can either
send or receive in a given communication, but not both simultaneously, then for an
ASCEND/DESCEND pattern of communications [1] we have the send/receive pairs shown in
Figure 7.2. The subsets represent a set of processors that may be sending (or receiving)
simultaneously.

FIGURE 7.2: ASCEND/DESCEND communication pairs for n = 8. [Corresponding n-bit patterns:
10101010, 01010101, 11001100, 00110011, 11110000, 00001111.]
In this chapter we examine three useful classes of subsets, namely (1) Binary Reduction
(Section 7.1), (2) ASCEND/DESCEND (Section 7.2), and (3) 1-hot (Section 7.3). We
examine ways of implementing these classes of subsets in mapping units as an indication of
where the mapping unit can successfully take advantage of patterns in communication and
where certain patterns pose challenges.
7.1 Binary Reduction
As illustrated previously, the class of binary tree based reduction algorithms reduces the
number of resources by a factor of two in each level of the algorithm. This reduction can
occur in a variety of ways; regardless, all binary tree based reductions have the following
properties.
1. For any set $S$ of subsets with n-bit patterns characterizing a binary tree based
reduction, the number of subsets in the set is $\log n + 1$ (the additional subset comes from
including the root of the binary tree).
2. Assume that if $S_i(j) = 1$, then resource $j$ participates in the reduction. Then, if the
subsets in $S$ are ordered such that $S_0 \in S$ corresponds to the state in the reduction
where only 1 resource exists, $S_1$ corresponds to the state in the reduction where $2^1 = 2$
resources participate, and so on, then the number of bits with a value of ‘1’ in subset
$S_i$ is $2^i$. Also, for each $i < \log n$, $S_i \subset S_{i+1}$.
As an example of this, consider the two binary tree based reductions illustrated previously,
and shown again here in Table 7.1. Consider the set $S^2_0$ and the set $S^2_1$. Here, $i = 2$; thus,
the number of bits with a value of ‘1’ in the n-bit pattern is $2^2 = 4$, which is what is shown.

TABLE 7.1: Two binary tree based reduction patterns

S^i_0    n-bit pattern    S^i_1    n-bit pattern
S^0_0    00000001         S^0_1    00000001
S^1_0    00010001         S^1_1    00000011
S^2_0    01010101         S^2_1    00001111
S^3_0    11111111         S^3_1    11111111
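The two properties can be verified for every level of both reductions in Table 7.1, not only for i = 2. The sketch below is a plain check written for this purpose; the names are illustrative, and it assumes n is a power of two.

```python
REDUCTION_0 = ["00000001", "00010001", "01010101", "11111111"]   # left half of Table 7.1
REDUCTION_1 = ["00000001", "00000011", "00001111", "11111111"]   # right half of Table 7.1


def positions(pattern):
    """Set of bit positions holding '1' (position n-1 is the leftmost character)."""
    n = len(pattern)
    return {n - 1 - i for i, c in enumerate(pattern) if c == "1"}


def is_binary_tree_reduction(patterns):
    """Check properties 1 and 2: log n + 1 subsets, 2^i ones at level i, nested levels."""
    n = len(patterns[0])
    sets = [positions(p) for p in patterns]
    if len(sets) != n.bit_length():            # equals log2(n) + 1 when n is a power of two
        return False
    sizes_ok = all(len(s) == 2**i for i, s in enumerate(sets))
    nested = all(sets[i] < sets[i + 1] for i in range(len(sets) - 1))   # proper subsets
    return sizes_ok and nested


if __name__ == "__main__":
    print(is_binary_tree_reduction(REDUCTION_0), is_binary_tree_reduction(REDUCTION_1))
```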
From this, we can conclude that a mapping unit with a single $(\log n + 1)$-block partition
$\pi$ can produce all $\log n + 1$ subsets with $\log n + 1$ source strings. Note that the product of
any two partitions induced by the subsets of $S$ results in exactly one new block, as exactly
$2^{i-1}$ bits are different between $S^{i-1}_k$ and $S^i_k$, and all $2^{i-1}$ bits that are different have the same
value. Hence, $\pi_{S^{i-1}_k}\pi_{S^i_k}$ results in one new block to account for these $2^{i-1}$ bits.
For example, consider the set $S_0$. Then, the induced partitions for the subsets are
$\pi_{0,0} = \{\{7, 6, 5, 4, 3, 2, 1\}, \{0\}\}$, $\pi_{0,1} = \{\{7, 6, 5, 3, 2, 1\}, \{4, 0\}\}$,
$\pi_{0,2} = \{\{7, 5, 3, 1\}, \{6, 4, 2, 0\}\}$, and $\pi_{0,3} = Z_n$. Note that the product $\pi_{0,0}\pi_{0,1}$ results in
$2^0 = 1$ bit in a new block (bit 4); the product $\pi_{0,0}\pi_{0,1}\pi_{0,2}$ results in $2^1 = 2$ bits in a new
block (bits 6 and 2), and so on.
From this illustration, we can also note that if a single configurable decoder is to produce
two or more such binary tree based reductions, then the $(\log n + 1)$-partitions can be ordered
such that the same $\log n + 1$ source strings produce any of the sets, as source string $i$ contains
the same number of 1’s and 0’s corresponding to the blocks in the partition regardless of the
layout of resource allocation in a binary tree based reduction.
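The construction described above can be sketched in Python: each level of the reduction contributes the bits it newly activates as one block, and the blocks are then ordered so that the same source strings serve every reduction. The ordering chosen here (most recently added block first) is one possible choice and matches the partitions used in Example 6.2; the function names are illustrative.

```python
def positions(pattern):
    """Set of bit positions holding '1' (position n-1 is the leftmost character)."""
    n = len(pattern)
    return {n - 1 - i for i, c in enumerate(pattern) if c == "1"}


def mu(u, partition, n):
    """Copy bit i of source string u to every position in block i (bit n-1 is leftmost)."""
    out = ["0"] * n
    for bit, block in zip(u, partition):
        for pos in block:
            out[n - 1 - pos] = bit
    return "".join(out)


def partition_from_reduction(patterns):
    """Build one ordered partition and its source strings from a reduction.

    patterns must be ordered by level: 1 one, then 2 ones, then 4 ones, and so on.
    """
    sets = [positions(p) for p in patterns]
    blocks = [sets[0]] + [sets[i] - sets[i - 1] for i in range(1, len(sets))]
    partition = blocks[::-1]                       # block added last comes first
    z = len(partition)
    sources = ["0" * (z - 1 - i) + "1" * (i + 1) for i in range(z)]
    return partition, sources


if __name__ == "__main__":
    for chain in (["00000001", "00010001", "01010101", "11111111"],
                  ["00000001", "00000011", "00001111", "11111111"]):
        part, srcs = partition_from_reduction(chain)
        print(part, srcs, [mu(u, part, 8) for u in srcs])
```

Both reductions yield the same source strings 0001, 0011, 0111, and 1111; only the ordered partition differs, which is the point made in the paragraph above.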
7.2 ASCEND/DESCEND
The subsets of the ASCEND/DESCEND class of communications (see Figure 7.2) are
more difficult than those of the binary tree based reduction for a mapping unit to
produce. This is because the product of all induced partitions of the $2 \log n$ subsets of the
ASCEND/DESCEND class of communications results in an n-partition of $Z_n$; as $z \ll n$,
this cannot be represented by a single partition.
One method of generating these subsets is to use $\frac{\log n}{\log z}$ z-partitions, each with $2 \log z$
source strings (where $z$ is a power of 2, say $z = 2^k$). Note that for a given level of the
ASCEND/DESCEND communications, the send/receive pairs are complements; since all
bit positions have different values between the two subsets for a given level, a single
2-partition can represent both subsets with 2 source strings. For example, the partition for
the first level of communications is $\pi_1 = \{\{7, 5, 3, 1\}, \{6, 4, 2, 0\}\}$. Taken for $\log z$ such levels,
this results in a single z-partition that with $2 \log z$ source strings can produce $2 \log z$ of the
different $2 \log n$ subsets. For example, consider $z = 4$. Then, $\log z = 2$, which implies that
two levels can be represented by a single partition. If a partition represents levels one and
two, then this results in the partition $\pi = \{\{7, 3\}, \{6, 2\}, \{5, 1\}, \{4, 0\}\}$.
Taken for all $2 \log n$ subsets, this results in a total of $\frac{\log n}{\log z}$ such partitions, and a total of
$2 \log z$ source strings. Table 7.2 illustrates a possible ordering of the partitions and source
strings for the ASCEND/DESCEND bit patterns shown in Figure 7.2.
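The method can be sketched as follows: the address bits of a position are taken in groups of log z, each group defines one partition (positions that agree on those address bits share a block), and each address bit in the group contributes a source string and its complement. The sketch assumes n and z are powers of two. For the last group of n = 8, z = 4 it emits 2-bit source strings, whereas Table 7.2 pads them to z bits with don't-care values. The function names are illustrative.

```python
def mu(u, partition, n):
    """Copy bit i of source string u to every position in block i (bit n-1 is leftmost)."""
    out = ["0"] * n
    for bit, block in zip(u, partition):
        for pos in block:
            out[n - 1 - pos] = bit
    return "".join(out)


def ascend_descend(n, z):
    """Return (partition, source string, n-bit pattern) triples for all 2 log n subsets."""
    k = z.bit_length() - 1                 # log2(z): address bits covered per partition
    levels = n.bit_length() - 1            # log2(n): number of communication levels
    result = []
    for lo in range(0, levels, k):
        width = min(k, levels - lo)
        mask = (1 << width) - 1
        labels = range(mask, -1, -1)       # block labels, highest first
        partition = [{p for p in range(n) if (p >> lo) & mask == v} for v in labels]
        for j in range(width):
            ones = "".join(str((v >> j) & 1) for v in labels)
            complement = "".join("1" if c == "0" else "0" for c in ones)
            for src in (ones, complement):
                result.append((partition, src, mu(src, partition, n)))
    return result


if __name__ == "__main__":
    for partition, src, pattern in ascend_descend(8, 4):
        print(partition, src, pattern)
```

For n = 8 and z = 4 this produces the six patterns of Figure 7.2 from two partitions.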
7.3 1-Hot
Recall from Section 3.2.2 that a set of 1-hot subsets is a set of n-bit subsets of Zn, where
each of the n-bit patterns has only one active bit (usually with a value of ‘1’), all other bits
being inactive (usually ‘0’). Table 7.3 illustrates this for n = 16.
Even though the 1-hot sets are easy to produce in a conventional fixed decoder, they
present one of the more difficult classes for our configurable decoder. Note that each subset
of the 1-hot set has an induced partition with 2 blocks, where one block contains the bit
position of the bit with a value of ‘1’ and the other block contains all other bit positions.
Without loss of generality, assume that block $B_0$ is the single element block in each induced
partition. Since each subset has a different bit position with a value of ‘1’, each induced
partition has a different bit position in block $B_0$. Using the method from Section 4.1.2
(page 43), each product $\pi_i\pi_j$ would result in a partition with an additional block. Taken
for all $n$ partitions, this would result in an n-partition; clearly, this is difficult for a mapping
unit to produce as each partition used by it has at most $z \ll n$ blocks.

TABLE 7.2: Partitions and source strings generated for ASCEND/DESCEND bit patterns,
for n = 8 and z = 4

S_i   π                                   Source string   Bit pattern
S_0   ⟨{7, 3}, {6, 2}, {5, 1}, {4, 0}⟩    1010            10101010
S_1                                       0101            01010101
S_2                                       1100            11001100
S_3                                       0011            00110011
S_4   ⟨{7, 6, 5, 4}, {3, 2, 1, 0}⟩        dd10            11110000
S_5                                       dd01            00001111

d denotes a don’t-care value
As noted in Section 7.2, the ASCEND/DESCEND class of subsets also induces an
n-partition; however, unlike that class, we have a simpler solution here. One method of
producing the 1-hot subsets in a configurable decoder is to use a LUT with $2^x = n$ rows (or
$x = \log n$). By Lemma 3.4, a LUT contains a 1-hot address decoder. Since a configurable
decoder $CD(x,z,y,n,\alpha)$ contains a $2^x \times z$ LUT, with $n = 2^x$, a simple switch allowing the
output of the LUT’s address decoder to be the output of the configurable decoder allows
the configurable decoder to produce the 1-hot subsets. We develop a slightly
different solution in Section 9.1.
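A minimal sketch of this bypass idea is given below. The address decoder function models the 1-hot row-select lines inside the LUT; the `one_hot_mode` flag and the `normal_path` callable are hypothetical stand-ins for the switch and for the usual LUT-plus-mapping-unit path, and are not taken from the thesis.

```python
def address_decoder(a, x):
    """1-hot decode of an x-bit address: only position a of the 2^x outputs is '1'."""
    n = 2**x
    if not 0 <= a < n:
        raise ValueError("address out of range")
    return "".join("1" if pos == a else "0" for pos in range(n - 1, -1, -1))


def decoder_output(a, x, one_hot_mode, normal_path=None):
    """Select between the LUT's row-select lines and the normal decoder output."""
    if one_hot_mode:
        return address_decoder(a, x)       # bypass: row-select lines become the output
    if normal_path is None:
        raise ValueError("normal_path must be supplied when one_hot_mode is off")
    return normal_path(a)                  # hypothetical LUT + mapping unit path


if __name__ == "__main__":
    for a in range(16):
        print(a, decoder_output(a, 4, one_hot_mode=True))
```

With x = 4 the loop prints the sixteen subsets of Table 7.3.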
TABLE 7.3: A set of 1-hot subsets of Z16
Si n-bit value
S0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
S1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
S2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
S3 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
S4 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
S5 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
S6 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
S7 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
S8 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
S9 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
S10 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
S11 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
S12 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
S13 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
S14 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
S15 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Chapter 8
Simulation Results
This chapter presents simulation results for the configurable decoders of Chapter 6. While
the previous chapters analytically established the validity of our approach, the simulations in
this chapter allow the constants hidden by the asymptotic notation in the cost equations to
be analyzed and trends in the data to be extrapolated. The aim of the simulations is twofold:
(1) to compare our solutions to existing solutions and (2) to derive reasonable predictors of
the constants (cost factors independent of problem size n) across technologies. While these
results are specific to our implementation, they are nevertheless good predictors of trends
that may be expected from state-of-the-art technology.
Section 8.1 outlines our simulation methodology, including the parameters for all sim-
ulations and an explanation as to why they were chosen, details regarding the CAD tools
used, and the analysis methods. Section 8.2 provides the simulation results for both inte-
gral decoders (Section 8.2.1) and bit-slice decoders (Section 8.2.2). Finally in Section 8.3,
regression functions and expected trends are illustrated.
8.1 Methodology
In this section, we detail the rationale for our choice of simulation parameters, including
details of problem size, CAD tools, and analysis methods.
Choice of Problem Parameters: As noted above, one of the aims of the simulations is
to compare existing decoders to the proposed solutions. The 1-hot decoder is simple and one
of the most widely used. Its Θ(logn) delay and Θ(n logn) gate cost are used as baselines
against which other delays and gate costs are compared. However, the 1-hot decoder has no
flexibility (λ = 0, see Section 3.1.1) in terms of the types of subsets it can produce. Among
the configurable decoders, we expect the configurable decoder with a fixed mapping unit to
have the best performance (measured as the number of independent subsets for a given gate
cost). This is because all the configurable decoders can be wired or reconfigured to produce
any set of $2^y \log z$ independent subsets; however, a configurable decoder based on a fixed
mapping unit has the lowest constant terms in its gate cost, as it needs no configuration LUT.
In addition to the 1-hot decoder and the configurable decoder with a fixed mapping unit,
the modules we study in this chapter include the pure LUT-based configurable decoder (see
Section 3.3) and the configurable decoder with a reconfigurable mapping unit (Section 5.2).
We also separately consider a universal configurable decoder with reconfigurable mapping
unit and a configurable decoder with fixed mapping unit with similar design parameters.
We do not expressly simulate the bit-slice configurable decoder, but outline an approach to
derive results for it. For each decoder, we let $n = 2^k$ for k ≥ 2. For most simulations, system
restrictions allow values up to n = 256.
Consider the decoders shown in Figure 8.1. All decoders have n-bit outputs. Of these, the
value of n completely specifies only the 1-hot decoder. For the configurable decoders, we use
the number of independent subsets generated as a common thread.
Consider the configurable decoder with fixed mapping unit of Figure 8.1(c). Let its gate
cost be $G = n \log n$. Then, from Section 6.3, we have
$$2^{y_f} = \frac{G/n}{\log(G/n)} \approx \frac{\log n}{\log\log n}.$$
Also let $z_f = n^{\epsilon}$ (Section 6.3). Clearly, $\epsilon$ needs to be as large as possible. The value of $x_f$ is bounded
by the cost of the LUT in the configurable decoder, so we set $2^{x_f} = \frac{G}{z_f} = \frac{G}{n^{\epsilon}}$. However, the
number of independent subsets produced, $\lambda_f = 2^{y_f} \log z_f$, must be no more than $2^{x_f} = \frac{G}{n^{\epsilon}}$.
Putting these together we have
$$2^{x_f} = \frac{G}{n^{\epsilon}} \ge 2^{y_f} \log z_f \approx \frac{\epsilon \log^2 n}{\log\log n}.$$
Thus, for each value of n, we need to find the largest value of $\epsilon$ such that
$$\frac{n^{1-\epsilon}}{\epsilon} \ge \frac{\log n}{\log\log n}.$$
Table 8.1 shows the values of n, the corresponding $\epsilon$, and the values of $n^{\epsilon}$,
$\frac{\epsilon \log^2 n}{\log\log n}$, and $\frac{\log n}{\log\log n}$
(the last three being indicative of the values of $z_f$, $\lambda_f \approx 2^{x_f}$, and $2^{y}$, respectively).
With this table as the guideline, we now derive values of $z_f$, $x_f$, and $y_f$ that produce
$\lambda_f$ independent subsets for a configurable decoder with fixed mapping unit. Table 8.2 shows
these values. In deriving these values, we made simple approximations (using $\lceil\,\rceil$ and $\lfloor\,\rfloor$) to
ensure that parameters (signal sizes, number of subsets, etc.) are integers. We now use $\lambda_f$
for each value of n as the basis to derive parameters for other configurable decoders. In fact,
we adjust the parameters to make the number of independent subsets λ the same across all
configurable decoders.
FIGURE 8.1: Block diagrams of all decoders simulated: (a) 1-hot, (b) pure LUT-based, (c)
configurable decoder with FMU, and (d) configurable decoder with RMU
TABLE 8.1: Parameter values for a configurable decoder with FMU, for $G = n \log n$
n $\epsilon$ $n^{\epsilon}$ $\frac{\epsilon \log^2 n}{\log\log n}$ $\frac{\log n}{\log\log n}$
4 0.7284997 2.754537 2.914 2
8 0.8002940 5.28126 4.54437 1.89279
16 0.8210948 9.74309 6.56876 2
32 0.8318129 17.865 8.95606 2.15338
64 0.8395750 32.8415 11.6925 2.32112
128 0.8461295 60.6698 14.7685 2.49345
256 0.8520038 112.676 18.1761 2.6667
TABLE 8.2: Parameter values for a configurable decoder with fixed mapping unit (CDF)
n $x_f$ $y_f$ $z_f$ $\lambda_f$
4 2 1 3 3
8 3 1 6 5
16 3 1 10 7
32 4 2 18 9
64 4 2 33 12
128 4 2 61 15
256 5 2 113 19
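The $\epsilon$ column of Table 8.1 can be reproduced numerically. The following Python sketch is an illustration only, not the procedure used for the thesis: it assumes base-2 logarithms and uses bisection to find the largest $\epsilon$ with $n^{1-\epsilon}/\epsilon \ge \log n / \log\log n$. The function names `largest_epsilon` and `table_row` are hypothetical.

```python
import math


def largest_epsilon(n, iters=200):
    """Largest eps in (0, 1) with n**(1 - eps) / eps >= log n / log log n.

    Assumes base-2 logarithms and n >= 4. The left-hand side decreases as
    eps grows, so bisection on the boundary yields the largest feasible eps.
    """
    lg = math.log2(n)
    target = lg / math.log2(lg)
    lo, hi = 1e-9, 1.0          # feasible at lo; infeasible at hi (LHS = 1)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if n ** (1 - mid) / mid >= target:
            lo = mid            # still feasible: move right
        else:
            hi = mid
    return lo


def table_row(n):
    """(eps, n**eps, eps*log^2 n / log log n, log n / log log n) for one n."""
    eps = largest_epsilon(n)
    lg = math.log2(n)
    ll = math.log2(lg)
    return eps, n ** eps, eps * lg ** 2 / ll, lg / ll


if __name__ == "__main__":
    for n in (4, 8, 16, 32, 64, 128, 256):
        print(n, *(round(v, 5) for v in table_row(n)))
```

Rounding the second and third entries of each row up or down gives candidate integer values for $z_f$ and $\lambda_f$; the exact rounding used for Table 8.2 is not specified here.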
For a pure LUT-based configurable decoder (see Figure 8.1(b)) the number of independent
subsets is $\lambda_\ell = 2^{x_\ell}$. If we set $\lambda_\ell = \lambda_f$, then
$$x_\ell = \log \lambda_\ell = \log\left(\frac{\epsilon \log^2 n}{\log\log n}\right) = \log\epsilon + 2\log\log n - \log^{(3)} n.$$
Table 8.3 shows the values for $x_\ell$ and $\lambda_\ell$ substituted for the LUT. For example, with n = 4,
we have $x_\ell = 2$ and $\lambda_\ell = 3$. Though a LUT with a 2-bit address can have 4 locations, our simulation
ensured that only $\lambda_\ell = 3$ locations were used in the LUT.
TABLE 8.3: Parameter values for a $\lambda_\ell \times n$ LUT
n $x_\ell$ $\lambda_\ell$
4 2 3
8 3 5
16 3 7
32 4 9
64 4 12
128 4 15
256 5 19
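The address widths in Table 8.3 are consistent with rounding $\log \lambda_\ell$ up to the next integer. A minimal Python sketch of that rounding follows; the helper name `address_bits` is an assumption, not something defined in the thesis.

```python
def address_bits(num_rows):
    """Smallest x with 2**x >= num_rows, i.e. ceil(log2(num_rows))."""
    if num_rows < 1:
        raise ValueError("a LUT needs at least one row")
    return (num_rows - 1).bit_length()


# lambda values copied from Table 8.3, keyed by n
LAMBDA = {4: 3, 8: 5, 16: 7, 32: 9, 64: 12, 128: 15, 256: 19}
X_BITS = {n: address_bits(lam) for n, lam in LAMBDA.items()}
```

Using integer bit lengths avoids floating-point rounding issues that `math.log2` followed by `math.ceil` could introduce for exact powers of two.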
For the configurable decoder with reconfigurable mapping unit (see Figure 8.1(d)), we
have $x_r = x_f$, $y_r = y_f$, and $z_r = z_f$ (the same values as in Table 8.2). However, since the
configurable decoder with reconfigurable mapping unit uses a $2^{y_r} \times n y_r$ LUT, Table 8.4 also
shows $n y_r$.
TABLE 8.4: Parameter values for a configurable decoder with reconfigurable mapping unit
n xr yr zr nyr λr
4 2 1 3 4 3
8 3 1 6 8 5
16 3 1 10 16 7
32 4 2 18 96 9
64 4 2 33 128 12
128 4 2 61 256 15
The universal version of the configurable decoder with reconfigurable mapping unit sets
$2^{y_r} = z_r$ (see Section 5.2). We chose the value of $y_r$ such that $\lambda_r = 2^{y_r} \log z_r = y_r 2^{y_r} = \lambda_f$.
Table 8.5 shows the values for this configurable decoder.
TABLE 8.5: Parameter values for a universal configurable decoder with reconfigurable mapping unit
n xru yru zru nyru λru
4 2 2 3 8 3
8 3 2 4 16 5
16 3 2 4 32 7
32 4 3 6 96 9
64 4 3 6 192 12
128 4 3 8 384 15
We also tested a configurable decoder with fixed mapping unit using the same parameters
for y and z as in the universal configurable decoder with reconfigurable mapping unit.
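The $y_{ru}$ and $n y_{ru}$ columns of Table 8.5 are consistent with taking the smallest integer y for which $y 2^{y} \ge \lambda_f$. The Python sketch below makes that assumption explicit; the rounding rule and the function name `universal_y` are inferred from the table, not stated in the text, and the $z_{ru}$ column is not derived here.

```python
def universal_y(lam):
    """Smallest integer y >= 1 with y * 2**y >= lam (assumed rounding rule)."""
    y = 1
    while y * 2 ** y < lam:
        y += 1
    return y


# lambda_f values copied from Table 8.2, keyed by n
LAMBDA_F = {4: 3, 8: 5, 16: 7, 32: 9, 64: 12, 128: 15}

# n -> (y_ru, n * y_ru), the width of the configuration LUT word
PARAMS = {n: (universal_y(lam), n * universal_y(lam))
          for n, lam in LAMBDA_F.items()}
```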
Regression Analysis: The simulations described above were used for the nonlinear re-
gression analysis. For most cases, the data obtained was sufficient to produce steady state
results.
Simulation Environment: Figure 8.2 shows the basic structure of the simulation pro-
cess. We briefly describe its components.
Source Code Development: All decoders were defined in the Verilog Hardware Description
Language, or Verilog HDL (for example, see [10]).
Functional Testing: A functional verification of the hardware description files took place
using the Cadence NC-Verilog tool [8]. This provides a verified template for n = 16
and selections of other parameter values. This template was modified as described
below.
FIGURE 8.2: Simulation process (blocks: Source Code Development, Functional Testing, Verified Templates, Instantiator, Controller, Command File, Synthesis Tool, Output, Archival Unit, Archived Instance Output Files)
Instantiator: This UNIX shell script creates a new set of files from a given verified template
for a given set of parameters. For example, if a verified template with n = 16 uses 16
multiplexers called MUX 0 . . . MUX 15, when n = 32, we have MUX 0 . . . MUX 31,
all of which must be separately defined. This script automates this file conversion.
Controller: The set of input parameters is provided to the instantiator by a controller
script, which systematically steps through the feasible parameter values (described in
Tables 8.2–8.5) for the different configurable decoder implementations. These parameter
combinations are also used to customize the commands to the synthesis tool and to
save the outputs systematically.
Synthesis Tool: The synthesis of the hardware was performed using the Cadence Physically
Knowledgeable Synthesis (PKS) tools [7]. Cadence PKS performs a physical mapping
of a hardware description to a given process and technology, and using this mapping,
derives the overall area, delay, and power consumption of the design in terms of square
microns, nanoseconds, and milliwatts, respectively. The area and number of gates
would differ primarily in situations where the interconnects dominate the area. In
all our designs, the interconnects (wires) occupied an insignificant part of the area.
Consequently, the area data is also indicative of the number of gates. In the synthesis
of our designs, we used a 0.25 µm process technology library developed by Artisan
Components, Inc.1 For each hardware design, we performed one synthesis optimizing
for area and another optimizing for delay. It was found that for our designs, there were
no significant differences between the different optimizations (typically, a small number
of gates and one or two hundredths of a nanosecond were the differences between the
two cases); hence, the results presented here, for all simulations, were optimized for
area.
Archival Unit: This script uses the current parameters instance (provided by the controller)
to save the simulation output in an appropriately named file.
1The technology library used, “demo25,” copyright 2000 Artisan Components, Inc.
As the specific internal connections of all mapping units (except the universal mapping
unit) are not fixed and determine the functionality of the mapping unit, a variety of
connection choices were made that distributed the wirings differently across multiplexers.
Figure 8.3 shows some basic connection patterns between source string bits and MUXs. We
used these basic schemes and their combinations. While a different distribution might reduce
the resulting area slightly, overall it was found not to make a significant difference.
FIGURE 8.3: Wire distributions in simulated mapping units: (a) Local Clusters, (b) Shuffle,
(c) ASCEND, (d) Reduction.
We note the following assumptions and limitations that occurred during the synthesis of
the designs.
1. As we did not have access to a memory generator, all synthesized memory elements
(including LUTs) resulted in arrays of sequential elements (flip-flops). Additionally,
the synthesis was not able to create a memory element with single port read and
write capabilities; thus all memory cells were dual-ported. This results in a substantial
increase in the size of the memory generated over what would be expected of traditional
implementations (such as SRAM). In Section 8.2 we provide an interpretation of the
data that, to an extent, alleviates this concern.
2. The implementation of the fan-in and fan-out of signals was left up to the synthesis
tool. For some particular designs and for some particular values of n, the designs could
be optimized effectively; in other cases, it was apparent that the fan-out of the signals
resulted in drivers with a large delay.
3. The system executing the simulations was unable to synthesize certain designs for
n = 256 and all designs for n > 256. Hence, any trends are derived from data points
in the range 2 ≤ log n ≤ 7 (or 2 ≤ log n ≤ 8 for the designs that could be synthesized
at n = 256).
However, despite these limitations, this chapter still demonstrates (a) a comparison of
the performance of the configurable decoder to the current state of the art on a relatively
level playing field and (b) an observation of the trends in that performance, and with some
extrapolation, a prediction of future trends with newer technology files.
8.2 Simulations
In this section we present simulation data for the delay, area (raw and adjusted for memory
implementation), and power consumption for different modules. The data is categorized by
module name and the value of n. Other parameters needed to determine the module (such
as x and y for configurable decoders) have been specified in Tables 8.2–8.5. The first set of
data is for the following integral decoders.
1. 1-hot decoder (1-Hot)
2. Pure LUT-based decoder (LUT)
3. Configurable decoder with a fixed mapping unit (CDF)
4. Configurable decoder with a reconfigurable mapping unit (CDR)
5. Universal configurable decoder (Univ.)
6. Counterpart of universal decoder with a fixed mapping unit (F-Univ.)
The second set of data is for the bit-slice configurable decoders. As there are two
independent variables used in the construction of bit-slice decoders (n and α), the data presented
is (a) the LUT configurations used in the bit-slice decoders (3–6 above), (b) mapping units
for a range of n, from which mapping units for a range of n/α are used in the regression
analysis, (c) the cost of the shift registers and mod-α counters for a range of n and α, and
(d) the extrapolated cost of bit-slice configurable decoders for a range of n and α and for
the mapping units presented. Note that (1) we do not compare the bit-slice configurable
decoders with a 1-Hot or a LUT, due to the difference in their capabilities and the situations
in which a bit-slice configurable decoder would be employed, and (2) due to the complexity
of a regression analysis required for the bit-slice configurable decoder, we only present the
derived area.
8.2.1 Integral Decoders
TABLE 8.6: Integral decoder delays [ns]
n 1-Hot LUT CDF CDR Univ. F-Univ.
4 0.16 0.86 1.23 1.17 0.85 0.79
8 0.26 1.59 1.67 1.64 1.78 1.65
16 0.34 3.51 2.95 2.91 3.67 3.18
32 0.79 3.52 4.14 4.22 4.76 4.01
64 1.16 2.39 3.18 2.97 5.71 4.47
128 1.44 5.2 5.02 5.57 9.08 6.67
256 1.85 12.74 8.23 - - 4.16
Table 8.6 and Figure 8.4 illustrate the delay of the configurable decoders as compared
to the 1-hot decoder and the LUT. Note that the implementations with larger LUTs (the
configurable decoders with RMUs and the LUT) have larger delays; this is most likely an
effect of the implementation of the LUT as a sequential circuit. The discontinuities, primarily
at n = 64, are likely due to a technology dependent factor such as fan-in/fan-out.
FIGURE 8.4: Integral decoder delays [ns]
Table 8.7 and Figure 8.5 illustrate the area results compared against a λ × n LUT and the
log n to n 1-hot decoder. As demonstrated, the configurable decoders with fixed mapping
units perform very well against the pure LUT-based implementation and, out of all the
configurable decoders, come closest to the area of the 1-hot decoder. Interestingly, the
F-Univ. has a lower area than the CDF; this arises from the value of z being O(n) in the
CDF, as the corresponding size of the LUT is large. Additionally, we can note that the CDR
begins to outperform the LUT for n = 128; however, the Univ. performs worse than the
LUT. This is because the universal reconfigurable decoder is only marginally asymptotically
better than the LUT, and the constants are quite large. Hence, with the range of data
available, the point at which the Univ. becomes better than the LUT is not visible in
Figure 8.5.
TABLE 8.7: Integral decoder areas [µm²]
n 1-Hot LUT CDF CDR Univ. F-Univ.
4 15 387 328 583 1237 486
8 43 1261 1021 1524 2842 801
16 83 3557 2371 3366 5238 1213
32 180 9030 5624 11503 20985 3390
64 319 23400 13126 24833 40678 5565
128 596 60794 30547 54085 107023 11204
256 1142 152427 71551 - - 21083
As stated in Section 8.1, a limitation of this simulation was the lack of a memory
generator. We made the following assumption regarding the required area of an SRAM cell in
order to predict the trends for a more realistic memory element. In the technology library
files used, the flip-flops used for memory cells had a cell area of 27.00 µm², while a standard
CMOS inverter had a cell area of 2.00 µm². Knowing that a standard SRAM cell is composed
of six transistors, and that a standard CMOS inverter is two transistors, we divided
the area of the sequential elements in all memory blocks in all designs by a factor of 10, that
is, we assumed that all sequential elements took up 2.70 µm². Figure 8.6 illustrates this
recalculated area for all designs tested. As this figure demonstrates, even with a reduction
in the cost of sequential elements to a fairly low area, the configurable decoders (with the
exception of the Univ.) outperform the LUT. In addition, this brings the area required by
the configurable decoders closer to that of the 1-hot decoder.
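The area adjustment described above amounts to a one-line formula. In the Python sketch below, the number of flip-flops in a design is a hypothetical input (the thesis reports only total areas), and the function name `adjusted_area` is an assumption; the cell areas are the ones quoted in the text.

```python
FLIP_FLOP_AREA = 27.00    # um^2 per flip-flop cell in the technology library
SCALE = 10                # reduction factor applied to sequential elements


def adjusted_area(total_area, num_flip_flops):
    """Total area after shrinking each memory flip-flop from 27.00 to 2.70 um^2.

    total_area     -- synthesized area of the whole design, in um^2
    num_flip_flops -- number of flip-flops used as memory cells (hypothetical)
    """
    sequential = num_flip_flops * FLIP_FLOP_AREA
    if sequential > total_area:
        raise ValueError("flip-flop area exceeds total area")
    return total_area - sequential + sequential / SCALE
```

For instance, a hypothetical 1000 µm² design containing 20 memory flip-flops would be recalculated as 1000 − 540 + 54 = 514 µm².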
TABLE 8.8: Integral decoder power consumptions [mW]
n 1-Hot LUT CDF F-Univ. CDR Univ.
4 2.178×10⁻⁵ 1.11×10⁻³ 9.340×10⁻⁴ 1.320×10⁻³ 1.686×10⁻³ 3.645×10⁻³
8 6.478×10⁻⁵ 3.61×10⁻³ 2.867×10⁻³ 2.267×10⁻³ 4.414×10⁻³ 8.915×10⁻³
16 1.262×10⁻⁴ 1.167×10⁻² 6.840×10⁻³ 4.086×10⁻³ 1.850×10⁻² 3.380×10⁻²
32 2.177×10⁻⁴ 7.14×10⁻² 3.800×10⁻² 8.689×10⁻³ 0.1407 0.1771
64 3.634×10⁻⁴ 0.1872 7.660×10⁻² 1.740×10⁻² 0.3232 1.3642
128 5.502×10⁻⁴ 4.2476 0.5013 6.830×10⁻² 4.3532 16.5308
256 1.091×10⁻³ 55.1795 7.7395 6.190×10⁻² - -
FIGURE 8.5: Integral decoder areas [µm²]
Finally, Table 8.8 and Figure 8.7 show the power consumption of the configurable
decoders as compared to a LUT and a 1-hot decoder. The simulation provided an estimate of
the internal cell power, leakage power, and net power of each design; the data illustrated in
Table 8.8 and Figure 8.7 is the sum of those values. Note that a LUT consumes significantly
more power for large values of n when compared to our configurable decoders; however, the
Univ. appeared to have a higher rate of power consumption as compared to the LUT for
n = 128.
8.2.2 Bit-slice Decoders
The size of the LUT in a configurable decoder is not affected by the value of α in a bit-slice
implementation; that is, for the λ’s shown in Table 8.2, the values of x and z are as shown
in Table 8.3. The costs of the LUT for the configurable decoders are given in Table 8.9.
FIGURE 8.6: Integral decoder recalculated areas [µm²]
The mapping units used in the bit-slice configurable decoder include an FMU, an FMU
with universal decoder parameters (F-Univ.), an RMU, and a universal RMU (Univ.).
Table 8.10 and Figure 8.8 present the results for the area of the different mapping units. Note
that the reconfigurable mapping units include the cost of their configuration LUTs. From
this data, functions were extrapolated that allowed a prediction of the size of a mapping unit
given a value of n/α (the regression analysis follows the method explained in Section 8.3).
Table 8.11 illustrates the area for the mod-α counter and shift registers for the bit-slice
configurable decoders for a range of n and α. Note that (a) the large cost for these elements
primarily comes from the output shift register (with n registers) and the flip-flops used by
the technology library for memory elements and (b) the lack of data for certain points is
typically a result of values of z/α < 1.
Tables 8.12–8.15 and Figures 8.9–8.12 illustrate the results of the simulation for the
bit-slice configurable decoders; that is, the combined area of the LUTs from Table 8.9, the
mod-α counter and shift registers from Table 8.11, and the mapping unit area derived from
the data of Table 8.10. If we compare the area of a bit-slice CDF with an integral CDF,
we can note that for n = 256, α = 32, a bit-slice CDF has an area of 78393 µm², while an
integral CDF has an area of 71551 µm². This is because the LUT is especially large for the
CDF (19×113, from Table 8.2), and the bit-slice CDF imposes an additional 256-bit register.
Combined with the flip-flop area penalty of our technology, this results in a construction that
is actually more costly than the integral CDF. However, the bit-slice universal decoder, for
n = 128, α = 32, has an area of 21015 µm² while an integral universal decoder for the same
value of n has an area of 54085 µm². This is again because of the size of the preceding LUT;
here the size of the LUT is only 19 × 8 (see Table 8.5). However, we expect the bit-slice
configurable decoder to be advantageous where source strings are reduced substantially for a
given application. For these cases, the size of the LUT and mapping units becomes smaller.
FIGURE 8.7: Integral decoder power consumption [mW]
TABLE 8.9: LUT areas [µm²] in a bit-slice configurable decoder
n CDF CDR F-Univ. Univ.
4 296 296 402 402
8 957 957 657 657
16 2243 2231 925 925
32 5144 5152 1758 1758
64 12166 12171 2297 2305
128 28627 28752 3960 4161
256 67711 - 4907 -
TABLE 8.10: Mapping unit areas [µm²]
n FMU RMU Univ. RMU FMU (F-Univ.)
4 32 287 835 84
8 64 567 2185 144
16 128 1135 4333 288
32 480 6351 19223 1632
64 960 12662 38373 3268
128 1920 25333 108636 7244
256 3840 - - 16176
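As a consistency check on the data, the integral CDF areas of Table 8.7 equal the sum of the CDF LUT area (Table 8.9) and the FMU area (Table 8.10) for every n. The Python sketch below performs that check on values copied from the three tables; the names `CDF_DATA` and `mismatches` are illustrative only.

```python
# n: (CDF LUT area, Table 8.9; FMU area, Table 8.10; integral CDF area, Table 8.7)
CDF_DATA = {
    4: (296, 32, 328),
    8: (957, 64, 1021),
    16: (2243, 128, 2371),
    32: (5144, 480, 5624),
    64: (12166, 960, 13126),
    128: (28627, 1920, 30547),
    256: (67711, 3840, 71551),
}


def mismatches(data):
    """Return the values of n for which LUT + mapping unit != integral area."""
    return [n for n, (lut, mapping_unit, total) in data.items()
            if lut + mapping_unit != total]


if __name__ == "__main__":
    bad = mismatches(CDF_DATA)
    print("all rows consistent" if not bad else f"inconsistent rows: {bad}")
```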
FIGURE 8.8: Mapping unit area (µm²) for the FMU, FMU (F-Univ.), RMU, and Univ. RMU
TABLE 8.11: Area (µm²) for mod-α counter and shift registers, $2 \le \log n \le 8$, $1 \le \log\alpha < 6$
n / α 2 4 8 16 32
4 424 - - - -
8 817 756 - - -
16 1427 1342 1424 - -
32 2639 2508 2566 - -
64 4980 4844 4856 5022 -
128 9654 9364 9145 8989 8812
256 18833 18234 18023 18065 17846
TABLE 8.12: Bit-slice CDF area (µm²)
n / α 2 4 8 16 32
4 737 - - - -
8 1815 1754 - - -
16 3762 3677 3759 - -
32 7983 7852 7910 - -
64 17572 17436 17448 17614 -
128 39177 38887 38668 38512 34939
256 88414 87815 83188 87646 78393
FIGURE 8.9: Bit-slice CDF area [µm²] as a function of word size (n) and α
TABLE 8.13: Bit-slice CDR area (µm²)
n / α 2 4 8 16 32
4 438 - - - -
8 1728 1667 - - -
16 4343 4258 4340 - -
32 10076 9945 10003 - -
64 22729 22593 22605 22771 -
128 50639 50349 50130 49974 46401
FIGURE 8.10: Bit-slice CDR area [µm²] as a function of word size (n) and α
TABLE 8.14: Bit-slice Univ. area (µm²)
n / α 2 4 8 16 32
4 - - - - -
8 2415 2354 - - -
16 4791 4706 4788 - -
32 10133 10002 10060 - -
64 22537 22401 22413 22579 -
128 55044 54754 54535 54379 50806
The lack of data for n = 4 arose from a discontinuity in the regression function for the MU.
FIGURE 8.11: Bit-slice Univ. area [µm²] as a function of word size (n) and α
TABLE 8.15: Bit-slice F-Univ. area (µm²)
n / α 2 4 8 16 32
4 - - - - -
8 1497 1436 - - -
16 2472 2387 - - -
32 4926 4795 4853 - -
64 8694 8558 8570 - -
128 16897 16607 16388 16232 -
256 31036 30437 28810 30268 21015
The lack of data for n = 4 arose from a discontinuity in the regression function for the MU.
FIGURE 8.12: Bit-slice F-Univ. area [µm²] as a function of word size (n) and α
8.3 Regression Analysis Results
In order to determine the values of the constants hidden by the asymptotic notation of the
gate cost of the configurable decoders, we performed a nonlinear least squares regression
analysis using the Trust Algorithm in Matlab [20]. Let n be a value used for the simulation
(for example 4, 8, 16, . . .). Let D(n) be the datum corresponding to n; for example, if we are
considering the area of a CDF, then from Table 8.7, D(4) = 328. The aim is to use the data
points available to generate a function f(n) that fits the data. For this purpose, the regression
has to be supplied a set of functions $f_1(n), f_2(n), \ldots, f_k(n)$ such that
$$f(n) = \sum_{i=1}^{k} a_i f_i(n)$$
would be a likely representation of the function we seek. Moreover, the value of k should be
somewhat smaller than the number of data points to get a reasonably good fit. In order
to determine whether the function is a good fit, the regression tool minimizes the quantity
$$\sum_{n} \left(D(n) - f(n)\right)^2.$$
For all our regression analysis, we used at most k = 4 or 5 functions, as the number of data
points was around 8.
The various modules constructed in this thesis have complex cost function
representations. For example, the number of gates in a $2^z \times m$ LUT (see Section 3.2.4) has the
form $a_1 m 2^z + a_2 z 2^z + a_3 m + a_4 2^z + a_5 z + a_6$, where $a_1, \ldots, a_6$ are constants. Translated to
a $\frac{\epsilon \log^2 n}{\log\log n} \times n$ LUT used in the pure LUT-based solution (see Section 8.1) this results in
many different functions of n. Our analysis in Section 3.2.4 simply accounts for the fastest
growing term and ascertains that the gate cost of this LUT is $\Theta\left(\frac{n \log^2 n}{\log\log n}\right)$. However, other
terms may be significant. For this LUT, we use the functions
$$f_1(n) = \frac{n \log^2 n}{\log\log n},\quad f_2(n) = \log^2 n,\quad f_3(n) = 1,\quad f_4(n) = n,\quad f_5(n) = \frac{\log^2 n}{\log\log n}.$$
Our choice of these 5 functions from among the 8 that make up an analytical formula for
the cost is based on what we believe would be the most significant terms. We always include
the fastest growing term and the constant function. We recognize that very slow growing
functions such as $\log\log n$ are nearly constant over the range of values of n considered.
Therefore, we select only one among a set of functions such as $n$, $n \log\log n$, $\frac{n}{\log\log n}$, as we
do not expect a significantly different nonconstant contribution from these. Note that for
a 1-hot decoder, as n becomes large, many of the AND gates become redundant and are
eliminated. This technique (known as predecoding, see [18, 26]) reduces the number of gates
by a constant factor, resulting in a small coefficient for the asymptotic gate cost function
$n \log n$.
TABLE 8.16: Functions used in regression analysis for each module
Module $f_1(n)$ $f_2(n)$ $f_3(n)$ $f_4(n)$ $f_5(n)$
1-Hot $n \log n$ $1$ - - -
LUT $\frac{n \log^2 n}{\log\log n}$ $\log^2 n$ $1$ $n$ $\frac{\log^2 n}{\log\log n}$
CDF $n \log n$ $n^{1-\epsilon} \log^2 n$ $\log n$ $n$ $1$
CDR $n \log n$ $n^{1-\epsilon} \log^2 n$ $\log n$ $n$ $1$
Univ. $\frac{n \log^2 n}{\log\log n}$ $\frac{\log^4 n}{(\log\log n)^3}$ $\log\log n$ $1$ -
F-Univ. $\frac{n \log^2 n}{\log\log n}$ $\frac{\log^4 n}{(\log\log n)^3}$ $\log\log n$ $1$ $\log^2 n$
$\epsilon$ was approximated to 0.85 for all simulations
Table 8.16 shows the functions $f_1, \ldots, f_k$ (with k up to 5) used for the different modules and
Table 8.17 shows the constants obtained from the regression analysis. This table also shows
the “relative error” (which equals the average value of the residual error $\frac{|D(n) - f(n)|}{f(n)}$ over
all n).
TABLE 8.17: Constants found from regression analysis for each module
Module a1 a2 a3 a4 a5 Error
1-Hot 0.543 61.8 - - - 0.263
LUT 31.1 -26.1 291 -67.5 59.6 0.077
CDF 51.8 55.9 279 -170 13.86 0.0098
CDR 27.9 180 -1919 161 2471 0.086
Univ. 46.04 3.02 325 170 - 0.097
F-Univ 2.23 78.61 -100 -391 -6.67 0.0817
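The 1-Hot row of Table 8.17 can be approximated directly from the 1-Hot column of Table 8.7. The Python sketch below is an illustration, not the Matlab procedure used in the thesis: because the model $f(n) = a_1 n \log n + a_2$ is linear in its coefficients, ordinary least squares in closed form is used here instead. Base-2 logarithms are assumed, and the function names are hypothetical.

```python
import math

N_VALUES = [4, 8, 16, 32, 64, 128, 256]
ONE_HOT_AREA = [15, 43, 83, 180, 319, 596, 1142]   # Table 8.7, 1-Hot column


def fit_two_terms(xs, ys):
    """Least-squares fit of y ~ a1 * x + a2 (closed-form normal equations)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    return a1, mean_y - a1 * mean_x


def relative_error(xs, ys, a1, a2):
    """Average of |D(n) - f(n)| / f(n) over all data points."""
    return sum(abs(y - (a1 * x + a2)) / (a1 * x + a2)
               for x, y in zip(xs, ys)) / len(xs)


# Basis functions for the 1-hot decoder: f1(n) = n log n and f2(n) = 1
BASIS = [n * math.log2(n) for n in N_VALUES]
A1, A2 = fit_two_terms(BASIS, ONE_HOT_AREA)
ERR = relative_error(BASIS, ONE_HOT_AREA, A1, A2)

if __name__ == "__main__":
    print(f"a1 = {A1:.3f}, a2 = {A2:.1f}, relative error = {ERR:.3f}")
```

Modules with more than two basis functions would need a general linear solver rather than this two-term closed form.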
FIGURE 8.13: Integral decoder expected area (µm²) under regression analysis, for the 1-Hot, LUT, CDF, CDR, Univ., and F-Univ. decoders
The functions outlined above are illustrated in Figure 8.13. While most of the trends are
as expected, there are some interesting cases to note. First, at around n = 8192, the CDR
begins to outperform the CDF. This clearly should not be the case, as the CDR contains all
elements of the CDF as well as an additional configuration LUT. Additionally, the functions
derived fail to reflect the asymptotic cost of the Univ. decoder, since for very large
values of n the Univ. decoder should outperform the LUT. Regardless of these inconsistencies, the
functions derived provide an indication as to the general trend of the configurable decoders;
as expected, our configurable decoders (with the exception of the Univ. decoder, as noted)
consistently outperform the pure LUT-based configurable decoder.
Chapter 9
Parallel Configurable Decoder
In this chapter we introduce a variant on the configurable decoder, a parallel configurable
decoder (CD(x,z,y,n,α,P )), that utilizes a merge operation (such as an associative Boolean
operation) to combine the outputs of two or more configurable decoders. The parameter P
denotes the number of configurable decoders connected in parallel in CD(x,z,y,n,α,P ). This
parallel configurable decoder is an interesting case that can produce sets of subsets of Zn
not easily produced by the configurable decoders previously presented.
9.1 An Illustrative Example
We begin our discussion of the parallel configurable decoder through the set of 1-hot subsets,
which is not easily produced by the configurable decoders of Chapter 6 but can be produced
rather easily using a parallel variant. We first consider two sets S0, S1 of subsets of Zn.
Assume an integer m that divides n so that n = km for some integer k ≥ 1. Then
$Z_n = \{0, 1, \ldots, m-1, m, \ldots, 2m-1, \ldots, im, \ldots, (i+1)m-1, \ldots, (k-1)m, \ldots, km-1\}$. For
$0 \le i < m$ and $0 \le j < \frac{n}{m}$, let
$$q_{i,0} = \{i + \ell m : 0 \le \ell < k\}$$
and let
$$q_{j,1} = \{jm + \ell : 0 \le \ell < m\}.$$
Clearly, $q_{i,0}$ and $q_{j,1}$ are subsets of $Z_n$. Table 9.1 illustrates the subsets for n = 20 and m = 4.
Let $S_0 = \{q_{i,0} : 0 \le i < m\}$ and $S_1 = \{q_{j,1} : 0 \le j < \frac{n}{m}\}$. It is easy to verify that $S_0$ and
$S_1$ induce partitions $\pi_0 = \{q_{i,0} : 0 \le i < m\}$ and $\pi_1 = \{q_{j,1} : 0 \le j < \frac{n}{m}\}$. So, for $z = m = \frac{n}{m}$,
two z-partitions of $Z_n$ can generate these subsets in a configurable decoder of the form shown
in Chapter 6 (see Theorem 6.3, page 79). Put differently, each subset of $S_0$ and $S_1$ can be
independently generated by different configurable decoders using just one partition each.
TABLE 9.1: Subsets $q_{i,0}$ and $q_{j,1}$ for n = 20 and m = 4
$q_{i,0}$ n-bit string
q0,0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
q1,0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0
q2,0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0
q3,0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
$q_{j,1}$ n-bit string
q0,1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
q1,1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
q2,1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
q3,1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
q4,1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Lemma 9.1 For all $0 \le i < m$ and $0 \le j < \frac{n}{m}$,
$$q_{i,0} \cap q_{j,1} = \{jm + i\}.$$
Proof: Consider $0 \le x < n$. If $x \in q_{i,0} \cap q_{j,1}$, then there exist integers $0 \le \ell < \frac{n}{m}$ and
$0 \le \ell' < m$ such that $x = i + \ell m = jm + \ell'$. This implies that $(\ell - j)m + (i - \ell') = 0$.
Without loss of generality, let $\ell \ge j$. Clearly, $i - \ell' > -m$. Then, for $(\ell - j)m + (i - \ell')$ to
be 0, $\ell = j$ and $i = \ell'$. So, $x = i + jm$. Also, $i = x \bmod m$ and $j = \lfloor x/m \rfloor$ imply that i
and j are unique for a given x. Thus, $q_{i,0} \cap q_{j,1} = \{i + jm\}$.
Corollary 9.1 For each $x \in Z_n$, there exist unique values $0 \le i < m$ and $0 \le j < \frac{n}{m}$ such
that $x \in q_{i,0} \cap q_{j,1}$.
Proof: Since $i = x \bmod m$ and $j = \lfloor x/m \rfloor$ are unique for a given x by Lemma 9.1,
$x \in q_{i,0} \cap q_{j,1}$.
As a direct consequence of Lemma 9.1 and Corollary 9.1, we have the following result.
Theorem 9.1 $S = \{q_{i,0} \cap q_{j,1} : 0 \le i < m \text{ and } 0 \le j < \frac{n}{m}\}$ is the set of 1-hot subsets.
FIGURE 9.1: A parallel configurable decoder that generates the 1-hot subset of Zn
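Lemma 9.1 and Theorem 9.1 can be checked by brute force for small n. The Python sketch below is illustrative only (function names are assumptions); it builds $q_{i,0}$ as every m-th element starting at i and $q_{j,1}$ as the j-th block of m consecutive elements, then intersects them, which models a merge unit that performs a bitwise AND.

```python
def q0(i, n, m):
    """q_{i,0}: every m-th element of Z_n, starting at i."""
    return {i + l * m for l in range(n // m)}


def q1(j, n, m):
    """q_{j,1}: the j-th block of m consecutive elements of Z_n."""
    return {j * m + l for l in range(m)}


def to_bits(subset, n):
    """n-bit string with element 0 as the rightmost bit, as in Table 9.1."""
    return "".join("1" if k in subset else "0" for k in reversed(range(n)))


def one_hot_subsets(n, m):
    """All intersections q_{i,0} & q_{j,1}, indexed by (i, j)."""
    if n % m:
        raise ValueError("m must divide n")
    return {(i, j): q0(i, n, m) & q1(j, n, m)
            for i in range(m) for j in range(n // m)}


if __name__ == "__main__":
    for (i, j), s in sorted(one_hot_subsets(20, 4).items()):
        print(i, j, to_bits(s, 20))
```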
A simple method to generate the 1-hot subsets is illustrated in Figure 9.1. If $m = \sqrt{n}$,
then both $m$ and $\frac{n}{m}$ form feasible values for the input for a mapping unit; that is,
$z = m = \frac{n}{m}$. We do not need a y input as only 1 partition is used (a y input would allow
additional subsets to be generated from additional partitions, however). Thus, for the
configurable decoders, $x_0 = \log m = \log \sqrt{n} = \log \frac{n}{m} = x_1$ and
$z_0 = m = \sqrt{n} = \frac{n}{m} = z_1$ and $y_0 = y_1 = 0$.
Clearly, $n_0 = n_1 = n$. Both configurable decoders use a single partition, hardwired into their
respective mapping units (see Figure 9.2).
The cost of each configurable decoder is the cost of a $\sqrt{n} \times \sqrt{n}$ LUT with a
CD($\frac{1}{2}\log n$, $\sqrt{n}$, 0, n, 0, 1), which is Θ(n). Clearly, increasing $y_0$ and $y_1$ to any
constant will increase the number of subsets produced without altering the Θ(n) gate cost. It is
easy to verify that two smaller $\log \sqrt{n}$ to $\sqrt{n}$ 1-hot decoders arranged as shown in this
example will also produce a larger log n to n 1-hot decoder with O(n) cost. However, our approach
offers room for additional partitions and hence additional subsets (within the same cost) and
considerably higher flexibility.
[Figure: the hardwired partition for S0 and the hardwired partition for S1.]
FIGURE 9.2: Hardwired partitions in the parallel configurable decoder generating the 1-hot
subset of Zn
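The arrangement of Figure 9.1 can be modeled in software with n-bit masks, where bit k of a mask stands for element k of Zn. The sketch below is an illustration only: it assumes that the low-order half of the address selects the block of S0 and the high-order half selects the block of S1, and the helper names are hypothetical.

```python
import math


def s0_mask(i, n, m):
    """n-bit mask of q_{i,0}: bits i, i + m, i + 2m, ..."""
    return sum(1 << (i + l * m) for l in range(n // m))


def s1_mask(j, n, m):
    """n-bit mask of q_{j,1}: the m consecutive bits starting at j*m."""
    return ((1 << m) - 1) << (j * m)


def parallel_one_hot(x, n):
    """Model of Figure 9.1 for m = sqrt(n): AND the outputs of the two decoders."""
    m = math.isqrt(n)
    if m * m != n:
        raise ValueError("this sketch assumes n is a perfect square")
    a0, a1 = x % m, x // m  # assumed split of the log n address bits
    return s0_mask(a0, n, m) & s1_mask(a1, n, m)


n = 16
ok = all(parallel_one_hot(x, n) == (1 << x) for x in range(n))
print(ok)  # prints: True
```

Each mask function plays the role of one hardwired partition, and the bitwise AND plays the role of the intersection that combines Q0 and Q1.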
9.2 General Observations
In general, a P-element CD(x,z,y,n,α,P) (see Figure 9.3) uses P configurable decoders, $CD_0$,
$CD_1$, ..., $CD_{P-1}$ in parallel, where $CD_i$ is a CD($x_i$,$z_i$,$y_i$,$n_i$,$\alpha_i$,$i$). Two CDs, say $CD_i$ and
$CD_j$, may use the same input bit for their LUT; that is, the set of $x_i$ bits to $CD_i$ and the
set of $x_j$ bits to $CD_j$ could have common bits. Therefore, $\sum_{i=0}^{P-1} x_i \ge x$, as each input
bit is assumed to be used at least once. We also have $x_i \le x$. Similarly, $y_i \le y$,
$\sum_{i=0}^{P-1} y_i \ge y$, $n_i \le n$ and $\sum_{i=0}^{P-1} n_i \ge n$.
The merge unit could perform functions ranging from set operations (where $n_i = n$, for
all $i$) to simply rearranging bits (when $\sum_{i=0}^{P-1} n_i = n$). The (optional) control allows it
to select from a range of options.
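The two extremes of merge unit behavior described above can be sketched as follows. This is an illustrative model only: outputs are represented as integer bit masks, and placing Q0 in the low-order bits of the rearranged output is an assumption, as are the function names.

```python
from functools import reduce
from operator import and_, or_, xor


def merge_setop(outputs, op):
    """Set-operation mode: every output is an n-bit mask, folded with a bitwise op."""
    return reduce(op, outputs)


def merge_concat(outputs, widths):
    """Rearranging mode: output i is widths[i] bits wide and the widths sum to n."""
    result, shift = 0, 0
    for q, w in zip(outputs, widths):
        if not 0 <= q < (1 << w):
            raise ValueError("output does not fit in its declared width")
        result |= q << shift  # assumed layout: Q0 occupies the low-order bits
        shift += w
    return result


print(bin(merge_setop([0b1100, 0b0110], and_)))        # intersection
print(bin(merge_setop([0b1100, 0b0110], or_)))         # union
print(bin(merge_setop([0b1100, 0b0110], xor)))         # Ex-OR
print(bin(merge_concat([0b01, 0b11, 0b0000], [2, 2, 4])))
```

In the set-operation mode each decoder contributes a full n-bit subset, whereas in the rearranging mode each decoder is responsible for its own slice of the output.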
[Figure: the inputs A (x bits) and B (y bits) are distributed to P configurable decoders CD0, ...,
CDi, ..., CDP−1, where CDi is CD(x,z,y,n,α,i), receives the xi bits Ai and the yi bits Bi, and
produces the ni-bit output Qi. A merge unit with an (optional) control input combines
Q0, ..., QP−1 into the n-bit output Q.]
FIGURE 9.3: A parallel configurable decoder CD(x,z,y,n,α,P )
Let $CD_i$ have a delay of $D_i$ and a gate cost of $G_i$. If $D_M$ and $G_M$ are the delay and
gate cost of the merge unit, then the delay $D$ and gate cost $G$ of the parallel configurable
decoder CD(x,z,y,n,α,P) are
$$D = \max_i(D_i) + D_M + O(\log P)$$
$$G = \sum_{i=0}^{P-1} G_i + G_M + O(P(x + y)).$$
If the merge unit uses simple associative set operations (such as Union, Intersection, Ex-OR)
that correspond to bit-wise logical operations, then $D_M = O(\log P)$ and $G_M = O(nP)$.
Since $x + y \le n$, the overall delay and cost for this structure are
$$D = \max_i(D_i) + O(\log P)$$
$$G = \sum_{i=0}^{P-1} G_i + O(nP).$$
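The simplification can be traced term by term. The following is a sketch that uses only the bounds stated above for the merge unit and the assumption x + y ≤ n; no constants are implied.

```latex
\begin{align*}
D &= \max_i(D_i) + D_M + O(\log P)
   = \max_i(D_i) + O(\log P) + O(\log P)
   = \max_i(D_i) + O(\log P), \\
G &= \sum_{i=0}^{P-1} G_i + G_M + O(P(x+y))
   = \sum_{i=0}^{P-1} G_i + O(nP) + O(P(x+y)) \\
  &= \sum_{i=0}^{P-1} G_i + O(nP)
   \qquad \text{since } P(x+y) \le Pn .
\end{align*}
```

The merge unit therefore dominates the overhead terms: the input distribution cost O(P(x + y)) is absorbed into the O(nP) cost of the bit-wise merge.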
Clearly, each $CD_i$ can produce its own independent set of $n_i$-bit outputs. The manner in
which these outputs combine depends on the merge unit. For example, let each $CD_i$ produce
an n-bit output (that is, a subset of $Z_n$) and let $\mathcal{S}_i$ be the independent set of subsets
produced by $CD_i$. Let the merge operation be $\circ$, an associative set operation with identity $S_0$
(that is, for any set $S$, $S \circ S_0 = S_0 \circ S = S$; Intersection, Union, and Ex-OR represent such
an operation with $Z_n$, $\emptyset$, and $\emptyset$, respectively, as the identities). If each $CD_i$ also produces
$S_0$, then the whole configurable decoder CD(x,z,y,n,α,P) produces an independent set that
includes $\bigcup_{i=0}^{P-1} \mathcal{S}_i$.
For example, an element $S \in \mathcal{S}_0$ can be produced as
$S \circ \underbrace{S_0 \circ S_0 \circ \cdots \circ S_0}_{P-1 \text{ times}}$. Clearly, the
CD(x,z,y,n,α,P) produces many more dependent subsets.
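The role of the identity element can be illustrated by enumerating every output of a small parallel decoder. The sketch below is a toy model, not the construction of the thesis: the families of masks are made-up examples, subsets of Zn are represented as n-bit integers, and the function name is hypothetical.

```python
from functools import reduce
from itertools import product
from operator import and_, or_


def producible(families, op, identity):
    """All merged outputs when each CD_i emits a member of its family or the identity."""
    choices = [set(family) | {identity} for family in families]
    return {reduce(op, combo) for combo in product(*choices)}


n = 4
z_n = (1 << n) - 1                            # all-ones mask: identity for Intersection
families = [{0b0011, 0b0101}, {0b1100}]       # example sets produced by CD_0 and CD_1
union_of_families = set().union(*families)

out_and = producible(families, and_, z_n)     # merge by Intersection
out_or = producible(families, or_, 0)         # merge by Union (identity is the empty set)

print(union_of_families <= out_and, union_of_families <= out_or)  # prints: True True
print(sorted(out_and - union_of_families))    # extra subsets reachable by merging
```

Every member of every family survives the merge because the other decoders can emit the identity, and the remaining outputs are the additional subsets obtained by combining members of different families.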
Chapter 10
Conclusions
In this thesis, we have addressed the pin limitation constraint in IC chips (particularly
FPGAs) by providing a fast, flexible, and scalable configurable decoder that bridges the gap
between the inexpensive, but inflexible, fixed decoders and the flexible, but expensive, pure
LUT-based configurable decoders. As demonstrated in Chapter 6, for a fixed gate cost of $G$
and when $\frac{G}{n}$ is polylogarithmically bounded in $n$, we outperform the LUT by producing an
$\Omega\left(\frac{\log n}{\log \log n}\right)$ factor more independent subsets than the pure LUT-based
configurable decoder and significantly more dependent subsets. If $\frac{G}{n}$ is not
polylogarithmically bounded in $n$, we still produce the same order of independent subsets as the
pure LUT-based configurable decoder, but continue to provide significantly more dependent
subsets not producible by the LUT solution. The contributions of this work can be summarized as
follows.
We demonstrated an interesting fixed decoder (called the mapping unit) that uses multicasts
as a way of expanding information from z bits to n bits. We formally represented these
multicasts as ordered partitions of an n-set. Bounds on its capabilities were derived, including
the minimum number of independent subsets producible from a mapping unit decoder.
We presented a method to produce the maximum possible number of dependent subsets.
Several realizations of the mapping unit were presented (fixed, reconfigurable, bit-slice)
that offered various trade-offs between speed, cost, and number of independent subsets.
The various mapping unit realizations were melded with the flexibility afforded by a LUT to
generate a range of configurable decoders. The functionality of the mapping units allows the
cost of the LUTs to be lowered, yielding a solution that has a low gate cost and low delay, but
a high degree of flexibility.
We applied our results to subsets generated by some well-known classes of communication
patterns (binary tree based reduction, ASCEND/DESCEND communications, and 1-hot
subsets). We presented extensive simulation results for our designs. The simulation data was
used to predict the constants hidden by the asymptotic notation and the future cost trends
for large values of n. These trends suggest that our method will continue to outperform the
pure LUT-based solution. We also introduced a generalization of the configurable decoder,
a parallel configurable decoder, and made some observations for it. We show its utility by
demonstrating how certain sets of subsets that are difficult for our original design can be
effectively produced on the generalization.
We now highlight some other ideas and variants that were explored in the course of this
research. While these designs did not result in cost-effective solutions, we present them here
as a means to guide future research in this area.
10.1 Other Configurable Decoder Variants
The other variants explored during the course of this research were (1) a serial configurable
decoder and (2) a recursive bit-slice configurable decoder. These variants were discarded as
they did not provide any benefit over the designs included in Chapter 6. We provide some
observations about their limitations here.
A Serial Configurable Decoder: In a serial configurable decoder, shown in Figure 10.1,
two or more mapping units are cascaded to construct the subsets of Zn (here we will restrict
ourselves to examining only two mapping units; extrapolating these results to more than
two mapping units would not be difficult). By the definition of a mapping unit decoder,
$x < z_0 < z_1 < n$. Note that the independent subsets produced by the second mapping
unit are dependent on what is provided to it, that is, the range of values of $z_1$, which is
in turn dependent on the number of independent subsets produced by the first mapping
unit. Thus, since the first mapping unit in Figure 10.1 can produce $2^{y_0} \lfloor \log z_0 \rfloor$ independent
subsets, where $z_0$ is a relatively small value, a single LUT can usually subsume both the
LUT and the first mapping unit in the serial variant, and be within the gate cost of the
second mapping unit and provide more independent subsets.
[Figure: a LUT with the x-bit input A feeds z0 bits to MU(z0,y0,z1,α), which has the y0-bit
input B0; its z1-bit output feeds MU(z1,y1,n,α), which has the y1-bit input B1 and produces
the n-bit output Q.]
FIGURE 10.1: A serial configurable decoder variant
[Figure: nested bit-slice stages driven by clock 0, clock 1, ..., clock k − 1 reduce the z-bit input
through z/α0, z/α1, ..., z/αk−1 bits to the innermost mapping unit MU(z/αk, y, n/αk, αk), whose
output is expanded back through n/αk−1, ..., n/α1, n/α0 bits to n bits.]
FIGURE 10.2: A conceptual view of a recursive bit-slice configurable decoder. Note that
αi = α0α1 . . . αi−1.
A Recursive Bit-Slice Configurable Decoder: In a recursive bit-slice configurable
decoder, illustrated in Figure 10.2, two or more bit-slice mapping units are nested within
each other, such that an input to the first bit-slice configurable decoder is broken down by a
factor of α0, then broken down by a factor of α1, and so on, until it reaches the lowest level
mapping unit. It is then reconstructed to an n-bit output. However, this is not a worthwhile
construction: the large number of shift registers and multiple clocks make it complex, and the
linear (with α) reduction of cost does not provide any benefit that a single bit-slice decoder
does not.
10.2 Future Directions
While this thesis has demonstrated a measurable performance gain over pure LUT-based
configurable decoders, there is a rich variety of future directions that can be explored in this
area.
Mapping Units: The mapping units presented in this thesis are one manner of expanding
the output of a smaller LUT to the n-bit output. In fact, any inexpensive z to n decoder will
do. Are there other approaches to constructing a decoder that acts as a mapping unit? In
addition to this, our realizations of the mapping unit represent several ways of constructing
a multicasting module; are there other ways of realizing this operation?
Parallel Configurable Decoders: The initial investigation of the parallel configurable
decoder (Chapter 9) has shown promise. Future directions in this area include a deeper
exploration of the number of and the types of subsets produced by any merge unit, by a
merge unit implementing simple set operations, and a range of different types of merge units
for different operations.
Applications: The configurable decoder, while presented for a reconfigurable system, is a
more general technique for alleviating the pin limitation problem. What other applications
could benefit from this work? We identify two such applications below.
Sensor Networks: A configurable decoder (and a reverse encoder) can serve to reduce
the number of bits transmitted between sensor nodes without requiring a drastic redesign of
the sensor nodes.
External Power Controllers: The configurable decoder works to select a subset.
This can be used by a smart agent (perhaps a chip) that observes data from a collection
of chips and issues commands to selectively power down portions of these chips. A sharply
focused selection (such as that afforded by the configurable decoder) could be useful here.
125
Bibliography
[1] A. Ali and R. Vaidyanathan, “Exact Bounds on Running ASCEND/DESCEND and
FAN-IN Algorithms on Synchronous Multiple Bus Networks,” IEEE Transactions on
Parallel and Distributed Systems, vol. 7, no. 8, pp. 783–790, August 1996.
[2] Atmel Corp., “AT6000 Series Configuration,” Configuration Guide, 1997.
[3] J. Babb, R. Tessier, and A. Agarwal, “Virtual wires: overcoming pin limitations in
FPGA-based logic emulators,” Proceedings of the IEEE Workshop on FPGAs for Cus-
tom Computing Machines, April 1993, pp. 142–151.
[4] K. Bondalapati and V. K. Prasanna, “Reconfigurable Computing Systems,” Proc. of
the IEEE, vol. 90, no. 7, July 2002, pp. 1201–1217.
[5] S. Brown and J. Rose, “FPGA and CPLD Architectures: A Tutorial,” IEEE Design
and Test of Computers, vol. 13, 1996, pp. 42–57.
[6] S. Brown and Z. Vranesic, Fundamentals of Digital Logic with VHDL Design, McGraw-
Hill Companies, Inc., Boston, Massachusetts, 2000.
[7] PKS User Guide, Product Version 5.0, May 2002.
[8] Cadence NC-Verilog Simulator Help, Product Version 5.4, November 2004.
[9] P. Chow, S. Ong Seo, J. Rose, K. Chung, G. Paez-Monzon, and I. Rahardja, “The
Design of an SRAM-Based Field-Prorammable Gate Array - Part I: Architecture,” IEEE
Transactions on VLSI Systems, Vol. 7, No. 2, June 1999, pp. 191-197.
[10] M. D. Ciletti, Modeling, Synthesis, and Rapid Prototyping with the Verilog HDL,
Prentice-Hall, New Jersey, 1999.
[11] H. P. Dharmasena and R. Vaidyanathan, “Lower Bounds on the Loading of Multiple
Bus Networks for Binary Tree Algorithms,” IEEE Transactions on Computers, Vol. 53,
No. 12, December 2004, pp. 1535–1546.
[12] H. M. E. El-Boghdadi, “On Implementing Dynamically Reconfigurable Architecture”,
Ph.D. dissertation, Dept. of Electrical and Computer Eng., Louisiana State University,
2003.
[13] M. Gokhale, P. Graham, E. Johnson, N. Rollins, and M. J. Wirthlin, “Dynamic Re-
configuration for Management of Radiation-Induced Faults in FPGAs,” Proc. Reconfig-
urable Architectures Workshop, Int. Parallel and Distributed Processing Symp., 2004.
[14] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach,
3rd. Ed., Morgan Kauffman, San Francisco, CA, 2003.
[15] Intel Corporation, Microprocessor Quick Reference Guide,
http://www.intel.com/pressroom/kits/quickreffam.htm, 2003.
[16] J. Ja´Ja´, An Introduction to Parallel Algorithms, Edison Wesley, Reading, MA, 1992.
126
[17] C. L. Liu, Elements of Discrete Mathematics, 2nd. Edition, McGraw-Hill, Inc., New
York, 1985.
[18] R. Lyon and R. Schediwy, “CMOS static memory with a new four-transistor memory
cell,” Proc. Advanced Research in VLSI, March 1987, pp. 111–132.
[19] P. Mal, J. F. Cantin, and F. R. Beyette, “The Circuit Designs of an SRAM Based
Look-Up Table for High Performance FPGA Architecture,” The 2002 45th Midwest
Symposium on Circuits and Systems, vol. 3, August 2002, pp. 227–230.
[20] The MathWorks, “Curve Fitting Toolbox User’s Guide,” Version 1.0,
available at: http://www.mathworks.com/access/helpdesk/help/pdf doc/curvefit/curvefit.pdf
[21] R. Sidhu, S. Wadhwa, A. Mei, and V.K. Prasanna, “A Self-Reconfigurable Gate Array
Architecture,” Intl. Conf. on Field Programmable Logic and Applications, 2000, Springer
Verlag Lecture Notes in Comupter Sc., vol. 1896, pp. 106–120.
[22] R. Vaidyanathan and J. Trahan, Dynamic Reconfiguration: Architectures and Algo-
rithms, New York: Kluwer Academic / Plenum Publishers, 2003.
[23] J. Van Campenhout, H. Van Marck, J. Depreitere, J. Dampre, “Optoelectronic FPGAs,”
IEEE Journal of Selected Topics in Quantum Electronics, Vol. 5, No. 2, March - April
1999, pp. 306–315.
[24] J. Van Campenhout, “Solving the Interconnect Bottleneck. Optoelectronic FP-
GAs,” Broadband Optical Networks and Technologies: An Emerging Reality/Optical
MEMS/Smart Pixels/Organic Optics and Optoelectronics, 1998 IEEE/LEOS Summer
Topical Meetings, July 1998.
[25] H. Van Marck, J. Depreitere, D. Stroobandt, and J. Van Campenhout, “A Quantitative
Study of the Benefits of Area-I/O in FPGAS,” Proc. of the 8th Great Lakes Symposium
on VLSI, Febraury 1998, pp. 392–399.
[26] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective,
Third Ed., Boston: Person Education, Inc., 2005.
[27] M. J. Wirthlin and B. L. Hutchings, “DISC: The Dynamic Instruction Set Computer,”
Field Programmable Gate Arrays (FPGAs) for Fast Board Development and Reconfig-
urable Computing, J. Schewel, ed., Proceedings of SPIE, vol. 2607, 1995, pp. 92–103.
[28] Xilinx Inc., “Virtex-5 User Guide,”
available at: http://direct.xilinx.com/bvdocs/userguides/ug190.pdf.
[29] Xilinx Inc., “Virtex-5 Configuration User Guide,”
available at: http://direct.xilinx.com/bvdocs/userguides/ug191.pdf.
127
Vita
Matthew Collin Jordan was born on November 25, 1981, in Lansing, Michigan. In May 2004
he graduated cum laude from Michigan Technological University with a Bachelor of Science
in Computer Engineering. Subsequently he joined the graduate program in the Department
of Electrical and Computer Engineering at Louisiana State University. He is expected to
receive his Master of Science in Electrical Engineering in August of 2006.
