Reconfigurable Architectures for General-Purpose Computing, AI by André Dehon
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ARTIFICIAL INTELLIGENCE LABORATORY
A.I. Technical Report No. 1586 October, 1996
Reconﬁgurable Architectures for General-Purpose Computing
André DeHon
andre@mit.edu
Abstract: General-purpose computing devices allow us to (1) customize computation after
fabrication and (2) conserve area by reusing expensive active circuitry for different functions in
time. We deﬁne RP-space, a restricted domain of the general-purpose architectural space focussed
on reconﬁgurable computing architectures. Two dominant features differentiate reconﬁgurable
from special-purpose architectures and account for most of the area overhead associated with RP
devices: (1) instructions which tell the device how to behave, and (2) ﬂexible interconnect which
supports task dependent dataﬂow between operations.
We can characterize RP-space by the allocation and structure of these resources and compare the
efficiencies of architectural points across broad application characteristics. Conventional FPGAs
fall at one extreme end of this space and their efficiency ranges over two orders of magnitude across
the space of application characteristics. Understanding RP-space and its consequences allows us
to pick the best architecture for a task and to search for more robust design points in the space.
Our DPGA, a fine-grained computing device which adds small, on-chip instruction memories
to FPGAs, is one such design point. For typical logic applications and finite-state machines, a
DPGA can implement tasks in one-third the area of a traditional FPGA. TSFPGA, a variant of the
DPGA which focuses on heavily time-switched interconnect, achieves circuit densities close to the
DPGA, while reducing typical physical mapping times from hours to seconds.
Rigid, fabrication-time organization of instruction resources signiﬁcantly narrows the range
of efﬁciency for conventional architectures. To avoid this performance brittleness, we developed
MATRIX, the first architecture to defer the binding of instruction resources until run-time, allowing
the application to organize resources according to its needs. Our focus MATRIX design point is
based on an array of 8-bit ALU and register-ﬁle building blocks interconnected via a byte-wide
network. With today’s silicon, a single chip MATRIX array can deliver over 10 Gop/s (8-bit
ops). On sample image processing tasks, we show that MATRIX yields 10-20× the computational
density of conventional processors.
Understanding the cost structure of RP-space helps us identify these intermediate architectural
points and may provide useful insight more broadly in guiding our continual search for robust and
efﬁcient general-purpose computing structures.
Acknowledgements: This report describes research done at the Artificial Intelligence Laboratory of the
Massachusetts Institute of Technology. This research is supported by the Advanced Research Projects Agency of the
Department of Defense under Rome Labs contract number F30602-94-C-0252.
Acknowledgments
This effort grew out of the intellectual backdrop of the Transit and Abacus projects. Years prototyping
Transit machines with Tom Simon and his specialization philosophy set the stage for my initial
interest in FPGAs for computing. The initial ideas for the DPGA grew out of dialogs with Mike
Bolotski in which we tried to reconcile Abacus, his SIMD architecture which he described as “a
bunch of one-bit processors,” with FPGAs, which looked to me like “a bunch of one-bit processors.”
Tom Knight has been my research advisor since I was a junior. He has always encouraged me
to focus on the big idea and has been supportive as I explored sometimes radical points of view. He
gave me plenty of freedom to do the right thing, and hopefully, I have lived up to the conﬁdence
and trust implied by that autonomy.
The efforts of Jeremy Brown, Derrick Chen, Ian Eslick, Ethan Mirsky, and Edward Tau during
and after 6.371 made the DPGA prototype possible. Ian’s perseverance to finalize the layout and
verification was particularly responsible for the completion of that effort. Ed and Ian both helped see
the DPGA prototype through its ﬁnal postmortem.
TSFPGA and MATRIX were both possible only because of Derrick Chen and Ethan Mirsky,
the Master of Engineering students who respectively took ownership of the microarchitecture and
VLSI portions of those designs. We were largely able to complement each other’s efforts in our
attempts to understand and develop these architectures.
Discussions with Rich Lethin, Russ Tessier, and Jonathan Babb at MIT were useful in focusing
on the key issues which needed addressing.
Regular interaction with the emerging reconﬁgurable computing community was valuable for
encouragement and for identifying key problems and issues. Notably, discussions with Brad Taylor,
Mike Butts, Brad Hutchings, Bill Magione-Smith, John Villasenor, Phil Kuekes, Steve Trimberger,
Mike Smith, and Carl Ebling have been helpful in identifying the questions which need answers
and cleaning up ideas for presentation.
Thomas McDermott provided valuable feedback on the early chapters of this work.
The availability of high-quality, experimental CAD tools in source form from universities made
the experimental mapping work done here feasible. University of Toronto’s Chortle provided a
clean basis for several early experiments in DPGA synthesis. UC Berkeley’s SIS was used for
standard, technology independent circuit mapping. UC Berkeley’s mustang was the workhorse
behind multicontext FSM mapping.
This research was supported by the Advanced Research Projects Agency of the Department of
Defense under Rome Labs contract number F30602-94-C-0252.
Contents
I Introduction and Background
1 Overview and Synopsis
1.1 Evolution of General-Purpose Computing with VLSI Technology
1.2 This Thesis
1.3 Reconfigurable Device Characteristics
1.4 Configurable, Programmable, and Fixed-Function Devices
1.5 Key Relations
1.6 New General-Purpose Architectures
1.7 Prognosis for the Future
2 Basics and Terminology
2.1 General-Purpose Computing
2.2 General-Purpose Computing Issues
2.2.1 Interconnect
2.2.2 Instructions
2.3 Programmables and Configurables
2.4 FPGA Introduction
2.5 Regular and Irregular Computing Tasks
2.6 Metrics: Density, Diversity, and Capacity
2.6.1 Functional Density
2.6.2 Functional Diversity – Instruction Density
2.6.3 Data Density
3 Reconfigurable Computing Background
3.1 Successes of Reconfigurable Computing
3.1.1 Programmable Active Memories
3.1.2 Splash
3.1.3 PRISM
3.1.4 Logic Emulation
3.2 Lineage
3.3 Technological Enablers
II Empirical Review
4 Empirical Review of General Purpose Computing Architectures in the Age of MOS VLSI
4.1 Processors
4.2 VLIW Processors
4.3 Digital Signal Processors (DSPs)
4.4 Memories
4.5 Field-Programmable Gate Arrays (FPGAs)
4.6 Vector and SIMD Processors
4.7 Multimedia Processors
4.8 Multiple Context FPGAs
4.9 MIMD Processors
4.10 Reconfigurable ALUs
4.11 Summary
5 Case Study: Multiply
5.1 Custom Multipliers
5.2 Semicustom Multipliers
5.3 General-Purpose Multiply Implementations
5.4 Hardwired Functional Units in “General-Purpose Devices”
5.5 Multiplication Granularity
5.6 Specialized Multiplication
5.7 Summary
6 High Diversity on Reconfigurables
III Structure and Composition of Reconfigurable Computing Devices
7 Interconnect
7.1 Dominant Area and Delay
7.1.1 Fixed Area
7.1.2 Interconnect and Configuration Area
7.1.3 Delay
7.2 Problems with “Simple” Networks
7.2.1 Crossbars
7.2.2 Multistage Networks
7.2.3 Mesh Interconnect
7.3 Issues in Reconfigurable Network Design
7.4 Conventional Interconnect
7.5 Switch Requirements for FPGAs with 100-1000 LUTs
7.6 Channel and Wire Growth
7.6.1 Rent’s Rule Based Hierarchical Interconnect Model
7.6.2 Wire Growth in Rent Hierarchy Model
7.6.3 Switch Growth in Rent Hierarchy Model
7.7 Network Utilization Efficiency
7.8 Interconnect Description
7.8.1 Weak Upper Bound
7.8.2 Structure Based-Estimates
7.8.3 Significance and Impact
7.8.4 Instruction Growth versus Interconnect Growth
7.9 Effects of Interconnect Granularity
7.9.1 Wiring
7.9.2 Switches
7.10 Summary
8 Instructions
8.1 General Case Example
8.2 Bits per Instruction
8.3 Compressing Instruction Stream Requirements
8.3.1 Wide Word Architectures
8.3.2 Broadcast Single Instruction to Multiple Compute Units
8.3.3 Locally Configure Instruction
8.3.4 Broadcast Instruction Identifier, Lookup in Local Store
8.3.5 Encode Length by Likelihood
8.3.6 Mode Bits for Early Bound information
8.3.7 Themes
8.4 Compressibility
8.5 Control Streams
8.6 Instruction Stream Taxonomy
8.7 Summary
9 RP-space Area Model
9.1 Model and Assumptions
9.2 Peak Performance Density
9.3 Granularity
9.4 Contexts
9.5 Composition
9.6 Summary
IV New Architectures
10 Dynamically Programmable Gate Arrays
10.1 DPGA Introduction
10.2 Related Architectures
10.3 Realm of Application
10.3.1 Limited Throughput Requirements
10.3.2 Latency Limited Designs
10.3.3 Temporally Varying or Data Dependent Functional Requirements
10.3.4 Multicontext versus Monolithic and Partial Reconfiguration
10.4 A Prototype DPGA
10.4.1 Architecture
10.4.2 Implementation
10.4.3 Component Operation
10.4.4 Prototype Context Area Model
10.4.5 Prototype Conclusions
10.5 Circuit Evaluation
10.5.1 Levelization
10.5.2 Latency Limited Designs
10.5.3 Limited Task Throughput
10.6 Temporally Varying Logic – Finite State Machines
10.6.1 Example
10.6.2 Full Temporal Partitioning
10.6.3 Partial Temporal Partitioning
10.6.4 Comparison with Memory-based FSM Implementations
10.6.5 Areas for Improvement
10.6.6 General Technique
10.7 Additional Application Styles
10.7.1 Multifunction Components
10.7.2 Utility Functions
10.7.3 Temporally Systolic Computations
10.8 Control
10.8.1 Segregation
10.8.2 Distribution
10.8.3 Source
10.9 Conclusions
11 Dynamically Programmable Gate Arrays with Input Registers
11.1 Input Registers
11.2 iDPGA Model
11.3 Example
11.4 Circuit Benchmarks: Input Depth
11.4.1 Mapping
11.4.2 Detailed Example: alu2
11.4.3 Average Characteristics
11.4.4 Area for Improvement
11.5 Other Input Retiming Models
11.6 Summary
11.7 Review
12 Time-Switched Field Programmable Gate Arrays
12.1 Time-Switched Input Registers
12.2 Switched Interconnect – Folding
12.2.1 Subarray Structure
12.2.2 Interconnect Folding
12.3 Architecture
12.4 Architecture Parameters
12.5 TSFPGA Implementation Estimates
12.5.1 Area
12.5.2 Timing
12.6 TSFPGA Fast Circuit Mapping
12.7 Circuit Mapping
12.8 Related Work
12.9 Conclusions
12.10 Open Issues
13 MATRIX
13.1 MATRIX Concepts
13.2 MATRIX Architecture Overview
13.2.1 BFU
13.2.2 Network
13.2.3 Port Architecture
13.2.4 Port Contexts
13.2.5 Metaconfiguration Configuration
13.2.6 Time-Switching
13.2.7 Resource Deployment Granularity
13.2.8 Additional Information
13.3 Usage Example: Finite-Impulse Response Filter
13.4 Flexible Instruction Distribution
13.5 MATRIX Implementation
13.6 Building Block Efficiency
13.6.1 Memory
13.6.2 Datapath Elements
13.7 Image Processing Examples
13.7.1 VSR
13.7.2 RVF
13.7.3 BFIR
13.7.4 MFIR
13.7.5 Image Processing Summary
13.8 Summary
13.9 Area for Improvement
V Review and Extrapolation
14 Reconfigurable Processing Architecture Review
15 Projections
15.1 Role of Memory in Computational Devices
15.1.1 Memory for Instructions
15.1.2 Memory for Retiming of Intermediate Data
15.1.3 Implications
15.2 Reconfiguration: A Technique for the Computer Architect
15.3 Projecting General-Purpose Computing onto RP-space
15.3.1 General Hazards
15.3.2 Processors, FPGAs, and RP-space
15.3.3 General-Purpose Computing Space
15.4 Trends and Implications for Conventional Architectures
15.4.1 Microprocessors
15.4.2 Multiprocessors
16 Review of Major Concepts
List of Figures
1.1 First Order Size Comparison for Configurable Designs
1.2 LUT and Interconnect Primitives for Multicontext FPGA
1.3 TSFPGA Organization
1.4 MATRIX Basic Functional Unit
2.1 Temporal Reuse of Limited Active Silicon on General-Purpose Computing Devices
2.2 High-Level FPGA Abstraction
2.3 FPGA Array
2.4 Canonical 4-LUT Processing Element
2.5 Parallel and
2.6 Serial and
4.1 Basic Organization for a Processor
4.2 Inner Loop of Processor Implementation for Windowed Average
4.3 Processor Implementation for Parity Computation
4.4 Gate Implementation of any Function Computed by 7-input Lookup Table
4.5 Windowed Average – Pipelined FPGA Implementation
4.6 32-bit Parity – 4-LUT Implementation
4.7 Abacus (SIMD) Implementation of Windowed Average
4.8 Windowed Average – MATRIX Implementation
4.9 32-bit Parity – MATRIX Implementation
5.1 Comparison of Programmable and Custom Multiply Functional Densities
7.1 Conventional FPGA Interconnect Topology
7.2 FPGA Interconnect Caricature
7.3 Logical Structure of Hierarchical Interconnect
7.4 Switching node in 2-ary Hierarchical Interconnect
7.5 Switches per LUT – Equation versus Direct Calculation
7.6 Switches per LUT – Equation versus Direct Calculation
7.7 Overhead Growth versus for various
7.8 Overhead for versus
7.9 Continuous Overhead for versus
7.10 Continuous Efficiency for versus
7.11 Continuous Efficiency for versus (Log Scale)
7.12 Sample versus Overheads
7.13 E(overhead) versus for Uniform Distribution
7.14 Network Bits per LUT v/s Rent Exponent for 4096 (K=4)
7.15 Network Bits per LUT v/s Number of LUTs for 2 (K=4)
7.16 Single Context FPGA Area
7.17 Multicontext FPGA Area
9.1 Peak Computational Density Versus Contexts and Datapath Width
9.2 Compute and Instruction Densities Versus Contexts and Datapath Width
9.3 Efficiency as a Function of Architectural and Task Granularity for Single Context Architectures
9.4 Efficiency as a Function of Architectural and Task Granularity
9.5 Efficiency versus Task Data Width for a 1024-context, 32-bit Granularity Device
9.6 Efficiency as a Function of Task Path Length and Architectural Contexts
9.7 Efficiency versus Task Path Length for a 16-context, Single-bit Granularity Device
9.8 Efficiency versus Task Path Length for a 256-context, 128-bit Granularity Device
9.9 Efficiency for Conventional FPGA Design Point ( 1, 1)
9.10 Efficiency for Coarse-Grain, Deep Memory Design Point ( 64, 1024)
9.11 Efficiency for Fixed 8, 64
10.1 Efficiency for DPGA Design Point ( 1, 16)
10.2 LUT and Interconnect Primitives for Multicontext FPGA
10.3 ASCII Hex Binary Task Description
10.4 4-LUT Mapping of ASCII Hex Binary
10.5 ASCII Hex Binary Circuit Retimed for Full Pipelining
10.6 Typical Multicomponent System
10.7 Multifunction Component in System
10.8 Function Distribution in System
10.9 Architecture and Composition of DPGA
10.10 DRAM Memory Primitive
10.11 Array Element
10.12 Subarray Local Interconnect
10.13 Inter Subarray Interconnect
10.14 Annotated Die Photo of DPGA Prototype
10.15 Photo of DPGA Subarray and Crossbar Tile
10.16 Plot of Array Element with Configuration Memory
10.17 Plot of Crossbar with Configuration Memory
10.18 ASCII Hex Binary Subcircuit
10.19 Area Breakdown versus Number of Contexts for des Benchmark
10.20 Area Breakdown versus Number of Contexts for C880 Benchmark
10.21 Area Breakdown versus Number of Contexts for alu2 Benchmark
10.22 Area versus Throughput for Multicontext Implementations of alu2 Benchmark
10.23 versus for Coarse-grain Interleaved Contexts
10.24 Simple FSM Example
10.25 Two Context Implementation of Simple FSM Example
10.26 Area and Delay versus Number of Contexts for cse FSM Benchmark (Area Target)
10.27 Area and Delay versus Number of Contexts for cse FSM Benchmark (Delay Target)
10.28 Memory-based Implementation for Simple FSM Example
10.29 Canonical Video Coding Pipeline
10.30 Temporally Systolic Video Coding Pipeline
10.31 Control Distribution on DPGA Prototype
10.32 Multiple Controllers – Hardwired Control
10.33 Multiple Controllers – Configurable Control
10.34 Array Self Control Example
11.1 FPGA Array Element
11.2 DPGA Array Element
11.3 DPGA Array Element with Input Registers
11.4 iDPGA Array Element 4, 3
11.5 ASCII Hex Binary Implementation versus Contexts and Input Register Depth
11.6 alu2 Implementation Area versus Throughput
11.7 alu2 Area Ratios versus Throughput
11.8 Average Area Ratios versus Throughput
11.9 Average Area Ratios versus Contexts and Throughput
12.1 4-LUT with Time-Switched Input Register
12.2 Output Folding
12.3 Input Folding
12.4 Input and Output Folding
12.5 Two-Context DPGA as Input and Output Fold
12.6 TSFPGA Subarray Composition
12.7 TSFPGA Array Element Composition
12.8 Sample Inter-Subarray Network Connections
12.9 Sample Delay Increases with Context Packing
13.1 MATRIX BFU
13.2 BFU Control Logic
13.3 MATRIX Network
13.4 BFU Port Architecture
13.5 Systolic Convolution Implementation
13.6 Microcoded Convolution Implementation
13.7 Custom VLIW Convolution Implementation
13.8 VLIW/MSIMD Convolution Implementation
13.9 Configurable Datapaths
13.10 Datapath Composition: MATRIX versus Conventional 8 Architecture
13.11 Configurable Instruction Streams
13.12 Configurable Control Streams
13.13 MATRIX BFU Composition
13.14 MATRIX Implementation of Full 8-TAP, 4096 shift, VSR
13.15 Processor Implementation of VSR
13.16 MATRIX RVF Array
13.17 RVF Dataslice and Logic for Cells Below th Position
13.18 Control for MATRIX RVF for Cells Below th Position
13.19 Processor Implementation of RVF
13.20 MATRIX BFIR Datapath
13.21 Processor Implementation of BFIR
13.22 Efficiency for MATRIX and Fixed 8-bit Architecture ( 0 70)
14.1 FPGA and DPGA efficiency in RP-space
15.1 Comparing efficiency of FPGA and Processor idealizations in RP-space
List of Tables
4.1 Basic ALU Operations and Capacities
4.2 Survey of Processor Capacity
4.3 Processor Capacity Summary
4.4 Average Gate Evaluations/Datapath Bit
4.5 Survey of VLIW Capacity
4.6 VLIW Capacity Summary
4.7 Survey of DSP Capacity
4.8 DSP Capacity Summary
4.9 Survey of Peak Memory Logic Capacity (SRAM)
4.10 Survey of Peak Memory Logic Capacity (DRAM)
4.11 Survey of Peak Memory Logic Capacity (Hybrid)
4.12 Survey of Processor On-Chip Memory Capacity
4.13 Survey of FPGA Capacity
4.14 FPGA Capacity Summary
4.15 Survey of SIMD Processor Capacity
4.16 SIMD Processor Capacity Summary
4.17 Example Vector Processor Capacity
4.18 Vector Processor Capacity Summary
4.19 Multimedia Processor Capacity
4.20 Summary of Multimedia Processor Capacity
4.21 Survey of Multi-Context FPGA Capacity
4.22 Multi-Context FPGA Capacity Summary
4.23 Survey of MIMD Processor Capacity
4.24 Survey of Reconfigurable ALU Capacity
4.25 Survey of Reconfigurable ALU Capacity
4.26 General-Purpose Computational Capacity Summary
5.1 Survey of Multiplier Capacity
5.2 Sample Semi-Custom Multiplier Capacity
5.3 Survey of Programmable Multiply Capacity
5.4 Multiply Using Standard ALU Operations
5.5 Yielded Multiply Capacity as a Function of Granularity
5.6 Survey of Specialized Programmable Multiply Capacity
6.1 Survey of FPGA-Implemented Processor Capacity
7.1 FPGA 4-LUT Size
7.2 Bits per 4-LUT
7.3 FPGA Delay Breakdown
7.4 Parameters for a Sampling of Contemporary Programmable Devices
7.5 Configuration Bits – Requirement Upper Bound v/s Actual
7.6 4-LUT in 2-ary Hierarchical Interconnect with 2/3
8.1 Instruction Control Taxonomy
9.1 Summary of Area Model Parameters
9.2 for 0 5, 4, 2
9.3 Area for Instruction Control Sampling
10.1 DPGA Prototype Implementation Characteristics
10.2 Basic Component Sizes for Prototype
10.3 Array Core Area Breakdown by Programmable Function
10.4 DRAM Column Breakdown
10.5 Memory Area Breakdown
10.6 Estimated Timings
10.7 MCNC Circuit Benchmarks – Latency Limited – Two-Context DPGA Implementation
10.8 MCNC Circuit Benchmarks – Latency Limited – Four-Context DPGA Implementation
10.9 MCNC Circuit Benchmarks – Latency Limited – Context per Level DPGA Implementation
10.10 Multicontext Implementations of alu2 versus Throughput (LUTs)
10.11 Multicontext Implementations of alu2 versus Throughput (Area)
10.12 Multicontext Implementations of alu2 versus Throughput (Area Ratios)
10.13 Benchmark Set Area – Mapped Characteristics
10.14 Selected Area/Throughput Points for Benchmark Set (1 Clock/Result)
10.15 Selected Area/Throughput Points for Benchmark Set (10 Clock/Result)
10.16 Selected Area/Throughput Points for Benchmark Set (20 Clock/Result)
10.17 Full Partitioning of MCNC FSM Benchmarks (Area Target)
10.18 Full Partitioning of MCNC FSM Benchmarks (Delay Target)
10.19 Area and Delay versus Number of Contexts for cse FSM Benchmark (Area Target)
10.20 Area and Delay versus Number of Contexts for cse FSM Benchmark (Delay Target)
10.21 MCNC FSM Benchmarks LUTs v/s Number of Contexts (Area Target)
10.22 MCNC FSM Benchmarks Area v/s Number of Contexts (Area Target)
10.23 MCNC FSM Benchmarks Delay v/s Number of Contexts (Area Target)
10.24 MCNC FSM Benchmarks Area Ratio v/s Number of Contexts (Area Target)
10.25 MCNC FSM Benchmarks Delta Delay v/s Number of Contexts (Area Target)
10.26 MCNC FSM Benchmarks Delay v/s Number of Contexts (Delay Target)
10.27 MCNC FSM Benchmarks LUTs v/s Number of Contexts (Delay Target)
10.28 MCNC FSM Benchmarks Area v/s Number of Contexts (Time Target)
10.29 MCNC FSM Benchmarks Delta Delay v/s Number of Contexts (Delay Target)
10.30 MCNC FSM Benchmarks Area Ratio v/s Number of Contexts (Delay Target)
10.31 Memory Implementations for MCNC FSM Benchmarks
11.1 Total Physical LUTs Required to Implement alu2 Benchmark
11.2 Total Area Required to Implement alu2 Benchmark
11.3 Area Ratios for alu2 Benchmark Implementation
11.4 Average Ratios for Benchmark Set
11.5 Average Ratios for Benchmark Set
11.6 Average Ratios for Benchmark Set
11.7 Average Ratios for Benchmark Set
11.8 Average Ratios for Benchmark Set
11.9 Average Ratios for Benchmark Set
11.10 Average Ratios for Benchmark Set
11.11 Average Ratios for Benchmark Set
12.1 TSFPGA Subarray Parameters
12.2 TSFPGA Mappings for MCNC Circuit Benchmarks
12.3 TSFPGA Mappings for MCNC Circuit Benchmarks (Ratios)
12.4 Modulo Context Sharing for MCNC Benchmarks
13.1 Area Breakdown for Prototype MATRIX BFU Implementation
13.2 MATRIX BFU Composition Estimate
13.3 VSR Implementation Comparison
13.4 RVF Implementation Comparison
13.5 BFIR Implementation Comparison
13.6 FIR Survey – 8×8 multiply, 24-bit Accumulate
13.7 FIR Survey – 8×8 multiply, 16-bit Accumulate
Part I
Introduction and Background
1. Overview and Synopsis
1.1 Evolution of General-Purpose Computing with VLSI Technology
General-purpose computers have served us well over the past couple of decades. Broad
applicability has led to widespread use and volume commoditization. Flexibility allows a single
machine to perform a multitude of functions and be deployed into applications unconceived at
the time the device was designed or manufactured. The ﬂexibility inherent in general-purpose
machines was a key component of the computer revolution.
To date, processors have been the driving engine behind general-purpose computing. Originally
dictated by the premium for active real estate, processors focus on the heavy reuse of a single or
small number of functional units. With Very Large Scale Integration (VLSI), we can now integrate
complete and powerful processors onto a single integrated circuit, and the technology continues to
provide a growing amount of real estate.
As enabling as processors have been, our appetite and need for computing power has grown
faster. Despite the fact that processor performance steadily increases, we often ﬁnd it necessary
to prop up these general-purpose devices with specialized processing assists, generally in the form
of specialized co-processors or ASICs. Consequently, today’s computers exhibit an increasing
disparity between the general-purpose core and its specialized assistants. High performance systems
are built from a plethora of specialized ASICs. Even today’s high-end workstations dedicate more
active silicon to specialized processing than to general-purpose compute. The general-purpose
processor will be only a small part of tomorrow’s multi-media PC. As this trend continues, the term
“general-purpose computer” will become a misnomer for modern computer systems. Relatively
little of the computing power in tomorrow’s computers can be efﬁciently deployed to solve any
problem.
The problem is not with the notion of general-purpose computing, but with the implementation
technique. For the past several years, industry and academia have focussed largely on the task
of building the highest performance processor, instead of trying to build the highest performance
general-purpose computing engine. When active area was extremely limited, this was a very
sensible approach. However, as silicon real estate continues to increase far beyond the space
required to implement a competent processor, it is time to re-evaluate general-purpose architectures
in light of shifting resource availability and cost.
In particular, an interesting space has opened between the extremes of general-purpose
processors and specialized ASICs. That space is the domain of reconfigurable computing and offers
all the beneﬁts of general-purpose computing with greater performance density than traditional
processors. This space is most easily seen by looking at the binding time for device function.
ASICs bind function to active silicon at fabrication time making the silicon useful only for the
designated function. Processors bind functions to active silicon only for the duration of a single
cycle, a restrictive model which limits the amount the processor can accomplish in a single cycle
while requiring considerable on-chip resources to hold and distribute instructions. Reconfigurable
devices allow functions to be bound at a range of intervals within the final system depending on
the needs of the application. This ﬂexibility in binding time allows reconﬁgurable devices to make
better use of the limited device resources including instruction distribution.
Consequently, reconfigurable computing architectures offer:
• More application-specific adaptation than conventional processors
• Greater computational density than conventional processors
• More and broader reuse of silicon than ASICs
• Better opportunities to ride hardware and algorithmic technology curves than ASICs
• Better match to current technology costs than ASICs or processors
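The binding-time distinction drawn above can be made concrete with a toy calculation. The sketch below is only illustrative: the function name, the unit count, the instruction width, and the cycle count are all hypothetical values chosen for the example, not figures from this thesis. It counts how many instruction bits must be distributed to the functional units when function is rebound every cycle (processor-like), once per task (configured device), or never (fabrication-time binding).

```python
# Toy model with assumed, illustrative numbers: instruction bits a device must
# deliver to its functional units under different function-binding intervals.

def instruction_bits_delivered(units, bits_per_instruction, cycles, rebind_interval):
    """Total instruction bits distributed over `cycles` cycles.

    rebind_interval = 1 models a processor-like device (new instruction per
    unit on every cycle); a large interval models a configured device that
    holds its function in place; None models fabrication-time binding, where
    no instructions are delivered at all.
    """
    if rebind_interval is None:
        return 0
    rebinds = -(-cycles // rebind_interval)  # ceiling division
    return units * bits_per_instruction * rebinds

UNITS, BITS, CYCLES = 64, 32, 1_000_000  # hypothetical values

per_cycle = instruction_bits_delivered(UNITS, BITS, CYCLES, 1)
per_task = instruction_bits_delivered(UNITS, BITS, CYCLES, CYCLES)
fixed = instruction_bits_delivered(UNITS, BITS, CYCLES, None)
print(per_cycle, per_task, fixed)
```

Intermediate rebinding intervals fall between the two extremes, which is the range of binding times the text attributes to reconfigurable devices.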
1.2 This Thesis
This thesis characterizes a class of reconfigurable computing architectures and relates them
broadly to the better understood conventional alternatives. Since technology costs dictate the
architectural tradeoffs involved, this characterization is performed in the context of MOS VLSI
implementations. The convergence of process technologies, along with the large amount of silicon
real estate now available on a single die, allows us to perform broad comparisons based
primarily on silicon area.
The thesis provides:
1. A high level characterization of a reconfigurable processing space which includes
reconfigurable architectures such as FPGAs. This characterization helps us understand the key
characteristics of reconfigurable devices, including when and what level of performance we
can extract from various architectural points.
2. Empirical relations on the key building blocks in CMOS VLSI taken from existing designs
in the literature and our own experimental designs, including:
   • sizes (e.g. How big is a 4-LUT?)
   • performance density (operations per unit space-time)
   • relative feature sizes (e.g. interconnect versus configuration memory versus active computing)
   • first order modeling of key area factors
3. Architecture designs and implementations which explore new points in the identified design
space based on the empirical characterization.
   • architectures which exploit the identified cost structure to provide greater functional
     density for reconfigurable devices
   • architectures which allow diversity/density tradeoffs based on application characteristics
4. Lessons and observations for future device architects and systems designers
The major contributions of this thesis include:
1. RP-space model for reconfigurable processing architectures – While many loose taxonomies
exist for general-purpose computing, none are as systematic as the one presented here.
By focusing on the RP-space domain, this model provides size estimates and facilitates
pedagogical efficiency comparisons of architectures within the space.
2. DPGA – A novel bit-level architecture with multiple, on-chip instructions per compute
element, including the theory and concepts behind the architecture, an implementation,
experimental CAD to support it, and validation of efficiency using standard circuit benchmarks.
For typical logic circuits and finite-state machines, the DPGA implementation is one-third
the size of the FPGA implementation.
3. TSFPGA – A model, possible implementation, and CAD for fine-grained, time-switched
interconnect with demonstrated fast physical mapping capabilities. TSFPGA exploits the
observations that most of the area benefits in DPGAs come from sharing the interconnect
and that most of the difficulty in mapping to traditional FPGAs is their limited interconnect.
By sharing interconnect resources in time, TSFPGA extracts more interconnect functionality
from fewer active switching resources. For typical applications, quick mapping can be done
in seconds. The mapped design area is smaller than comparable FPGAs and slightly larger
than comparable DPGAs.
4. MATRIX – The first architecture to allow run-time binding of instruction resources. Focusing
on a design point using an array of 8-bit ALU and register-file building blocks interconnected
via a byte-wide network, MATRIX yields 10-20× the computational density of conventional
processors on sample image processing tasks. With today’s silicon, we can place hundreds of
these 8-bit functional units on a large die operating at 100 MHz, making it possible to deliver
over 10 Gop/s (8-bit ops) per component.
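The peak-throughput figure quoted for MATRIX follows from simple arithmetic, sketched below. The 100 MHz clock comes from the text; the assumption that each functional unit completes one 8-bit operation per cycle, the helper name, and the specific unit counts standing in for "hundreds" are illustrative choices, not values taken from the thesis.

```python
# Back-of-the-envelope peak throughput, assuming (hypothetically) that every
# 8-bit functional unit completes one operation per clock cycle.

def peak_ops_per_second(functional_units, clock_hz, ops_per_unit_per_cycle=1):
    """Peak operations per second for an array of identical functional units."""
    return functional_units * clock_hz * ops_per_unit_per_cycle

CLOCK_HZ = 100_000_000  # 100 MHz, as stated in the text

for units in (100, 200, 400):  # illustrative stand-ins for "hundreds" of units
    gops = peak_ops_per_second(units, CLOCK_HZ) / 1e9
    print(f"{units} units -> {gops:.0f} Gop/s peak")
```

Under that assumption, any array of more than 100 units exceeds the 10 Gop/s mark at 100 MHz. This is a peak number only; delivered performance depends on how well a task keeps the units busy.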
The remainder of this chapter provides a synopsis of the key results and relationships developed
in the thesis. This introductory part of the thesis continues with Chapter 2 which deﬁnes the
terminology and metrics used throughout the thesis. Chapter 3 reviews and highlights the existing
evidence for the high performance potential of reconﬁgurable computing architectures.
Part II sets the stage by examining the computational capabilities of existing general-purpose
computing devices. This starts with a broad, empirical review of general-purpose architectures in
Chapter 4. In Chapter 5, we compare hardwired and general-purpose multiplier implementations
as a case study bridging general-purpose and application-dedicated architectures. In Chapter 6,
we review processor architectures implemented on top of reconﬁgurable architectures to broaden
the picture and to see one way in which conventional reconﬁgurable architectures deal with high
operational diversity.
Part III takes a more compositional view of reconﬁgurable computing architectures. Chapter 7
looks at building blocks, sizes, and requirements for interconnect. Chapter 8 looks at resource
requirements for instruction distribution. Finally in Chapter 9, we bring the empirical data, in-
terconnect, and instruction characteristics together, providing a ﬁrst order model of RP-space, our
high-level model for reconﬁgurable processing architectures.
Part IV includes three new architectures: DPGA (Chapters 10 and 11), TSFPGA (Chapter 12),
and MATRIX (Chapter 13), which are highlighted below in Section 1.6. The final chapters, in
Part V, review the results and identify promising directions for the future.
1.3 Reconfigurable Device Characteristics
Broadly considered, reconfigurable devices fill their silicon area with a large number of computing
primitives, interconnected via a configurable network. The operation of each primitive can be
programmed as well as the interconnect pattern. Computational tasks can be implemented spatially
on the device with intermediates flowing directly from the producing function to the receiving
function. Since we can put thousands of reconfigurable units on a single die, significant data flow may
occur without crossing chip boundaries. To first order, one can think about turning an entire task
into hardware dataflow and mapping it on the reconfigurable substrate. Reconfigurable computing
generally provides spatially-oriented processing rather than the temporally-oriented processing
typical of programmable architectures such as microprocessors.
The key differences between reconﬁgurable machines and conventional processors are:
Instruction Distribution – Rather than broadcasting a new instruction to the functional
units on every cycle, instructions are locally conﬁgured, allowing the reconﬁgurable device
to compress instruction stream distribution and effectively deliver more instructions into
active silicon on each cycle.
Spatial routing of intermediates – As space permits, intermediate values are routed in
parallel from producing function to consuming function rather than forcing all communication
to take place in time through a central resource bottleneck.
More, often ﬁner-grained, separately programmable building blocks – Reconﬁgurable
devices provide a large number of separately programmable building blocks allowing a
greater range of computations to occur per time step. This effect is largely enabled by the
compressed instruction distribution.
Distributed deployable resources, eliminating bottlenecks – Resources such as memory,
interconnect, and functional units are distributed and deployable based on need rather than
being centralized in large pools. Independent, local access allows reconﬁgurable designs
to take advantage of high, local, parallel on-chip bandwidth, rather than creating a central
resource bottleneck.
1.4 Conﬁgurable, Programmable, and Fixed-Function Devices
To establish an intuitive feel for the design point and role of conﬁgurable devices, we can
take a high-level look at conventional devices. Ignoring, for the moment, multiplies, floating-point
operations, and table lookup computations, the modern processor has a peak performance on the
order of 256, 3-LUT gate-evaluations per clock cycle (e.g. two 64-bit ALUs). A modern FPGA
has a peak performance on the order of 2,048, 4-LUT gate-evaluations per clock cycle. The basic
clock cycle time is comparable giving the FPGA at least an order of magnitude larger raw capacity.
Note that both the processor ALUs and FPGA blocks are typically built with additional gates
which serve to lower the latency of word operations without increasing the raw throughput (e.g.
fast carry chains which allow a full 64-bit wide add to complete within one cycle time). This latency
reduction may be important to reducing the serial path length in tasks with limited parallelism, but
is not reﬂected in this raw capacity comparison.
The FPGA can sustain its peak performance level as long as the same 2K gate-evaluation
functionality is desired from cycle to cycle. Wiring and pipelining limitations are the primary
reason the FPGA would achieve lower than peak performance, and this is likely to account for,
at most, a 20-50% reduction from peak performance. If more diverse functionality is desired
from a single FPGA than the 1-2K gate-evaluations provided by the FPGA, performance drops
considerably due to function reload time.
The processor is likely to provide a much lower peak performance and the effect is much
more application speciﬁc. Due to the bitwise-SIMD nature of traditional ALUs, work per cycle
can be as low as a couple of gate-evaluations on compute operations. Since processors perform
all “interconnect” using shifts, moves, loads, and stores, many cycles yield no gate-evaluations,
only movement of data. The lower peak performance of processors comes from the fact that the
processor ALU occupies only a small fraction of the die, with substantial area going to instruction
flow control and on-chip memory to support large sequences of diverse operations without requiring
off-chip instruction or data access.
A comparably sized, dedicated piece of hardwired functionality, with no memory, could provide
a capacity of 200,000-300,000 4-LUT gate-evaluations per clock cycle, at potentially higher clock
rates. While the raw gate delay on the hardwired logic can be 10× smaller than on the FPGA,
reasonable cycle times in equivalent logic processes are closer to 2× smaller, since it makes sense to
pipeline the FPGA design at a more shallow logic depth than the custom logic. Returning to the
multiplier, for example, such a chip might provide 64K multiply bit operations per cycle (e.g. a
256×256 multiply pipelined at the byte level). The dedicated hardware provides 100-300 times
the capacity of the FPGA on the one task it was designed to solve. To first order, the dedicated
hardware can deliver very little capacity to significantly different applications. It is also worthwhile
to note that the fixed granularity of hardwired devices often causes them to sacrifice much of their
capacity advantage when used on small data items. For instance, performing an 8×8 multiply on a
64×64 hardwired multiplier makes use of only 1/64th of the multiplier’s capacity, removing much
of its 300× capacity advantage.
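The capacity figures in this section can be checked with a few lines of arithmetic. The following Python sketch uses only the order-of-magnitude numbers quoted above; the function names are illustrative, and folding the 2× cycle-time advantage into the upper end of the 100-300× range is our reading of the text rather than a stated derivation.

```python
from fractions import Fraction

# Figures quoted in the text (order-of-magnitude estimates, not measurements).
FPGA_GATE_EVALS = 2_048                     # 4-LUT gate-evaluations per cycle
HARDWIRED_GATE_EVALS = (200_000, 300_000)   # same unit, hardwired logic
CLOCK_ADVANTAGE = 2                         # assumed: hardwired cycle ~2x shorter


def capacity_ratio(hardwired: int, fpga: int = FPGA_GATE_EVALS) -> float:
    """Per-cycle capacity of hardwired logic relative to the FPGA."""
    return hardwired / fpga


def multiplier_utilization(op_bits: int, hw_bits: int) -> Fraction:
    """Fraction of an hw_bits x hw_bits multiplier array exercised by an
    op_bits x op_bits multiply (bit products used / bit products available)."""
    return Fraction(op_bits * op_bits, hw_bits * hw_bits)


low, high = (capacity_ratio(h) for h in HARDWIRED_GATE_EVALS)
print(f"per-cycle advantage: {low:.0f}x to {high:.0f}x")
print(f"with the clock advantage: up to {high * CLOCK_ADVANTAGE:.0f}x")

# A 256x256 multiply contains 256 * 256 = 64K bit products.
print(f"256x256 bit products: {256 * 256 // 1024}K")

used = multiplier_utilization(8, 64)
print(f"8x8 on a 64x64 multiplier uses {used} of the array")
print(f"a 300x advantage shrinks to about {300 * float(used):.1f}x")
```

Under these assumptions the per-cycle ratio alone gives roughly 100-150×, and the 8×8 case leaves the hardwired multiplier with only a single-digit advantage.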
Combining these observations, we can categorize the circumstances under which the various
structures are preferred.
Fixed Function, Limited Operation Diversity, High Throughput – When the function and
data granularity to be computed are well-understood and fixed, and when the function can
be economically implemented in space, dedicated hardware provides the most computational
capacity per unit area to the application.
Variable Function, Low Diversity – If the function required is unknown or varying, but
the instruction or data diversity is low, the task can be mapped directly to a reconfigurable
computing device, efficiently extracting high computational density.
Space Limited, High Entropy – If we are limited spatially and the function to be computed
has a high operation and data diversity, we are forced to reuse limited active space heavily
and accept limited instruction and data bandwidth. In this regime, conventional processor
organizations are most effective since they dedicate considerable space to on-chip instruction
storage in order to minimize off-chip instruction traffic while executing descriptively complex
tasks.
Figure 1.1: First Order Size Comparison for Configurable Designs (regions labeled Interconnect, Configuration Memory, and Active Logic)
Reconﬁgurable devices have become increasingly interesting as aggregate IC capacity has grown
large enough to adequately hold the computational diversity of many computing tasks or, at least, the
key kernels of these tasks. As the area available for general-purpose computing devices increases,
more tasks will ﬁt conveniently on reconﬁgurable devices, increasing the range of applications
where the reconﬁgurable solution yields higher performance per unit area.
In Chapter 2 we deﬁne our evaluation and comparison metrics more carefully. Chapters 4
and 5 provide an empirical review of conventional general-purpose and specialized architectures,
focusing on their performance density.
1.5 Key Relations
While reconfigurable devices have, potentially, 100× less performance per unit area than hardwired
circuitry, they provide 10-100× the performance density of processors. As noted above,
FPGAs offer a potential 10× advantage in raw, peak, general-purpose functional density over
processors. This density advantage comes largely from dedicating significantly less instruction
memory and distribution resources per active computing element. At the same time, this lower
memory ratio allows reconfigurable devices to deploy active capacity at a finer-grained level,
allowing them to realize a higher yield of their raw capacity, sometimes as much as 10× higher, than
conventional processors. It is these two effects taken together which give reconfigurable architectures
their 10-100× performance density advantage over conventional processor architectures in many
situations.
From an empirical review of conventional, reconfigurable devices, we see that 80-90% of the
area is dedicated to the switches and wires making up the reconfigurable interconnect. Most of
the remaining area goes into configuration memory for the network. The actual logic function
only accounts for a few percent of the area in a reconfigurable device. This interconnect and
configuration overhead is responsible for the 100× density disadvantage which reconfigurable
devices suffer relative to hardwired logic.
To a first order approximation, this gives us:

$$A_{\mathrm{net}} \approx 10 \times A_{\mathrm{config}} \approx 100 \times A_{\mathrm{logic}} \qquad (1.1)$$

where $A_{\mathrm{net}}$, $A_{\mathrm{config}}$, and $A_{\mathrm{logic}}$ denote the area of the interconnect network, of the memory for one
configuration, and of the active logic, respectively. It is this basic relationship (shown
diagrammatically in Figure 1.1) which characterizes the RP design space.
Since $A_{\mathrm{config}} \ll A_{\mathrm{net}}$, devices with a single on-chip configuration, such as most
reconfigurable devices, can afford to exert fine-grained control over their operations – any
savings associated with sharing configuration bits would be small compared to the network
area.
Since $A_{\mathrm{config}} \ll A_{\mathrm{net}}$, to pack the most functional diversity into a part, one can
allocate multiple configurations on chip. With the order-of-magnitude relative sizes given in
Relation 1.1, up to a 10× increase in the functional diversity per unit area is attainable in this
manner.
However, since $A_{\mathrm{net}}$ is only $10 \times A_{\mathrm{config}}$, if the number of configurations is large, say
100 or more, the configuration memory area will become the dominant size factor. Processors
are essentially optimized into this regime and that partially accounts for their 10× lower raw
performance density compared to reconfigurable devices.
Once we go to a large number of contexts, such that the total configuration memory space
begins to dominate interconnect area, fine-granularity becomes costly. In this regime wide-word
operations allow us to amortize instruction area across multiple bit processing elements.
This simultaneously allows machines with wide (e.g. 32 bit) datapaths to hold 1000’s of
configurations on chip while making them only 10× less computationally dense than fine-grained,
single context devices.
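The consequences of Relation 1.1 listed above can be explored numerically. The following is a minimal sketch, assuming that the area of one array element is simply the sum of its network, logic, and per-context configuration areas, and taking the relative sizes 100 : 10 : 1 from the relation; the function names are hypothetical and the model ignores any overhead for selecting among contexts.

```python
# Relative areas from Relation 1.1, normalized so that logic area = 1.
# These are order-of-magnitude values, not measurements of a real device.
A_LOGIC = 1.0
A_CONFIG = 10.0    # configuration memory for ONE context
A_NET = 100.0      # interconnect switches and wires


def element_area(contexts: int) -> float:
    """Area of one array element holding `contexts` configurations."""
    return A_NET + A_LOGIC + contexts * A_CONFIG


def diversity_gain(contexts: int) -> float:
    """Configurations held per unit area, relative to a single-context device."""
    return (contexts / element_area(contexts)) / (1 / element_area(1))


def config_fraction(contexts: int) -> float:
    """Share of the element area spent on configuration memory."""
    return contexts * A_CONFIG / element_area(contexts)


for c in (1, 4, 10, 100, 1000):
    print(f"{c:5d} contexts: diversity gain {diversity_gain(c):5.2f}x, "
          f"config memory {config_fraction(c):4.0%} of area")
```

In this simple model the gain in functional diversity per unit area saturates near 11×, consistent with the "up to 10×" estimate, and by 100 contexts configuration memory accounts for about 90% of the area.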
After reviewing implementations in Chapter 4, Chapters 7 and 8 examine interconnect and
instruction delivery issues in depth. Chapter 9 brings these together, yielding a slightly more
sophisticated model than the one above to explain the primary tradeoffs in the design of reconfigurable
computing architectures.
1.6 New General-Purpose Architectures
From the general relationships above, we see that conventional Field Programmable
Gate Arrays (FPGAs) represent one extreme in our RP-space. The space is large, leaving considerable
space for interesting architectures in the middle. Exploiting the relative area properties identified
above and common device usage scenarios, we have developed three new general-purpose com-
puting architectures. By judicious allocation of device resources, these architectures offer higher
yielded capacity over a wide range of applications.
Figure 1.2: LUT and Interconnect Primitives for Multicontext FPGA (labels: Memory, Context ID, Decode, Interconnect)
Figure 1.3: TSFPGA Organization (labels: TSFPGA array elements (AE) with input and output select, crossbar with inputs xin0-xin3, yin0-yin3 and outputs xout0-xout3, yout0-yout3, pipeline registers, timestep context and interconnect memory)
DPGA The Dynamically Programmable Gate Array (DPGA) is a multicontext FPGA, formed
by associating memory for several configurations with each active LUT and interconnect switch
(See Figure 1.2). From Relation 1.1, we see that the area associated with context memory is small
and can be replicated several times without substantially impacting active device capacity. The
multicontext design allows the device to reuse its active capacity to provide additional functionality
rather than additional throughput. For the operations required by an application which are not
the throughput bottleneck, the multicontext device yields higher device capacity than conventional
FPGA architectures. Chapter 10 describes our 4-context DPGA design and implementation,
identifies several common usage scenarios, and details experimental mapping techniques for circuits
and finite-state machines. Chapter 11 extends the basic DPGA model and circuit mapping tools to
include input retiming registers. The resulting architecture achieves 3× the density of conventional
FPGAs without sacrificing performance on typical applications.
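The multicontext idea behind the DPGA can be illustrated behaviorally. The sketch below is only a toy model under our own assumptions: the class `MulticontextLUT` and its methods are hypothetical names, not taken from the DPGA implementation, and the model covers only the context-selected LUT memory, not the interconnect, timing, or context-switch cost.

```python
from dataclasses import dataclass, field


@dataclass
class MulticontextLUT:
    """A k-input LUT whose truth table is selected by a context ID."""
    k: int = 4
    contexts: int = 4
    # One truth table per context, stored as an integer of 2**k bits.
    tables: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.tables:
            self.tables = [0] * self.contexts

    def configure(self, context: int, truth_table: int) -> None:
        """Load the 2**k-bit truth table used while `context` is active."""
        self.tables[context] = truth_table & ((1 << (1 << self.k)) - 1)

    def evaluate(self, context: int, inputs: tuple[int, ...]) -> int:
        """Look up the output bit for `inputs` in the selected context."""
        index = sum(bit << i for i, bit in enumerate(inputs))
        return (self.tables[context] >> index) & 1


# The same active LUT provides two different functions in two contexts.
AND4 = 1 << 15                                                  # only 1111 -> 1
PARITY4 = sum((bin(i).count("1") & 1) << i for i in range(16))  # 4-input XOR

lut = MulticontextLUT()
lut.configure(0, AND4)
lut.configure(1, PARITY4)
print(lut.evaluate(0, (1, 1, 1, 1)), lut.evaluate(1, (1, 1, 1, 1)))
```

Here one piece of active logic is reused for a 4-input AND in context 0 and a 4-input XOR in context 1; only the small per-context memory is replicated.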
TSFPGA A careful review of the DPGA implementation and Relation 1.1 reminds us that the
active logic portion of a reconfigurable design comprises only a small fraction of the space while the
programmable network is the key area consumer. The Time-Switched FPGA (TSFPGA) focuses
on reuse of the critical switch and wire resources (See Figure 1.3). By pipelining the switching
operations, TSFPGA allows us to extract higher capacity from the available switches and wires.
At the same time, the switched interconnect allows each individual switching element to play a
number of different roles. Consequently, TSFPGA compresses switching requirements, providing
more effective switching capacity with less physical interconnect. The greater yielded switching
capacity allows physical design mapping to occur rapidly – in seconds rather than the hours typical
of conventional FPGA architectures. Chapter 12 details the TSFPGA design, implementation, and
experimental mapping software.
Figure 1.4: MATRIX Basic Functional Unit (labels: BFU core with ALU function (Fa) and memory function (Fm) ports, network ports A and B, floating ports FP1 and FP2, network switches N1 and N2, level 1 and level 2, 3 network drivers, L3 control lines, control logic, carry in and carry out)
MATRIX All prior general-purpose computing architectures, including processors, FPGAs, and
the two previous architectures, make a rigid distinction between instruction and control resources
which manage computation and the computing resources which perform computations for an
application. Consequently, one must make a fabrication time decision about the device’s control
structure and the deployment of resources for control. We see in Chapters 8 and 9 that this decision
has a large impact on the distribution of dedicated instruction resources in the design and the range
of applications where the device is efficiently employed. MATRIX is a novel, coarse-grained,
computing architecture which uses a multilevel configuration scheme to defer this binding to the
application (See Figure 1.4). Our focus implementation uses an 8-bit primitive datapath element for
the basic functional unit. Rather than separate the resources for instruction storage and distribution
from the resources for data storage and computation and fix them at fabrication time, the MATRIX
architecture unifies these resources. Once unified, traditional instruction and control resources are
decomposed along with computing resources and can be deployed in an application-specific manner.
Chip capacity can be deployed to support active computation or to control reuse of computational
resources depending on the needs of the application and the available hardware resources. As a
result, MATRIX can be efficiently employed across a broader range of computational characteristics
than conventional architectures. Chapter 13 introduces the MATRIX architecture and shows how
it obtains these unique characteristics.
1.7 Prognosis for the Future
Ultimately, reconfiguration is a technique for compressing the resources dedicated to instruction
stream distribution while maintaining a general-purpose architecture. As such, it is an important
architectural tool for extracting the highest performance from our silicon real estate. Characteristics
of an application which change slowly or do not change can be configured rather than broadcast.
The savings in instruction control resources result in higher logic capacity per unit area.
With CMOS VLSI we have reached the point where we are no longer so limited by the
aggregate capacity of a single IC die that the device must be optimized exclusively to maximize
the number of distinct instructions resident on a chip. Beyond this point, spatial implementation of
all or portions of general-purpose computations is both feasible and beneficial. From this point on
we will see:
1. More applications and kernels ﬁt for spatial implementations on reconﬁgurable substrates
2. Reconﬁgurable techniques ﬁnd their way into general-purpose and ﬂexible computing de-
vices, changing the way we design even “nominally” conventional architectures
Reconﬁgurable architectures and techniques should be added to the modern computer architect’s
repertoire of design techniques, alongside more venerable ones such as microprogramming, trans-
lation, and caching.
The thesis closes in Part V by reviewing the key lessons from reconﬁgurable designs and their
implications for future general-purpose architectures.
2. Basics and Terminology
In this chapter we introduce much of the terminology used throughout the document. We start
with a high-level review of general-purpose computing. We deﬁne the distinction between pro-
grammable and conﬁgurable devices and the various components of conﬁgurable devices. Much
of the discussion will take Field-Programmable Gate Arrays as a basis, so we introduce an initial,
conceptual FPGA model. Finally, we deﬁne metrics for capacity, density, and diversity which
will be used when characterizing the various architectures reviewed. A glossary summarizing
terminology follows Chapter 16.
2.1 General-Purpose Computing
General-purpose computing devices are specifically intended for those cases where, economically,
we cannot or need not dedicate sufficient spatial resources to support an entire computational
task or where we do not know enough about the required task or tasks prior to fabrication to
hardwire the functionality. The key ideas behind general-purpose processing are:
1. Defer binding of functionality until device is employed – i.e. after fabrication
2. Exploit temporal reuse of limited functional capacity
Delayed binding and temporal reuse work closely together and occur at many scales to provide the
characteristics we now expect from general-purpose computing devices.
We are quite accustomed to exploiting these properties so that unique hardware is not required
for every task or application. This basic theme recurs at many different levels in our conventional
systems:
Market Level – Rather than dedicating a machine design to a single application or application
family, the design effort may be utilized for many different applications.
System Level – Rather than dedicating an expensive machine to a single application, the
machine may perform different applications at different times by running different sets of
instructions.
Application Level – Rather than spending precious real estate to build a separate computational
unit for each different function required, central resources may be employed to perform
these functions in sequence with an additional input, an instruction, telling it how to behave
at each point in time.
Algorithm Level – Rather than ﬁxing the algorithms which an application uses, an existing
general-purpose machine can be reprogrammed with new techniques and algorithms as they
are developed.
User Level – Rather than fixing the function of the machine at the supplier, the instruction
stream specifies the function, allowing the end user to use the machine as best suits his needs.
Machines may be used for functions which the original designers did not conceive. Further,
machine behavior may be upgraded in the ﬁeld without incurring any hardware or hardware
handling costs.
In the past, processors were virtually the only devices which had these characteristics and served
as general-purpose building blocks. Today, many devices, including reconﬁgurable components,
also exhibit the key properties and beneﬁts associated with general-purpose computing. These
devices are economically interesting for all of the above reasons.
2.2 General-Purpose Computing Issues
There are two key features associated with general-purpose computers which distinguish them
from their specialized counterparts. The way these aspects are handled plays a large role in
distinguishing various general-purpose computing architectures.
2.2.1 Interconnect
In general-purpose machines, the datapaths between functional units cannot be hardwired.
Different tasks will require different patterns of interconnect between the functional units. Within a
task, individual routines and operations may require different interconnectivity of functional units.
General-purpose machines must provide the ability to direct data flow between units. In the extreme
of a single functional unit, memory locations are used to perform this routing function. As more
functional units operate together on a task, spatial switching is required to move data among
functional units and memory. The ﬂexibility and granularity of this interconnect is one of the big
factors determining yielded capacity on a given application.
2.2.2 Instructions
Since general-purpose devices must provide different operations over time, either within a
computational task or between computational tasks, they require additional inputs, instructions,
which tell the silicon how to behave at any point in time. Each general-purpose processing element
needs one instruction to tell it what operation to perform and where to ﬁnd its inputs. As we will
see, the handling of this additional input is one of the key distinguishing features between different
kinds of general-purpose computing structures. When the functional diversity is large and the
required task throughput is low, it is not efficient to build up the entire application dataflow spatially
in the device. Rather, we can realize applications, or collections of applications, by sharing and
reusing limited hardware resources in time (See Figure 2.1) and only replicating the less expensive
memory for instruction and intermediate data storage.
2.3 Programmables and Conﬁgurables
The distinction between programmable devices and configurable devices is mostly artificial
– particularly since we show in Part III that these architectures can be viewed in one unified
design space. Nonetheless, it is useful to distinguish the extremes due to their widely varying
characteristics.
Figure 2.1: Temporal Reuse of Limited Active Silicon on General-Purpose Computing Devices (labels: Memory/Switching, Operation).
In general, we cannot embed the entire dataflow for a computational task (top) in hardware.
Consequently, we must reuse the limited active silicon resources available in time (bottom),
using additional control inputs, instructions, to tell the active silicon how to behave at each
point in time to realize the desired computational task.
Programmable – we will use the term “programmable” to refer to architectures which heavily
and rapidly reuse a single piece of active circuitry for many different functions. The canonical
example of a programmable device is a processor which may perform a different instruction on its
ALU on every cycle. All processors, be they microcoded, SIMD, Vector, or VLIW are included in
this category.
Configurable – we use the term “configurable” to refer to architectures where the active circuitry
can perform any of a number of different operations, but the function cannot be changed from cycle
to cycle. FPGAs are our canonical example of a configurable device. Once the instruction has been
“configured” into the device, it is not changed during an operational epoch.
One of the key distinctions, then, is the balance between (1) a piece of active logic and its
associated interconnect, and (2) the local memory to configure the operation of the logic and
interconnect. We define one configuration context as the collection of bits which describe the
behavior of a general-purpose machine on one operation cycle. One programming stream for
a conventional FPGA containing instructions for every array element along with interconnect
composes a “configuration context.” One might also think of each of the following as a configuration
context.
one instruction for a scalar processor
one VLIW word composed of instructions for all the parallel functional units
one line of horizontal microcode
Figure 2.2: High-Level FPGA Abstraction (labels: Processing Units, Input/Output, Interconnect)
2.4 FPGA Introduction
A Field-Programmable Gate Array (FPGA) is a collection of configurable processing units
embedded in a configurable interconnection network. In the context of general-purpose computing,
we concern ourselves here primarily with devices which can be reprogrammed. Figure 2.2 shows
the basic model.
From the high-level view shown in Figure 2.2, the FPGA looks much like a network of
processors. A conventional FPGA, however, differs from a conventional multiprocessor in several
ways:
Granularity – Conventional FPGAs have single bit processing elements, each of which is
controlled independently.
Instruction Control – Conventional FPGAs are conﬁgurable with a single instruction res-
ident per processing element. Changing instructions is slow compared to the rate at which
the processing element can operate on data.
Static Interconnect – With conventional FPGAs, interconnect is purely static, connecting
sources and sinks by locking down a path through the switching network.
Figure 2.3: FPGA Array (labels: Processing Element, Context Memory, Configurable Interconnect)
Processing elements are generally organized in an array on the IC die with less than complete
interconnect (See Figure 2.3). Since full connectivity among $N$ processing elements would grow as $O(N^2)$, FPGAs employ
more restricted connection schemes to limit the resources required for interconnect which are,
nonetheless, the dominant area component in conventional devices. When processing elements
are homogeneous, as is typically the case, device placement can be used to improve interconnect
locality and accommodate the limited interconnect. The interconnect is typically arranged in a
hierarchical mesh.
The processing elements, themselves, are simple functions combining a small number of inputs
to produce a single output. The most general such function is a Look-Up Table (LUT). We have
already noticed that the active logic function typically makes up only a small portion of the area.
Further, it turns out that most of the configuration memory goes into describing the programmable
interconnect. Consequently, for processing elements with a small number of inputs, the cost of using
a full look-up table for the programmable logic function, versus some more restricted programmable
element, is small.
We will use 4-input lookup tables (4-LUTs) as the canonical FPGA processing element for
the purpose of discussion and comparisons. To first order, reconfigurable FPGAs from Xilinx
[CDF 86, Xil94a], Altera [Alt95], and AT&T [ATT95] use 4-LUTs as their basic, constituent
processing elements. Research at the University of Toronto [RFLC90] indicates that four input
LUTs yield the most area efficient designs across a collection of circuit benchmarks. An optional
flip-flop is generally associated with each 4-LUT. Figure 2.4 shows the canonical, 4-LUT processing
element.
Figure 2.4: Canonical 4-LUT Processing Element (labels: LUT Memory, LUT Mux)
Throughout the thesis we make comparisons between processors and FPGAs. At times it is
convenient to equate small LUTs (2 to 4-LUTs) and ALU bits. It is, therefore, worthwhile to note
that a 2-LUT can perform any logical operation including those provided by typical ALUs (e.g.
AND, OR, XOR, INVERT). A 3-LUT can act as a half-adder. A pair of 3-LUTs can serve as an adder
or subtracter bit-slice, with one providing the carry out and the other the data bit output. Together
these cover all the standard arithmetic and logic operations in a typical ALU, such that one or two
3-LUTs, with appropriate interconnect, can subsume any single ALU bit function.
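The claim that a pair of 3-LUTs can serve as an adder bit-slice is easy to check in software. The following sketch is illustrative only: the helper names and the ripple-carry wiring are our own choices, with one LUT programmed as a three-input XOR for the data bit and the other as a majority function for the carry out.

```python
def lut3(truth_table: int, a: int, b: int, c: int) -> int:
    """Evaluate a 3-input LUT; truth_table holds the 8 output bits,
    indexed by the inputs taken as a 3-bit number (a is the low bit)."""
    return (truth_table >> (a | (b << 1) | (c << 2))) & 1


SUM_TABLE = 0b10010110     # a XOR b XOR c: the data bit output
CARRY_TABLE = 0b11101000   # majority(a, b, c): the carry out


def add(x: int, y: int, width: int = 8) -> int:
    """Ripple-carry add built from one pair of 3-LUTs per bit position."""
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        result |= lut3(SUM_TABLE, a, b, carry) << i
        carry = lut3(CARRY_TABLE, a, b, carry)
    return result  # the carry out of the top bit is discarded


print(add(100, 27))
```

Reprogramming the two truth tables, with the same wiring, yields the other bitwise ALU functions, which is the sense in which one or two 3-LUTs subsume an ALU bit.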
2.5 Regular and Irregular Computing Tasks
Computing tasks can be classiﬁed informally by their regularity. Regular tasks perform the
same sequence of operations repeatedly. Regular tasks have few data dependent conditionals such
that all data processed requires essentially the same sequence of operations with highly predictable
ﬂow of execution. Nested loops with static bounds and minimal conditionals are the typical
example of regular computational tasks. Irregular computing tasks, in contrast, perform highly
data dependent operations. The operations required vary widely according to the data processed,
and the ﬂow of operation control is difﬁcult to predict.
2.6 Metrics: Density, Diversity, and Capacity
The goal of general-purpose computing devices is to provide a common, computational substrate
which can be used to perform any particular task. In this section, we look at characterizing the
amount of computation provided by a given general-purpose structure.
To perform a particular computational task, we must extract a certain amount of computational
work from our computing device. If we were simply comparing tasks in terms of a ﬁxed processor
instruction set, we might measure this computational work in terms of instruction evaluations. If
we were comparing tasks to be implemented in a gate-array we might compare the number of
gates required to implement the task and the number of time units required to complete it. Here,
we want to consider the portion of the computational work done for the task on each cycle of
execution of diverse, general-purpose computing devices of a certain size. The ultimate goal is to
roughly quantify the computational capacity per unit area provided to various application types by
the computational organizations under consideration. That is, we are trying to answer the question:
“What is the general-purpose computing capacity provided by this computing structure?”
We have two immediate problems answering this question:
1. How do we characterize general-purpose computing tasks?
2. How do we characterize capacity?
The first question is difficult since it places few bounds on the properties of the computational
tasks. We can, however, talk about the performance of computing structures in terms of some
general properties which various important subclasses of general-purpose computing tasks exhibit.
We thus end up addressing more focussed questions, but ones which give us insight into the
properties and conditions under which various computational structures are favorable.
We will address the second question – that of characterizing device capacity – using a
“gate-evaluation” metric. That is, we consider the minimum size circuit which would be required to
implement a task and count the number of gates which must be evaluated to realize the computational
task. We assume the collection of gates available are all $k$-input logic gates, and use $k = 4$. This
models our 4-LUT as discussed in Section 2.4, as well as more traditional gate-array logic. The
results would not change characteristically using a different, finite value as long as $k \geq 2$.
2.6.1 Functional Density
Functional capacity is a space-time metric which tells us how much computational work a
structure can do per unit time. Correspondingly, our “gate-evaluation” metric is a unit of space-time
capacity. That is, we can get 4 “gate-evaluations” in one “gate-delay” out of 4 parallel and gates
(Figure 2.5) or in 4 “gate-delays” out of a single and gate (Figure 2.6).
That is, if a device can provide capacity $D_{cap}$ gate evaluations per second, optimally, to the
application, a task requiring $N_{ge}$ gate evaluations can be completed in time:
$$T_{task} = \frac{N_{ge}}{D_{cap}} \qquad (2.1)$$
In practice, limitations on the way the device’s capacity may be employed will often cause the
task to take longer and result in a lower yielded capacity. If the task takes time $T_{task\_actual}$ to
perform $N_{ge}$ gate evaluations, then the yielded capacity is:
$$D_{yield} = \frac{N_{ge}}{T_{task\_actual}} \qquad (2.2)$$
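As a small numeric illustration of Equations 2.1 and 2.2, the following Python sketch computes the ideal task time and the yielded capacity. The function names and the sample numbers are illustrative assumptions, not values from this report.

```python
def ideal_task_time(n_ge, d_cap):
    """Equation 2.1: seconds needed for n_ge gate evaluations at d_cap gate-evals/s."""
    return n_ge / d_cap


def yielded_capacity(n_ge, t_task_actual):
    """Equation 2.2: gate evaluations per second actually delivered to the task."""
    return n_ge / t_task_actual


# Hypothetical device: 1e9 gate-evaluations/s peak, task needing 4e6 gate evaluations.
t_ideal = ideal_task_time(4e6, 1e9)
# Suppose device limitations make the task take twice the ideal time.
d_yield = yielded_capacity(4e6, 2 * t_ideal)
```

Under these assumed numbers the task yields half of the device's peak capacity, which is the gap between peak and yielded capacity that the rest of the chapter tries to quantify.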
[Figure 2.5: Parallel and. Four and gates compute c<0> through c<3> from a<i> and b<i> in one gate delay.]
[Figure 2.6: Serial and. A single, clocked and gate computes c<3> through c<0> over gate delays 1 through 4.]
The capacity which a particular structure can provide generally increases with area. To
understand the relative benefits of each computing structure or architecture independent of the area
in a particular implementation, we can look at the capacity provided per unit area. We normalize
area in units of $\lambda$, half the minimum feature size, to make the results independent of the particular
process feature size. Our metric for functional density, then, is capacity per unit area and is
measured in terms of gate-evaluations per unit space-time in units of gate-evaluations/$\lambda^2$s. The
general expression for computing functional density given an operational cycle time $t_{cycle}$ for a unit
of silicon of size $A$ evaluating $N_{ge}$ gate evaluations per cycle is:
$$D = \frac{N_{ge}}{A \cdot t_{cycle}} \qquad (2.3)$$
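The space-time nature of this metric can be checked with a short Python sketch. It compares the parallel arrangement of Figure 2.5 with the serial arrangement of Figure 2.6 under assumed, hypothetical values for gate area and gate delay, and it ignores any register or clocking overhead in the serial case.

```python
def functional_density(n_ge_per_cycle, area_lambda2, t_cycle_s):
    """Equation 2.3: gate-evaluations per lambda^2 per second."""
    return n_ge_per_cycle / (area_lambda2 * t_cycle_s)


GATE_AREA = 1000.0   # hypothetical area of one gate, in lambda^2
GATE_DELAY = 1e-9    # hypothetical gate delay, in seconds

# Four gates working in parallel: 4 gate evaluations per gate delay in 4x the area.
parallel = functional_density(4, 4 * GATE_AREA, GATE_DELAY)
# One gate reused serially: 1 gate evaluation per gate delay in 1x the area.
serial = functional_density(1, GATE_AREA, GATE_DELAY)
```

Both arrangements come out at the same density, which is the point of normalizing by area and time together.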
This capacity deﬁnition is very logic centric, not directly accounting for interconnect capacity.
We are treating interconnect as a generality overhead which shows up in the additional area
associated with each compute element in order to provide general-purpose functionality. This is
somewhat unsatisfying since interconnect capacity plays a big role in deﬁning how effectively one
uses logic capacity. Unfortunately, interconnect capacity is not as cleanly quantiﬁable as logic
capacity, so we make this sacriﬁce to allow easier quantiﬁcation.
As noted, the focus here is on functional density. This density metric tells us the relative
merits of dedicating a portion of our silicon budget to a particular computing architecture. The
density focus makes the implicit assumption that capacity can be composed and scaled to provide
the aggregate computational throughput required for a task. To the extent this is true, architectures
with the highest capacity density that can actually handle a problem should require the least size
and cost. Whether or not a given architecture or system can actually be composed and scaled to
deliver some aggregate capacity requirement is also an interesting issue, but one which is not the
focus here.
Remember also that resource capacity is primarily interesting when the resource is limited. We
look at computational capacity to the extent we are compute limited. When we are i/o limited,
performance may be much more controlled by i/o bandwidth and buffer space which is used to
reduce the need for i/o.
2.6.2 Functional Diversity – Instruction Density
Functional density alone only tells us how much raw throughput we get. It says nothing about
how many different functions can be performed and on what time scale. For this reason, it is
also interesting to look at the functional diversity or instruction density. Here we use functional
diversity to indicate how many distinct function descriptions are resident per unit of area. This
tells us how many different operations can be performed within part or all of an IC without going
outside of the region or chip for additional instructions. We thus deﬁne functional diversity as:
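$$D_{instr} = \frac{N_{instr}}{A} \qquad (2.4)$$
where $N_{instr}$ is the number of resident instructions and $A$ is the area. To first order, we count
instructions as native processor instructions or processing element configurations, assuming a
nominally 4-LUT equivalent logic block for the logic portion of the processing element.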
2.6.3 Data Density
Space must also be allocated to hold data for the computation. This area also competes with
logic, interconnect, and instructions for silicon area. Thus, we will also look at data density when
examining the capacity of an architecture. If we put $N_{bits}$ of data for an application into a space,
$A$, we get a data density:
$$D_{data} = \frac{N_{bits}}{A} \qquad (2.5)$$
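The two storage-oriented densities can be computed the same way as functional density. The sketch below is only illustrative: the function names are assumptions, and the instruction count, bit count, and area are hypothetical values chosen for the example.

```python
def instruction_density(n_instructions, area_lambda2):
    """Resident instructions per lambda^2 (functional diversity)."""
    return n_instructions / area_lambda2


def data_density(n_bits, area_lambda2):
    """Resident data bits per lambda^2."""
    return n_bits / area_lambda2


# Hypothetical device: 512 resident instructions and 1024 data bits in 68e6 lambda^2.
d_instr = instruction_density(512, 68e6)
d_data = data_density(1024, 68e6)
```

Both results are small fractions per $\lambda^2$, which is why such densities are usually quoted in scientific notation.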
3. Reconfigurable Computing Background
This chapter brieﬂy reviews reconﬁgurable computing including:
• Modern successes
• Intellectual lineage
• Technology trends which determine the circumstances when reconfigurable architectures are
viable and advantageous
3.1 Successes of Reconﬁgurable Computing
FPGAs first became available in the middle of the 1980’s (e.g. [CDF+86]). In the late 80’s and
early 90’s we began to see reconfigurable computing engines enabled by these new devices. In this
section we highlight the early reconfigurable computing “successes.”
3.1.1 Programmable Active Memories
DEC PRL’s Programmable Active Memory (PAM) was one of the earliest platforms for
reconfigurable computing. PAM is an array of Xilinx 3K components connected to a host workstation
[BRV89]. The Perle-1 board contained 23 XC3090’s – roughly 15,000 4-LUTs. Using this
component as an accelerator, DEC PRL was able to speed up many applications by an order of magnitude
and, in some cases, provide performance in excess of conventional supercomputers or custom VLSI
implementations. Highlights from [BRV92]:
• Large number multiply 16× faster than Cray-II
• 600kbit/s, 512-bit RSA decoding – fastest implementation in existence at time of development
– 10× best software implementation on DEC Alpha
• String matching within a factor of two of custom implementation requiring 28 VLSI ICs
• Convolution and 3-D geometry at 200-300 MIPs
• Laplace equation at 25 GIPs
• DCT at 15 GIPs
The total silicon in the Perle-1 board was comparable to the total silicon in the host workstation –
but the combination ran these applications and others 10× faster than the workstation alone. The
difference is that almost all of the silicon on the Perle-1 board was general-purpose and capable
of being deployed to the problem at hand.
3.1.2 Splash
SRC’s Splash is a systolic array composed of 32 Xilinx XC3090’s, 20K 4-LUTs. On DNA
sequence matching Splash achieved over 300× the performance of a Cray-II or over 200× the
performance of a 16K-processor CM-2 [GHK+91].
3.1.3 PRISM
Brown’s PRISM architecture coupled a single Xilinx XC3090, 640 4-LUTs, with a Motorola
68010 node processor. The coupled FPGA could compute fine-grained, bitwise functions (e.g.
Hamming distance, bit reversal, ECC, logic evaluations, find first one) 20× faster than the 68010
host microprocessor [AS93].
3.1.4 Logic Emulation
Perhaps the most commercially significant application of “reconfigurable logic” to date has been
in the business of logic emulation. One of the earliest FPGA-based logic emulators was the Realizer
[VBB93] which was a precursor to Quickturn System’s Enterprise Emulation System. The Realizer,
with 42 XC3090’s (≈27K 4-LUTs) and 160 XC2018’s serving exclusively for interconnect, was
able to emulate 10K gate designs at a rate of several million clock cycles per second.
3.2 Lineage
While reconﬁgurable architectures have only recently begun to show signiﬁcant application
viability, the basic ideas have been around almost as long as the idea of programmable general-
purpose computing.
John von Neumann, who is generally credited with developing our conventional model for
serial, programmable computing, also envisioned spatial computing automata – a grid of simple,
cellular, building blocks which could be conﬁgured to perform computational tasks [vN66].
As computing implementation technology improved from vacuum tubes to diodes and
transistors to integrated circuits, research continued into cellular computation. In [Min67] Minnick
reviewed the state of the art in microcellular computational arrays, suggesting a role for
“programmable arrays.” Minnick’s own cutpoint cellular array in 1964 housed 48 cells less powerful
than a 2-LUT in a 6×8 cellular array with only right and down nearest neighbor connections in the
space of a suitcase. In 1971, Minnick reported a programmable cellular array which used flip-flops
to hold the configuration context which customized the array [Min71].
Jump and Fitsche detail the workings of a programmable cellular array [JF72] without describing
a possible technology realization.
Schaffner developed one of the earliest “general-purpose,” “programmable hardware” machines
in 1969 [Sch71, Sch78]. Schaffner’s machine used ALU’s with reconfigurable interconnect for his
reconfigurable building blocks, including the facilities to swap in “hardware” pages. The machine
was employed primarily for real-time signal processing for radar and weather.
The early eighties saw considerable interest in systolic computing architectures [Kun82]. While
much of the research was concerned with deriving hardwired, application-specific arrays, this
research also spawned the development of programmable systolic components (e.g. [FKM83]
[HS84]). These components were some of the first “reconfigurable computing” devices built
in VLSI. Owing to the application focus and the silicon real estate available at the time, the
programmable systolic building blocks were more coarse-grained than the cellular arrays or FPGAs,
placing a single 8-bit ALU per chip and relying predominantly on large, multichip or wafer-scale
arrays to build up significant spatial computations.
The most direct descendant of the programmable cellular array research is the Configurable
Array Logic (CAL) IC from Tom Kean and Algotronix [Kea89, GK89, Alg90]. CAL used a minimal
2-LUT for the basic cellular element and mostly nearest-neighbor connections for interconnect. This
gives it a much finer grain than the contemporary FPGAs from Xilinx which use 4-LUTs and richer
interconnect.
3.3 Technological Enablers
The basic idea of conﬁgurable array computation has been around as long as the ideas for
central processor, stored program execution. So, why have programmable processors become the
mainstream of general-purpose processing while “reconfigurable computing” is only now emerging
as a competitive, general-purpose computing technology?
The answer lies with technology costs and application requirements. Active computing
resources have been at a premium since the days of the vacuum tube. It took thousands of tubes
to build a general-purpose computer, making it infeasible to
implement large, spatial computations. With the advent of core memory, memory became
moderately dense compared to computing elements. To implement large, complex, computational tasks,
it was more efficient to store large programs densely in memory and reuse a small amount of fixed
logic.
The beginning of the MOS VLSI era reinforced these costs. Dense memories could be
implemented on silicon ICs. Because of high off-chip i/o costs, the critical unit became the amount of
logic or computation which could be placed on a single IC. The driving force has been to localize
computation to one or a small number of ICs to reduce costs and interchip communications. The
microprocessor was made successful by minimizing the amount of compute logic to the point
where it would fit onto a single IC. The critical turning point in processor development was when it
became possible to put a competent processor on a single IC. The RISC structure became so
successful because it enabled early integration of such capable processors. Once single-chip processors
became possible, they rapidly rose to dominate multichip implementations. While silicon area was
at a premium, exploiting the higher density of memories to store programs and reuse the limited space
on the processor die was necessary. Today, we still see some premium to fitting the kernel task
descriptions and their data into the limited memory available on the processor die.
The turning point for conﬁgurable hardware came when it was possible to place hundreds
of programmable elements on a single IC. At that point it became possible to realize regular
computations in space, dedicating each active computing element to a single task. Reconﬁgurable
computing began to take off as we could put 500-1,000 such programmable elements on a single
IC. Today we look at thousands of such elements per IC and that number continues to increase with
the silicon capacity. At thousands to tens of thousands of programmable elements, tight application
kernels can be spatially conﬁgured on one or a few conﬁgurable ICs without the need to share
active resources. This, in effect, caches the kernel not just in on-chip memory for use by a limited
amount of active processing elements, but right with the active processing elements such that a
large number may operate simultaneously.
There will always be some premium for dense task representation to handle the most complicated
tasks. However, as the silicon real-estate becomes larger, the premium for dense task packing
subsides, making it more and more beneficial to increase the on-chip silicon available for active
processing and remove the on-chip bottleneck between memory and processing elements. This
transition moves us to reconﬁgurable architectures.
Part II
Empirical Review
4. Empirical Review of General Purpose Computing Architectures in the Age of MOS VLSI
Here we review various general-purpose computing architectures by taking an empirical look at
their implementations during the past decade. In this section we draw from the whole realm of
general-purpose architectures – not just those which ﬁt directly into our RP-space. This makes a
larger set of design points available for review, but also introduces considerably more variation in
architectures than we will focus on in later chapters. We look primarily at general-purpose capacity
in this section, generally ignoring the effects of specialized functional units. The following chapter
will look at the effects of custom multipliers, the most common specialized functional unit added
to nominally general-purpose computing devices. The focus here is on integrated, single IC,
computational building blocks to keep the comparison as consistent as possible across such a wide
variety of architectures. Additionally, we focus entirely on MOS VLSI implementations since most
of these architectures have had multiple MOS VLSI implementations and the effects of MOS feature
size device scaling are moderately well understood.
4.1 Processors
We start by looking at a simple RISC-style processor.
Model The pedagogical processor model (Figure 4.1) is composed of:
• $w$-bit ALU (two read, one write port from register file) [for the sake of comparison, we do
include multiple ALU processors in the empirical review]
• memory (Register File, Instruction/Data Cache)
• control (sequencer, cache control, load/store unit, etc.)
[Figure 4.1: Basic Organization for a Processor. Blocks: ALU(s), Register File, Proc Ctrl, External I/O.]
Instruction          Gate Evaluation Capacity   Explanation
LD, ST, MOVE         0                          overhead allowing reuse and interconnect
ADD, SUB, CMP        2w                         each full adder bit is 2 gate evaluations
AND, OR, XOR, ...    w                          one gate evaluation per bit
BEQ, BNE             w/3                        w-bit AND reduction (w -> w/4 -> w/16 -> ... -> 1)
B, CALL, RETURN      0                          flow control overhead
SHIFT                0                          interconnect
Table 4.1: Basic ALU Operations and Capacities
The ALU is the sole source of general-purpose capacity. Table 4.1 shows the traditional ALU
operations provided by the ALU along with the computational capacity provided by each operation.
A $w$-bit ALU provides $w$ ALU bit operations. For this simple processor, no multiplier or specialized
coprocessor is included. We will look at hardwired multiply implementations separately in Chapter 5
as an example of such specialized coprocessors. Each ALU operation operates in one processor
clock cycle.
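The AND reduction behind the BEQ/BNE entry in Table 4.1 can be counted level by level. The following Python sketch is an illustration only; the function name is an assumption, and it simply counts the 4-input gates in a reduction tree, which comes out at roughly a third of the word width.

```python
def reduction_gates(w, k=4):
    """Count the k-input gates in a tree that reduces w bits to a single bit."""
    gates = 0
    remaining = w
    while remaining > 1:
        level = -(-remaining // k)  # ceil(remaining / k) gates on this level
        gates += level
        remaining = level           # each gate produces one bit for the next level
    return gates


gates_32 = reduction_gates(32)  # levels of 8, 2 and 1 gates
gates_64 = reduction_gates(64)  # levels of 16, 4 and 1 gates
```

The same count applies to any associative reduction with 4-input gates, such as the 32-bit XOR used for parity later in this section.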
Capacity Provided We extract a maximum of $2w$ gate evaluations ($w$ ALU bit operations) per
cycle. Modern processors are achieving cycle times as low as 2-5ns with 128. The fastest,
single-ALU processors today thus offer a peak capacity around 84 gate-evaluations/ns. Table 4.2
compares several processor implementations over the past decade. Results are summarized there
in terms of ALU bit ops since that is the native, and hence most accurate, unit for processors. From
Table 4.2, we see that conventional processors have provided a peak functional density of 3-9 ALU
bit operations/$\lambda^2$s over the past decade. We see from Table 4.1 and some simple weightings below
that an ALU bit op is somewhere between one half and two 3-LUT gate evaluations.
It is interesting, and perhaps a bit unexpected, to note how consistent this capacity density has
been over time. We might have expected:
1. delays to improve with process such that $\lambda^2\tau$ would be a better measure of process-normalized
capacity than $\lambda^2$s [$\tau$ is the delay parameter for a process which nominally amounts to the
intrinsic delay for gates. One $\tau$ is the delay required for one inverter to drive a single
inverter of equal size.]
2. architectural or circuit design improvements to have increased functional density over time
Given the displayed consistency, we may be seeing compensating effects from:
1. velocity saturation in the CMOS devices, especially submicron CMOS, prevents the expected
scaling
2. decreasing relative performance of on-chip interconnect with scaling
Year  Design      Ref.                Organization                  Die Size         λ (µm)  area (λ²)  cycle    ALU bit ops/λ²s
1984  RISC II     [SKPS84]            1×32 (100%)                   4.3mm×7.7mm      1.5     15M        330 ns   6.5
1984  MIPS        [RPJ+84]            1×32 (100%)                   5.5mm×6.1mm      1.5     15M        250 ns   8.5
1987  MIPS-X      [HHC+87]            1×32 (100%)                   8mm×8.5mm        1.0     68M        50 ns    9.4
1987  PA-RISC     [YFJ+87]            1×32 (100%)                   8.4mm×8.4mm      0.75    125M       66 ns    3.9
1988  SPARC       [QC88, TFT+85]      1×32 (100%)                   12.1mm×12.7mm    0.75    273M       60 ns    2.0
1990  PA-RISC     [TLB+90]            1×32 (100%)                   14mm×14mm        0.5     784M       11 ns    3.7
1990  SPARC       [MMN+90]            1×64 (75%), IEEE FPU (25%)    14.9mm×15.1mm    0.4     1.4G       25 ns    2.4
1992  SuperSparc  [ANAB+92]           2×32 (82%), IEEE FPU (18%)    16mm×16mm        0.4     1.6G       25 ns    2.0
1992  Alpha       [DWA+92]            1×64 (81%), IEEE FPU (19%)    16.8mm×13.9mm    0.38    1.7G       5 ns     9.5
1994  PA-RISC     [RDB+94]            2×64 (88%), IEEE FPU (12%)    14mm×15mm        0.28    2.8G       7 ns     7.4
1994  MIPS        [SYN+94]            1×32 (100%)                   7.9mm×8.8mm      0.2     1.7G       2 ns     9.1
1995  PowerPC     [BBB+95]            2×64 (87%), IEEE FPU (13%)    18.2mm×17.1mm    0.25    5G         7.5 ns   3.9
1995  UltraSparc  [CDd+95]            2×64 (84%), 2 FP/GFU (16%)    17.7mm×17.8mm    0.25    5G         6 ns     5.0
1995  SPARC V9    [SPA+95] [CDF+95]   4×64 (80%), IEEE FPU (20%)    297mm²           0.2     7.4G       6.5 ns   6.6
1995  Alpha       [BAB+95]            2×64 (90%), 2 IEEE FPU (10%)  16.5mm×18.1mm    0.25    4.8G       3.3 ns   9.0
1996  MIPS        [KDS+96]            1×32                          9.1mm×8.3mm      0.25    1.2G       10 ns    2.6
1996  PA-RISC     [LLNK96]            2×64 (80%), IEEE FPU (20%)    17.7mm×19.1mm    0.25    5.4G       4 ns     7.4
1996  ARM         [MWA+96]            1×32                          7.8mm×6.4mm      0.18    1.6G       5 ns     4
1996  Alpha       [GBB+96]            2×64 (95%), 2 IEEE FPU (5%)   14.5mm×14.4mm    0.18    6.8G       2.3 ns   8.6
ALU bit ops/λ²s = (number of ALUs × ALU width) / (fraction × area × cycle), where the percentage
in the Organization column is the fraction of the die not including any specialized coprocessors –
primarily omitting any FPUs.
Table 4.2: Survey of Processor Capacity
Year  Design      Ref.        ALU bit ops/λ²s  Instructions/λ²   Data bits/λ²
1984  RISC II     [SKPS84]    6.5              0                 3.0×10⁻⁴
1984  MIPS        [RPJ+84]    8.5              0                 3.4×10⁻⁵
1987  MIPS-X      [HHC+87]    9.4              7.5×10⁻⁶          1.5×10⁻⁵
1987  PA-RISC     [YFJ+87]    3.9              0                 8.2×10⁻⁶
1988  SPARC       [QC88]      2.0              0                 1.9×10⁻⁵
1990  PA-RISC     [TLB+90]    3.7              0                 1.3×10⁻⁶
1990  SPARC       [MMN+90]    2.4              1.5×10⁻⁶          1.9×10⁻⁵
1992  SuperSparc  [ANAB+92]   2.0              3.9×10⁻⁶          1.0×10⁻⁴
1992  Alpha       [DWA+92]    9.5              1.2×10⁻⁶          4.1×10⁻⁵
1994  PA-RISC     [RDB+94]    7.4              0                 7.5×10⁻⁶
1994  MIPS        [SYN+94]    9.1              1.5×10⁻⁷          5.3×10⁻⁶
1995  PowerPC     [BBB+95]    3.9              1.9×10⁻⁶          6.0×10⁻⁶
1995  UltraSparc  [CDd+95]    5.0              9.7×10⁻⁷          3.3×10⁻⁵
1995  SPARC V9    [SPA+95]    6.6              0                 1.1×10⁻⁵
1995  Alpha       [BAB+95]    9.0              4.8-57×10⁻⁷       1.5-18×10⁻⁵
1996  MIPS        [KDS+96]    2.6              8.4×10⁻⁷          2.8×10⁻⁵
1996  PA-RISC     [LLNK96]    7.4              0                 4.7×10⁻⁷
1996  ARM         [MWA+96]    4                2.5×10⁻⁶          8.0×10⁻⁵
1996  Alpha       [GBB+96]    8.6              3.1-38×10⁻⁷       1.0-12×10⁻⁵
Table 4.3: Processor Capacity Summary
3. increasing chip size implies increasing wire runs – the area occupied by long interconnect
wires is not scaling with $\lambda^2$
4. increasing use of CAD – designers are giving up some density in order to manage the
complexity associated with the larger and larger designs
5. increasing gap between on-chip and off-chip performance necessitates dedicating more on-
chip area to non-active resources, particularly memory, to compensate for the i/o bottleneck
6. increasing area being dedicated to control and state management to prevent control and data
dependent stalls in the instruction stream
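The peak density column of Table 4.2 can be reproduced from the other columns. The Python sketch below is a hedged reading of the table, not the author's stated procedure: it assumes the listed percentage scales the die area, an assumption that matches the tabulated values for the rows shown. The function name is illustrative.

```python
def alu_bit_op_density(n_alus, width, area_lambda2, t_cycle_s, fraction=1.0):
    """Peak ALU bit ops per lambda^2 per second for n_alus ALUs of the given width."""
    return n_alus * width / (fraction * area_lambda2 * t_cycle_s)


# Rows taken from Table 4.2 (area in lambda^2, cycle time in seconds).
risc_ii = alu_bit_op_density(1, 32, 15e6, 330e-9)            # table lists 6.5
mips_x = alu_bit_op_density(1, 32, 68e6, 50e-9)              # table lists 9.4
alpha_1996 = alu_bit_op_density(2, 64, 6.8e9, 2.3e-9, 0.95)  # table lists 8.6
```

Despite a more than 400-fold difference in area measured in $\lambda^2$ between the first and last rows, the computed densities stay within the same narrow band.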
This peak computational density assumes that every operation on each $w$-bit ALU performs a
$w$-bit compute operation and the processor completes one instruction per ALU on every cycle. In
practice, a significant number of processor cycles are not spent executing compute operations.
• Limited instruction and data bandwidth – coupled with long latencies to off-chip resources,
limited bandwidth can cause the processor to stall waiting on the information which it needs
to determine the course of the computation or the data it needs to operate upon. As a result,
the number of instructions completed per cycle is always less than the number of ALUs
available.
• Abstraction overhead and data movement consume capacity – procedure calls, data
marshaling, traps, and data conversion consume capacity without, directly, providing capacity
to the problem. Operations which simply move data around do not provide capacity to the
problem either. Since we have a mix of the instructions shown in Table 4.1, we will
never achieve the peak condition where every operation provides $2w$ gate-evaluations to the
computational task.

Application   Gate Evaluations/Datapath Bit
GCC           0.5
TeX           0.5
US Steel      0.8
Average       0.6
Table 4.4: Average Gate Evaluations/Datapath Bit
Quantitative studies tell us, for given systems or application sets, average values for the number
of gate evaluations per instruction and the number of instructions executed per clock cycle. This
gives us an expected computational capacity:
$$E(\text{Functional Density}) = \frac{\text{Gate Evaluations}}{\text{Datapath Bit}} \times \frac{\text{Datapath Bits}}{\text{Instruction}} \times \frac{\text{Instructions}}{\text{Issue Slot}} \times \frac{\text{Issue Slots}}{\text{Clock Cycle}} \times \frac{1}{\text{area} \cdot t_{cycle}}$$
For example, while HaL’s SPARC64 should be able to issue 4 instructions per cycle, in practice
it only issues 1.2 instructions per cycle on common workloads [EG95]. Thus
Instructions/Issue Slot ≈ 30%, resulting in a 70% reduction from expected peak capacity.
Assuming the integer DLX instruction mixes given in Appendix C of [HP90] are typical, we
can calculate Gate Evaluations/Datapath Bit by weighting the instructions by their provided
capacities from Table 4.1. In Table 4.4 we see that one ALU bit op in these applications is roughly
0.6 gate-evaluations.
If this effect is jointly typical with the instructions per issue slot number above, then we would
yield at most 0.3 × 0.6 ≈ 0.2 of the theoretical functional density. For the HaL case, this reduces
6.6 ALU bit ops/$\lambda^2$s to 1.3 gate-evaluations/$\lambda^2$s.
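The HaL arithmetic can be written out as a short Python sketch. The function and parameter names are assumptions made for this illustration; the numbers are the ones quoted in the text and in Tables 4.2 and 4.4.

```python
def expected_density(peak_alu_bit_ops, ge_per_datapath_bit, issue_utilization,
                     datapath_utilization=1.0):
    """Scale a peak ALU-bit-op density by the average utilization factors."""
    return (peak_alu_bit_ops * ge_per_datapath_bit
            * issue_utilization * datapath_utilization)


issue_utilization = 1.2 / 4  # 1.2 instructions issued out of 4 issue slots
hal = expected_density(6.6, 0.6, issue_utilization)
# Unrounded product is about 1.19; the text rounds 0.3 x 0.6 up to 0.2 and quotes 1.3.
hal_32bit_code = expected_density(6.6, 0.6, issue_utilization, 0.5)
```

The last line adds a further factor of one half for the case, discussed next, where only 32 bits of the 64-bit datapath carry useful data.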
There are, of course, several additional aspects which prevent most any application from
achieving even this expected peak capacity and which cause many applications to not even come
close to it.
1. Coarse-grained datapath/network – the word-oriented datapaths in conventional
processors prevent efficient interaction of arbitrary bits – e.g. a simple operation like XOR-ing
together two fields in a data word requires a mask, shift, and XOR operation even though
the whole operation may effect only a few gate evaluations. Returning to the HaL example
above, we note that the processor has a 64-bit datapath. When running code developed for
32-bit SPARCs, half of the datapath is idle, further reducing the yielded capacity by 50% to
0.7 gate-evaluations/$\lambda^2$s.
//R1 x pointer (input samples)
//R2 avg pointer (output averages)
//R3 working sum
//R6 buffer top
top:
    ld   -32[R1],R4   //value dropping out of window
    sub  R4,R3,R3     //subtract out
    ld   0[R1],R4     //value entering window
    add  R4,R3,R3     //add in
    asr  R3,R5,#3     //divide working sum by 8
    st   [R2],R5      //save new average
    addi R1,#4,R1     //increment x ptr
    ble  R1,R6,top    //branch to top
    addi R2,#4,R2     //increment avg ptr (delay slot)
Figure 4.2: Inner Loop of Processor Implementation for Windowed Average
2. Limited control of ALU capacity – the $w$-bit ALU mostly functions as a collection of
$w$ one-bit processors controlled in SIMD fashion. Full capacity is only extracted when all bits
arranged in a dataword need the same operation. Data with smaller ranges (less than $2^w$) do
not make full use of the capacity. Inhomogeneous operations (e.g. ADD the low 8-bits, XOR
bits 10-8, use this constant for bits 15-11, AND in bits 19-16, and OR together the remaining
bits) must be decomposed into sets of homogeneous operations.
Example: Average Calculation Consider a windowed average calculation performed on a
processor:
$$\mathrm{avg}_i = \frac{1}{8}\left(x_{i-3} + x_{i-2} + x_{i-1} + x_i + x_{i+1} + x_{i+2} + x_{i+3} + x_{i+4}\right)$$
Figure 4.2 shows a possible inner loop of the windowed average calculation on a standard RISC
processor. Assuming one instruction per instruction slot, this sequence takes 9 instruction slots to
perform two potentially 32 bit adds – for a total of 128 gate evaluations. The loads, stores, and
shifts are all data movement operations and do not contribute to the actual computational task at
hand. The branch and increments are control overhead. Assuming this operation is performed on
a MIPS-X processor at 1 CPI, we yield a functional density:
$$\frac{128\ \text{gate evaluations}}{68\text{M}\lambda^2 \times 9\ \text{cycles} \times 50\,\text{ns}} \approx 4.2\ \frac{\text{gate evaluations}}{\lambda^2\text{s}}$$
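The running-sum structure of the inner loop, one subtract and one add per output followed by a shift, can be sketched in Python as follows. This is an illustrative model rather than the processor code itself; the function names are assumptions, and the density figures are the MIPS-X values quoted in the text.

```python
WINDOW = 8  # eight samples, so dividing by 8 is a right shift by 3


def windowed_average(x):
    """Sliding 8-sample average using one subtract and one add per output."""
    if len(x) < WINDOW:
        return []
    total = sum(x[:WINDOW])          # prime the working sum
    averages = [total >> 3]
    for i in range(WINDOW, len(x)):
        total -= x[i - WINDOW]       # value dropping out of the window
        total += x[i]                # value entering the window
        averages.append(total >> 3)  # arithmetic shift, like the asr by 3
    return averages


def yielded_density(gate_evals, area_lambda2, cycles, t_cycle_s):
    """Gate evaluations per lambda^2 per second for one pass through the loop."""
    return gate_evals / (area_lambda2 * cycles * t_cycle_s)


samples = list(range(16))
result = windowed_average(samples)
mips_x_avg = yielded_density(128, 68e6, 9, 50e-9)  # roughly 4.2
```

Only the two adds in each iteration count as gate evaluations, which is why the yielded density falls below the 9.4 ALU bit ops/$\lambda^2$s peak listed for MIPS-X.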
Example: Parity Calculation Consider calculating the parity of a 32-bit word.
$$\mathrm{parity} = x_{31} \oplus x_{30} \oplus \cdots \oplus x_{0}$$
//R1 input
//R3 parity output (in bit 0)
    asr R1,R2,#16   //align half words
    xor R2,R1,R3    //16b xor
    asr R3,R2,#8    //align bytes
    xor R2,R3,R3    //8b xor
    asr R3,R2,#4    //align nibbles
    xor R2,R3,R3    //4b xor
    asr R3,R2,#2    //align 2 bits
    xor R2,R3,R3    //2b xor
    asr R3,R2,#1    //align final bits
    xor R2,R3,R3    //final xor
Figure 4.3: Processor Implementation for Parity Computation
Year  Design  Ref.       Organization  Die Size        λ (µm)  area (λ²)  cycle   ALU bit ops/λ²s
1990  LIFE-1  [LS90]     2×32          78mm²           0.75    139M       20 ns   23
1993  VIPER   [GNAB93]   4×32          12.9mm×9.1mm    0.6     326M       40 ns   9.8
ALU bit ops/λ²s = (number of ALUs × ALU width) / (area × cycle)
Table 4.5: Survey of VLIW Capacity
In 10 operations (see Figure 4.3), the processor can perform the 32b XOR required for the
parity calculation – 32 2-input gate evaluations or 11 4-input gate evaluations. Again, assuming a
MIPS-X like processor and 1 CPI, we yield:
$$\frac{11\ \text{gate evaluations}}{68\text{M}\lambda^2 \times 10\ \text{cycles} \times 50\,\text{ns}} \approx 0.32\ \frac{\text{gate evaluations}}{\lambda^2\text{s}}$$
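The shift-and-XOR folding used in Figure 4.3 can be modeled in Python as below. The function name is an assumption for this sketch, and the density line simply repeats the arithmetic of the text with the MIPS-X area and cycle time.

```python
def parity32(word):
    """Parity of a 32-bit word by folding halves together, as in the asr/xor sequence."""
    word &= 0xFFFFFFFF
    for shift in (16, 8, 4, 2, 1):
        word ^= word >> shift  # bit 0 accumulates the XOR of all folded bits
    return word & 1


# Five shift/xor pairs give 10 instructions; 11 four-input gate evaluations are credited.
parity_density = 11 / (68e6 * 10 * 50e-9)  # roughly 0.32 gate evaluations per lambda^2 s
```

Each of the ten instructions operates on a full 32-bit word, yet only the low-order result bit matters, which is why so little of the datapath's capacity reaches the task.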
4.2 VLIW Processors
Very Long Instruction Word (VLIW) machines are processors with multiple, parallel functional
units which are exposed at the architectural level. A single, wide, instruction word controls the
function of each functional unit on a cycle-by-cycle basis. Pedagogically, a VLIW processor looks
like a processor with multiple, independent functional units. At this level, the VLIW processor
does not look characteristically different from the modern superscalar processors included at the
end of the processor table.
Table 4.5 summarizes the characteristics of two VLIW processors. With only two datapoints it
is not possible to assess general trends. These examples seem to have about 2× the peak capacity
of processors. To the extent this may be characteristic of VLIW designs, it may arise from the fact
that the separate functional units share instruction control and management circuitry more than in
superscalar processors.

Year  Design  Ref.       ALU bit ops/λ²s  Instructions/λ²  Data bits/λ²
1990  LIFE-1  [LS90]     23               ?                ?
1993  VIPER   [GNAB93]   9.8              0                1.0×10⁻⁴
Table 4.6: VLIW Capacity Summary
VLIW processors may fail to achieve their peak for the same reasons as processors. In addition,
they may suffer from:
• Scheduling granularity – instructions must be statically scheduled in blocks at the VLIW
width. When this packing does not match the needs of the application, functional units may
go idle.
• Mismatch in Functional Unit Mix – some VLIWs have a mix of functional units which
perform different operations. If the mix of operations in the application does not match the
functional unit mix at a fine-grained level, functional units may go idle, lowering yielded
capacity.
• Data Transfer – one way VLIWs may achieve a higher functional density than superscalar
processors is to segregate the register files, reducing the interconnect required to deliver data
from the register file to the functional units and back. Some cycles will be required to transfer
data between register files as necessary to allow the various functional units to cooperate on
a task.
4.3 Digital Signal Processors (DSPs)
Digital Signal Processors (DSPs) are essentially specialized microprocessors which:
• Integrate a hardwired Multiplier Accumulator (MAC) unit
• Include specialized datapaths allowing
  – zero overhead loops
  – parallel increment of counters and pointers
The net effect of these additions is generally to increase the percentage of yielded capacity and
particularly to increase the yielded capacity on tight loop multiply operations.
Table 4.7 reviews several DSP implementations. On non-multiply operations, the peak perfor-
mance is generally lower than the processors. For the kinds of operations typical of DSPs, these
processors will generally yield much closer to their peak capacity than processors.
Year  Ref.                  Organization  Die Size         λ (µm)  area (λ²)  cycle    ALU bit ops/λ²s
1985  [WDW+85]              1×16          15.5mm²          1.0     16M        100 ns   10
1986  [vMWvW+86] [Gol87]    1×16          88.5mm²          1.0     90M        125 ns   1.4
1987  [KNK+87]              1×16          11.5mm×12.9mm    0.65    350M       50 ns    0.9
1987  [CBBF87]              1×32          6mm×8.5mm        0.5     200M       60 ns    1.3
1989  [PML+89]              1×16          71mm²            0.6     200M       100 ns   0.8
1992  [SKYH92]              1×16          9.5mm×10.5mm     0.6     275M       50 ns    1.2
1993  [USO+93]              1×16          9.3mm×9.1mm      0.4     530M       93 ns    0.33
1995  [NHK95]               1×32          8.5mm×8.5mm      0.25    1.2G       10 ns    2.8
1996  [KOT+96]              2×16          10mm×9.7mm       0.25    1.6G       25 ns    0.8
ALU bit ops/λ²s = (number of ALUs × ALU width) / (area × cycle)
Sizes above include MACs, which generally amount to 5-10% of the total DSP die area.
Table 4.7: Survey of DSP Capacity
Year  Ref.          ALU bit ops/λ²s  Instructions/λ²  Data bits/λ²
1985  [WDW+85]      10               0                1.5×10⁻⁵
1986  [vMWvW+86]    1.4              0                4.9×10⁻⁵
1987  [KNK+87]      0.9              2.9×10⁻⁶         9.3×10⁻⁵
1987  [CBBF87]      1.3              1.0×10⁻⁵         4.1×10⁻⁵
1989  [PML+89]      0.8              1.6×10⁻⁵         4.1×10⁻⁵
1992  [SKYH92]      1.2              1.4×10⁻⁵         3.0×10⁻⁵
1993  [USO+93]      0.33             1.5×10⁻⁵         1.5×10⁻⁴
1995  [NHK95]       2.8              8.9×10⁻⁷         2.9×10⁻⁵
1996  [KOT+96]      0.8              ?                ?
Table 4.8: DSP Capacity Summary
4.4 Memories
Most general-purpose devices use memories to store instructions and data. A memory can
also be used directly to implement complex computational functions. For complicated functions,
a memory lookup can often provide high, programmable computational capacity. For just a few
examples see [Has87, HT95, RS92, Lev77].
Model We characterize a memory by:
• depth – either in terms of address bits ($n$) or total number of memory words ($2^n$)
• width – $w$, the number of bits read out for each $n$ bits of address put into the memory
• $t_{cycle}$ – the minimum time required between successive read operations
[Figure 4.4: Gate Implementation of any Function Computed by 7-input Lookup Table. Inputs a<0> through a<6>, single output out.]
Capacity Provided The capacity provided by a memory is highly dependent on the inherent
complexity of the logic function being computed. The lower bound is trivially zero since we could
program the identity function into a memory.
We can use a counting argument to determine how complicated the functions can get. We start
by observing that an $n$-input, one-output lookup table can implement $2^{2^n}$ different functions. We
then consider how many gates it requires to implement any of these functions. Each gate can be
any function of four inputs, so each gate can implement $2^{2^4}$ functions. A collection of $N$ gates can
thus implement at most $(2^{2^4})^N$ functions (less due to overcounting). In order to implement any of
the $2^{2^n}$ functions provided by the table, we need at least:
$$2^{2^n} \leq \left(2^{2^4}\right)^N = 2^{(N \cdot 2^4)}$$
$$2^n \leq N \cdot 2^4$$
$$N \geq 2^{(n-4)} \qquad (4.1)$$
Conversely, by construction, we can show that any function computed by the $n$-input lookup
table can be computed with $2^{(n-3)} - 1$ gate evaluations. As suggested in Figure 4.4, we can use
$2^{(n-4)}$ gates to select the correct functional value based on the low four bits of the address. We
then build a binary mux reduction tree to select the final output based on the remaining address
bits. This tree requires $2^{(n-4)} - 1$ muxes. Together, the $2^{(n-3)} - 1$ gates compute any function
computable by the $n$-input lookup table.
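The construction can be checked with a small Python model. This is a sketch under the assumptions stated in the text (each leaf is one 4-input gate holding 16 consecutive table entries, and each 2:1 mux counts as one gate); the function name and the test table are illustrative.

```python
def eval_via_gates(table, n, addr):
    """Evaluate an n-input truth table with 4-input leaf gates and a binary mux tree.

    Returns the selected table bit and the number of gates in the structure.
    """
    low = addr & 0xF
    # Each leaf gate computes one 16-entry slice of the table from the low 4 address bits.
    level = [table[16 * j + low] for j in range(2 ** (n - 4))]
    gates = len(level)
    # Each remaining address bit halves the candidates with a row of 2:1 muxes.
    for bit in range(4, n):
        select = (addr >> bit) & 1
        level = [level[2 * i + select] for i in range(len(level) // 2)]
        gates += len(level)
    return level[0], gates


N_INPUTS = 7
# Arbitrary but fixed 128-entry truth table for the check.
TABLE = [(i * 37 % 11) & 1 for i in range(2 ** N_INPUTS)]
value, gate_count = eval_via_gates(TABLE, N_INPUTS, 0b1011010)
```

For the 7-input case of Figure 4.4 this uses 8 leaf gates and 7 muxes, 15 gates in total, against the counting-argument lower bound of 8.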
An $n$-input by one output table lookup can thus provide between $2^{(n-4)}$ and $2^{(n-3)} - 1$ gate
evaluations per cycle for the most complicated $n$-input functions. Since the bounds are essentially a
factor of two apart, we can approximate the peak as $2^{(n-4)}$ gate evaluations per cycle. If the table is
$w$ bits wide, the table provides at most $w$ times as many gate evaluations. Putting all this together,
we get:
$$D = \frac{w \cdot 2^{(n-4)}}{\text{area} \cdot t_{cycle}} \qquad (4.2)$$
Tables 4.9, 4.10, and 4.11 review memory implementations, showing the peak functional
density for each memory array. For the most complex functions, memories provide the highest
capacity of any general-purpose architecture. For less complex operations, however, memories are
inefficient, yielding very little of their potential capacity.
For example, an 8-bit add operation with carry output requires 16 gate evaluations. Performed
in a $2^{16} \times 9$ memory, such as a 9-bit version of the 64K×18 memory from [SMK 94], this provides
only 2.4 gate-evaluations/λ²s. The inefficiency of the memory-based adder increases with operand
size since the number of gate evaluations in an $n$-bit add increases as $2n$ whereas the memory area
increases as $2^{2n}(n+1)$.
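As a rough check on these numbers, the sketch below evaluates Equation 4.2 and the yielded density of the adder. It is an illustration, not the report's own tooling; the function names are hypothetical, and the area and cycle time are the 64K×18 [SMK 94] entry from Table 4.10 (165Mλ², 40 ns). Charging the full 165Mλ² to the adder is an assumption, chosen because it reproduces the quoted 2.4 figure.

```python
def memory_peak_density(n_addr_bits, width, area_lambda2, t_cycle_s):
    """Peak gate-evaluations per lambda^2-second of an n-address-bit, w-bit-wide memory."""
    return width * 2 ** (n_addr_bits - 4) / (area_lambda2 * t_cycle_s)


def yielded_density(gate_evals, area_lambda2, t_cycle_s):
    """Gate-evaluations per lambda^2-second actually delivered by one lookup."""
    return gate_evals / (area_lambda2 * t_cycle_s)


AREA = 165e6      # lambda^2, 64K x 18 entry in Table 4.10
T_CYCLE = 40e-9   # seconds

peak = memory_peak_density(16, 18, AREA, T_CYCLE)   # about 11,200, as tabulated
adder = yielded_density(16, AREA, T_CYCLE)          # about 2.4 (assumes full area)

if __name__ == "__main__":
    print(round(peak), round(adder, 1))
```

The ratio of the two results shows how little of the memory's peak capacity the 8-bit add extracts.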
For all the memories listed, the capacity is based on continuous cycles of random access. In
particular, nibble, fast page, or synchronous access in DRAMs is not exploited. For example,
[TNK 94] achieves 13,500 gate-evaluations/λ²s on random access. In sequential access mode,
the part can output 18 bits every 8ns. For large sequential access, this means an effective cycle
time of 8ns instead of the 48ns quoted – a factor of six improvement in cycle time and capacity.
Used in this mode, the peak performance is 81,000 gate-evaluations/λ²s.
It is also worth noting that, unlike processors, the capacity of memories has increased over time.
This is likely due to:
• increased specialization of the fabrication processes to memories – especially the introduction
of three-dimensional structures in DRAMs, local interconnect, and high poly film resistors
for SRAMs
• increased pipelining of memory access
Modern processors actually dedicate a signiﬁcant portion of their area to memory. Table 4.12
summarizes the peak capacity the processor can extract by using table lookups in its D-cache. The
area used in calculating this capacity is the entire processor for the processors listed in Table 4.2.
This peak capacity can be thought of as the peak capacity one could extract from each load operation
when using the on-chip D-cache for table lookup operations.
Year Design Organization Size λ area (λ²) cycle gate-evals/λ²s
1984 [BDN84] 16K 1 16.3mm² 1.0 16M 35 ns 1800
1984 [SMI 84] 32K 8 6.7mm 8.9mm 0.6 164M 46 ns 2200
1984 [YKK 84] 64K 1 4.7mm 6.6mm 0.75 55M 25 ns 3000
1984 [SCLB84] 4k 16 32.6mm² 0.75 58M 30 ns 2400
1984 [MSM 84] 8K 8 6.0mm 6.8mm 0.6 113M 28 ns 1300
1984 [OKH 84] 2K 8 2.7mm 3.5mm 0.75 17M 16 ns 3800
1984 [CH84] 4K 4 11mm² 1.05 10M 18 ns 5700
1984 [MMS 84] 64K 1 3.2mm 6.0mm 0.65 45M 20 ns 4600
1985 [YTN 85] 32K 8 5.0mm 9.2mm 0.65 110M 45 ns 3400
1985 [SAI 85] 32K 8 49.6mm² 0.65 120M 45 ns 3100
1985 [SGS 85] 8K 8 31.8mm² 0.75 57M 35 ns 2100
1985 [KEK 85] 32K 8 40.7mm² 0.6 110M 55 ns 2600
1986 [CC86] 8K 8 45mm² 0.75 80M 35 ns 1500
1986 [KIK 86] 256k 1 4.5mm 10.6mm 0.5 190M 25 ns 3500
1986 [FRV 86] 64K 1 3.4mm 8.9mm 0.75 54M 13 ns 5900
1987 [WBS 87] 32K 8 6.8mm 9mm 0.6 170M 21 ns 4600
1987 [KTO 87] 128K 8 8mm 13.7mm 0.5 440M 35 ns 4300
1987 [WHS 87] 128K 8 5.5mm 14.8mm 0.4 510M 34 ns 3800
1987 [MOT 87] 128K 8 6.9mm 15.4mm 0.4 660M 25 ns 4000
1987 [GHS 87] 32K 8 9.5mm 7.8mm 0.65 175M 40 ns 2300
1988 [SKI 88] 128K 8 7.6mm 12.4mm 0.4 590M 44 ns 2500
1988 [WBEK 88] 1M 1 10.6mm 8.5mm 0.35 730M 29 ns 3100
1988 [CDH 88] 128K 8 12.2mm 7.7mm 0.35 770M 25 ns 3400
1988 [STT 88] 256K 4 7.5mm 12mm 0.35 730M 18 ns 5000
1988 [SHU 88] 256K 4 6.2mm 15.2mm 0.4 580M 15 ns 7500
1988 [KWA 88] 1M 1 5.5mm 15.7mm 0.35 710M 14 ns 6600
1988 [ONN 88] 32K 8 4.4mm 9.5mm 0.4 260M 7.5 ns 8400
1989 [VPP 89] 256K 1 3.9mm 9.5mm 0.35 300M 14 ns 3900
1989 [MMK 89] 512K 8 7.5mm 17.4mm 0.25 2.1G 25 ns 5000
1989 [SIY 89] 1M 1 5.3mm 10.3mm 0.25 870M 9 ns 8300
1990 [FPH 90] 256K 1 11.6mm 3.7mm 0.5 170M 8 ns 12000
1990 [ASO 90] 4M 1 7.7mm 18.6mm 0.28 1.9G 15 ns 9200
1990 [HKM 90] 4M 1 8.4mm 18.0mm 0.3 1.7G 20 ns 7800
1990 [SIS 90] 512K 8 7.2mm 16.9mm 0.25 1.9G 23 ns 5900
1990 [OHK 90] 512K 8 7.8mm 17.4mm 0.25 2.2G 23 ns 5200
1991 [CCS 91] 32K 16 11.1mm 10.1mm 0.4 700M 2 ns 23500
1992 [SSN 92] 32K 8 6mm 9mm 0.4 340M 200 ns 240
1992 [GOK 92] 2M 8 18.3mm 12.5mm 0.2 5.7G 12 ns 15300
1992 [MKS 92] 4M 4 10.4mm 21.5mm 0.2 5.6G 15 ns 12500
1992 [SIU 92] 256K 4 4mm 7.4mm 0.15 1.3G 7 ns 7200
1993 [SKS 93] 4M 4 9.7mm 21.9mm 0.18 6.9G 9 ns 16800
1993 [SUT 93] 4M 4 10.4mm 10.6mm 0.13 7.1G 20 ns 7400
1994 [IKM 94] 4M 4 10.3mm 20.9mm 0.2 5.4G 30 ns 6500
Table 4.9: Survey of Peak Memory Logic Capacity (SRAM)
Year Design Organization Size λ area (λ²) cycle gate-evals/λ²s
1984 [MKS 84] 256K 1 6.3mm 6.3mm 0.75 70M 140 ns 1700
1984 [BCH 84] 256K 1 50mm² 1.0 50M 150 ns 2200
1984 [MKM 84] 256K 1 30.2mm² 0.8 47M 116 ns 3000
1984 [KFO84] 256K 1 46.8mm² 0.6 130M 100 ns 1250
1984 [SNT 84] 128K 8 9.4mm 8.1mm 0.5 300M 120 ns 1800
1985 [KCE 85] 1M 1 5.5mm 10.5mm 0.5 230M 160 ns 1800
1985 [KFM 85] 1M 1 5mm 13mm 0.6 180M 260 ns 1400
1985 [SFO 85] 1M 1 5mm 12.5mm 0.6 170M 190 ns 2000
1985 [TJ85] 256K 4 6.0mm 11.4mm 0.6 190M 190 ns 1800
1986 [FSO 86] 1M 1 4.4mm 12.3mm 0.6 150M 190 ns 2300
1986 [TTS 86] 4M 1 6.2mm 16.0mm 0.4 620M 300 ns 1400
1986 [FOW 86] 4M 1 7.8mm 17.5mm 0.5 550M 200 ns 2200
1986 [KAI 86] 64K 4 3.1mm 6.9mm 0.6 59M 200 ns 1400
1986 [HOW 86] 1M 1 4.8mm 13.2mm 0.6 180M 260 ns 1400
1987 [MNA 87] 4M 1 4.9mm 14.9mm 0.4 450M 220 ns 2600
1987 [KSE 87] 4M 1 6.4mm 17.4mm 0.4 690M 230 ns 1600
1987 [OFW 87] 4M 1 6.9mm 16.1mm 0.45 550M 220 ns 2200
1987 [MYM 87] 256K 4 4.7mm 13.8mm 0.5 260M 220 ns 1100
1988 [YKMI88] 256K 16 7.5mm 12.7mm 0.4 600M 200 ns 2200
1988 [LCwH 88] 128K 4 8.1mm 9.6mm 0.5 310M 36 ns 2900
1988 [ANH 88] 16M 8.2mm 17.3mm 0.3 1.6G 180 ns 3700
1988 [IYK 88] 16M 1 5.4mm 17.4mm 0.25 1.5G 120 ns 5800
1989 [WOI 89] 4M 4 17.5mm 12mm 0.35 1.7G 190 ns 3200
1989 [FOS 89] 16M 1 7.9mm 17.4mm 0.3 1.5G 150 ns 4600
1989 [CTK 89] 16M 1 8.0mm 16mm 0.28 1.7G 150 ns 4100
1989 [AFM 89] 16M 1 7.7mm 17.5mm 0.25 2.2G 120 ns 4100
1989 [CKC 89] 16M 1 8.5mm 18.4mm 0.3 1.7G 190 ns 3200
1989 [LBK 89] 256K 4 6.8mm 12.3mm 0.5 335M 56 ns 3500
1990 [TTK 90] 16M 1 8.2mm 15.9mm 0.28 1.7G 150 ns 4100
1990 [KDK 90] 4M 1 5.6mm 15.2mm 0.35 700M 120 ns 3100
1990 [KSB 90] 16M 1 7.8mm 18.1mm 0.25 2.3G 150 ns 3100
1991 [NTT 91] 16M 4 9.7mm 20.3mm 0.15 8.8G 180 ns 2700
1991 [MMM 91] 64M 1 12.5mm 18.7mm 0.2 5.8G 120 ns 6000
1991 [TTU 91] 64M 1 19.9mm 11.3mm 0.2 5.6G 120 ns 6200
1991 [OTW 91] 64M 1 9.2mm 19.1mm 0.2 4.4G 90 ns 11000
1991 [YNH 91] 4M 16 234mm² 0.2 5.6G 95 ns 7500
1991 [NNO 91] 4M 1 4.8mm 11.1mm 0.3 592M 60 ns 7400
1992 [HAH 92] 16M 1 8mm 16.6mm 0.3 1.5G 120 ns 6000
1992 [KDK 92] 512K 8 13.2mm 6.4mm 0.4 525M 60 ns 8300
1993 [STN 93] 16M 16 13.6mm 24.5mm 0.13 21G 60 ns 13100
1993 [KHK 93] 64M 4 14.4mm 33.2mm 0.13 30.6G 100 ns 5500
1994 [TNK 94] 1M 18 17.1mm 6.6mm 0.25 1.8G 48 ns 13500
1994 [AOT 94] 32M 8 13.3mm 22.8mm 0.13 19.5G 90 ns 9600
1994 [TTT 94] 32M 8 13.2mm 25.9mm 0.13 21.9G 56 ns 13700
1995 [SMK 94] 32K 9 1.7mm 5.0mm 0.4 53M 50 ns 7000
1995 [SMK 94] 64K 18 2.1mm 4.9mm 0.25 165M 40 ns 11200
Table 4.10: Survey of Peak Memory Logic Capacity (DRAM)
Year Design Organization Size λ area (λ²) cycle gate-evals/λ²s
Pseudostatic
1984 [KSY 84] 32K 9 55mm² 1 55M 125 ns 2700
1991 [SKK 91] 512K 8 6.5mm 14.2mm 0.4 580M 116 ns 3900
Virtually Static
1986 [NSS 86] 128K 8 6mm 13.8mm 0.5 330M 150 ns 1300
Table 4.11: Survey of Peak Memory Logic Capacity (Hybrid)
Year Design Ref. D-Cache gate-evaluations/λ²s
1984 RISC II [SKPS84] 0
1984 MIPS [RPJ 84] 0
1987 MIPS-X [HHC 87] 0
1987 PA-RISC [YFJ 87] 0
1990 PA-RISC [TLB 90] 0
1990 SPARC [MMN 90] 39
1992 SuperSparc [ANAB 92] 250
1992 Alpha [DWA 92] 610
1994 PA-RISC [RDB 94] 0
1994 MIPS [SYN 94] 150
1995 PowerPC [BBB 95] 510
1995 UltraSparc [CDd 95] 330
1995 SPARC V9 [SPA 95] 0
1995 Alpha [BAB 95] 580
1996 MIPS [KDS 96] 170
1996 PA-RISC [LLNK96] 0
1996 ARM [MWA 96] 1000
1996 Alpha [GBB 96] 550
Table 4.12: Survey of Processor On-Chip Memory Capacity
4.5 Field-Programmable Gate Arrays (FPGAs)
Field-Programmable Gate Arrays (FPGAs) are composed of a collection of programmable
gates embedded in a programmable interconnect. Programmable gates are often implemented
using small lookup tables. The small lookup tables with programmable interconnect allow one to
take advantage of the structure inherent in many computations to reduce the amount of memory
and space required to implement a function versus the full memory arrays of the previous section.
Ultimately, this allows FPGA space required for an application to scale with the complexity of the
application rather than scaling exponentially in the manner of pure memories.
Model For pedagogical purposes, we consider an FPGA composed of:
• $N_{4LUT}$ four-input lookup tables (4-LUTs) for gates, with an optional flip-flop on the output of each
LUT which can be used for pipelining or data storage
• “adequate” programmable interconnect to wire up functions using the 4-LUTs
• a minimum operating cycle time, $t_{cycle}$, which accounts for the time to travel through one
LUT and its associated interconnect.
Year  Design  Organization  Size  λ  area (λ²)  cycle  gate-evals/λ²s
1986  Xilinx 2K [CDF 86]  1 CLB (4-LUT)  693µm × 715µm  1  500K  20 ns  100
1988  Xilinx 3K [Xil89, HDJ 88]  64 CLBs (2 4-LUT/CLB)  5mm × 6mm (XC3020 die)  0.6  83M  13 ns  120
1991  UTFPGA [CSA 91]  3 4-LUTs  900µm × 800µm  0.6  2M  7 ns  210
1992  Xilinx 4K [Xil94b]  49 CLBs (2 4-LUT/CLB)  4.8mm × 4.6mm (XC4005 Quadrant)  0.6  61M  7 ns  230
1994  LEGO [Seo94]  4 4-LUTs  1240µm × 1184µm  0.6  4M  4.1 ns  240
1995  DPGA [TEC 95]  16 4-LUTs  1500µm × 1750µm  0.5  10.5M  7 ns†  210
1995  Xilinx 5K [Xil91]  49 CLBs (4 4-LUTS/CLB)  3mm × 3.3mm (XC5206 Quadrant)  0.3  110M  6 ns  290
1995  Altera Flex 8K [Alt95]  1008 LEs (4-LUT/LE)  8mm × 10.5mm (81188A die)  0.3  930M  7.5 ns  144
1995  ORCA 2C [ATT95]  256 PLCs (4 4-LUT/PLC)  10mm × 9.8mm (ATT2C10 die)  0.3  1.1G  7 ns  134
† No context switch – 10 ns cycle for context switch
gate-evals/λ²s = $N_{4LUT} / (A \cdot t_{cycle})$
Table 4.13: Survey of FPGA Capacity
Year Design
1986 Xilinx 2K [CDF 86] 100 2.0×10⁻⁶ 2.0×10⁻⁶
1988 Xilinx 3K [Xil89, HDJ 88] 120 1.5×10⁻⁶ 1.5×10⁻⁶
1991 UTFPGA [CSA 91] 210 1.5×10⁻⁶ 1.5×10⁻⁶
1992 Xilinx 4K [Xil94b] 230 1.6×10⁻⁶ 1.6×10⁻⁶
1994 LEGO [Seo94] 240 9.8×10⁻⁷ 9.8×10⁻⁷
1995 Xilinx 5K 290 1.8×10⁻⁶ 1.8×10⁻⁶
1995 Altera 8K [Alt95] 144 9.3×10⁻⁷ 9.3×10⁻⁷
1995 ORCA 2C [ATT95] 134 1.1×10⁻⁶ 1.1×10⁻⁶
Table 4.14: FPGA Capacity Summary
Capacity Provided Running at full capacity and minimum operating cycle, the FPGA provides
$N_{4LUT}$ gate evaluations per cycle, one per 4-LUT. Modern FPGAs can hold on the order of 2000 4-LUTs and run
at cycle times on the order of 5-10ns. Table 4.13 computes the normalized capacity provided by a
few representative FPGAs. From these numbers we see that FPGAs provide a peak capacity on
the order of 200-300 gate-evaluations/λ²s.
FPGA capacity has not changed dramatically over time, but the sample size is small. There is a
slight upward trend which is probably representative of the relative youth of the architecture.
This peak, too, is not achievable for every application. Some effects which may prevent an
application from achieving this peak include:
• Limited interconnect – When the network connectivity is inadequate, not all of a device's
capacity can be used. This may require either that cells buffer and transmit data or simply
that cells go unused in the area. Conventional FPGA interconnect routes most applications
with over 80% utilization.
• Pipeline efficiency limits – In heavily pipelined systems, capacity can be required to pass
data across pipeline stages to all of its points of consumption. This transit capacity consumes
device capacity without contributing to the evaluation capacity required for the application.
• Limited ability to pipeline operations – Some tasks have cyclic dependencies where a result
is required before the next round of computation can begin. Unless several, orthogonal tasks
are interleaved on the FPGA, the cyclic path length limits the rate at which resources can be
reused and, in turn, prevents the application from fully utilizing the FPGA's capacity.
• Limited I/O Bandwidth – The I/O cycle time on most FPGAs is higher than the logic cycle
time shown in Table 4.13. Data transfer to and from the FPGA may limit the capacity which
can actually be applied to a problem.
• Limited need for this functionality – If a piece of functionality implemented in an FPGA is
not required at the rate and frequency achievable on the FPGA, the FPGA can be employed
far below its available capacity.
• Need for additional functionality – When the FPGA cannot hold the required functional
diversity for a task and must be reprogrammed in order to complete the task, the device goes
partially or entirely unused during the reprogramming cycle.
Figure 4.5: Windowed Average – Pipelined FPGA Implementation
As one example of pipelining, I/O, and functionality limitations, DEC's Programmable Active
Memories ran from 15-33MHz for several applications [BRV92]. At these rates, the peak functional
density extracted from the XC3090's employed was 13-26 gate-evaluations/λ²s, only about 10-20% of
the potential functional density.
Example: Average Calculation Consider, again, our windowed average calculation:
$$y_i = \frac{1}{8}\left(x_{i-3} + x_{i-2} + x_{i-1} + x_i + x_{i+1} + x_{i+2} + x_{i+3} + x_{i+4}\right)$$
Figure 4.5 shows a pipelined datapath to compute this windowed average. A 16-bit add on an
XC4000 part operates in 21ns. Thus a cycle time of 42ns should be achievable if the $x_i$'s are 28
bits each; if they are 12-bit entities, the 21ns cycle would be feasible. With the 32-bit datapath, the
implementation requires:
• 8 pipeline registers of 16 CLBs each (128 CLBs)
• 1 adder of 17 CLBs
• 1 subtractor of 17 CLBs
$$\frac{128\ \text{gate evaluations}}{1.25\text{M}\lambda^2 \times 162 \times 42\text{ns}} \approx 15\ \frac{\text{gate evaluations}}{\lambda^2 s}$$
The cycle time can easily be cut in half by pipelining the two halves of each 32b operation,
effectively doubling yielded functional density.
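The yield figure for this example can be reproduced with a few lines of arithmetic. The sketch below is illustrative and uses hypothetical names; the per-CLB area of 1.25Mλ² is taken from the XC4000 entry in Table 4.13 (61Mλ² for 49 CLBs), and the CLB counts are those listed for the 32-bit datapath.

```python
CLB_AREA = 1.25e6   # lambda^2 per XC4000 CLB (61M lambda^2 / 49 CLBs, Table 4.13)


def fpga_yield(gate_evals, clbs, t_cycle_s, clb_area=CLB_AREA):
    """Yielded gate-evaluations per lambda^2-second for a mapped design."""
    return gate_evals / (clb_area * clbs * t_cycle_s)


# 8 pipeline registers of 16 CLBs each, one 17-CLB adder, one 17-CLB subtractor
clbs = 8 * 16 + 17 + 17
avg_density = fpga_yield(128, clbs, 42e-9)   # roughly 15

if __name__ == "__main__":
    print(clbs, round(avg_density))
```

Halving the cycle time in `fpga_yield` doubles the result, which is the effect of pipelining the two halves of each 32-bit operation, provided the extra pipelining adds no CLBs (an assumption of this sketch).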
Example: Parity Calculation Consider also the FPGA implementation of the 32-bit parity
calculation.
$$p = b_{31} \oplus b_{30} \oplus \cdots \oplus b_0$$
Figure 4.6: 32-bit Parity – 4-LUT Implementation
The FPGA can build the 11 gate parity reduction (see Figure 4.6). The path is three gates long. At
7ns/gate, the unpipelined version operates in roughly 21ns.
$$\frac{11\ \text{gate evaluations}}{1.25\text{M}\lambda^2 \times 11 \times 21\text{ns}} \approx 38\ \frac{\text{gate evaluations}}{\lambda^2 s}$$
Pipelining the XOR-reduction, we can reduce the cycle time and increase yield. If we pipeline at the
gate level and assume that we can only cycle the part at a 10ns cycle due to clocking limitations,
we yield:
$$\frac{11\ \text{gate evaluations}}{1.25\text{M}\lambda^2 \times 11 \times 10\text{ns}} \approx 80\ \frac{\text{gate evaluations}}{\lambda^2 s}$$
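The gate count and path length of the parity reduction follow from grouping the bits four at a time. The sketch below, with the hypothetical function name `parity_4lut_tree`, models each 4-LUT as a 4-input XOR and counts gates and levels; it is an illustration of the reduction, not a description of any particular mapping tool.

```python
def parity_4lut_tree(word, width=32):
    """Reduce `width` bits to one parity bit using 4-input XOR gates (one per 4-LUT).

    Returns (parity, gates_used, depth_in_gates).
    """
    bits = [(word >> i) & 1 for i in range(width)]
    gates = 0
    depth = 0
    while len(bits) > 1:
        next_level = []
        for i in range(0, len(bits), 4):
            out = 0
            for b in bits[i:i + 4]:      # up to four inputs per LUT
                out ^= b
            next_level.append(out)
            gates += 1
        bits = next_level
        depth += 1
    return bits[0], gates, depth


if __name__ == "__main__":
    # 32 bits -> 8 LUTs -> 2 LUTs -> 1 LUT: 11 gates, 3 levels
    print(parity_4lut_tree(0x80000001))
```

The last level uses only two of its LUT's four inputs, which is why 11 gates suffice for 32 bits.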
4.6 Vector and SIMD Processors
Single-Instruction, Multiple-Data (SIMD) machines are composed of a number of process-
ing elements performing identical operations on different data items. Vector processors perform
identical operations on a linear ensemble of data items. At a pedagogical level vector processors
are essentially SIMD processors, though in practice the two architectures have traditionally been
optimized for different usage scenarios.
Model For pedagogical purposes, we consider a SIMD/Vector array composed of:
• $N_{PE}$ processing elements (or vector units), all of which perform the same operation on each cycle
• $w$-bit wide processing elements
Year Design Organization Size λ area (λ²) cycle ALU bit ops/λ²s
1987 DEC MP [Gro87] 32 4 10.4mm 9.4 mm 1 98M 100 ns 13
1990 MP1 [Nic90] 32 4 11.6mm 9.5mm 0.8 170M 70 ns 11
1990 SLAP [FHR94] 4 16 7.9mm 9.2mm 1 73M 100 ns 8.8
1990 BLITZEN [HBD94] 128 1 11mm 11.7mm 0.5 514M 50 ns 5
1993 MP2 [KT93] 32 32 14mm 14mm 0.5 780M 80 ns 16
1994 IMAP [YKF 94] 64 8 15.5mm 15.6mm 0.28 3G 25 ns 6.6
1995 MIT Abacus 1000 PEs 6.5mm 7.3mm 0.5 190M 8 ns 660
[BSV 95] (2 3-LUTs/PE)
1995 MGAP-2 [GOI95] 2 2 0.4mm² 0.4 2.5M 10 ns 160
1996 Sony [KHN 96] 4320 PEs 15.1mm 15mm 0.2 5.7G 20 ns 38
1996 PIP [AKY 96] 128 8 18.8mm 16.7mm 0.19 8.7G 33 ns 3.6
ALU Bit Ops/λ²s = $N_{PE} \cdot w / (A \cdot t_{cycle})$
Table 4.15: Survey of SIMD Processor Capacity
Year Design
1987 DEC MP [Gro87] 13 0 3.4×10⁻⁴
1990 MP1 [Nic90] 11 0 2.4×10⁻⁴
1990 SLAP [FHR94] 8.8 0 ?
1990 BLITZEN [HBD94] 5 0 2.5×10⁻⁴
1993 MP2 [KT93] 5.4 0 5.2×10⁻⁵
1994 IMAP [YKF 94] 6.6 0 6.7×10⁻⁴
1995 MIT Abacus [BSV 95] 660 0 3.0×10⁻⁴
1995 MGAP-2 [GOI95] 160 0 2.6×10⁻⁵
1996 Sony [KHN 96] 38 7.4×10⁻⁷ 2.0×10⁻⁴
1996 PIP [AKY 96] 3.6 0 1.9×10⁻³
Table 4.16: SIMD Processor Capacity Summary
• instruction control and distribution logic
• a minimum operating cycle time, $t_{cycle}$ – the rate at which new instructions can be initiated
in the array
Capacity Provided The SIMD/Vector array provides a peak of $N_{PE} \cdot w$ ALU bit operations
per cycle, or $2 \cdot N_{PE} \cdot w$ gate-evaluations per cycle. Abacus, a modern, fine-grained SIMD array, supports
1000 1-bit PEs and can operate at 125MHz. Abacus thus provides 660 ALU bit ops/λ²s. Table 4.15
computes the normalized capacity provided by several SIMD arrays of varying granularity, and
Table 4.17 shows the composition of a modern vector microprocessor.
SIMD/Vector arrays only achieve their peak capacity when every PE/VU is computing a useful
Year Design Organization Size λ area (λ²) cycle ALU bit ops/λ²s
1995 [ABI 95] 1×32+16×32 16.75mm × 16.75mm 0.5 1.1G 22 ns 22
ALU Bit Ops/λ²s = $N_{PE} \cdot w / (A \cdot t_{cycle})$
Table 4.17: Example Vector Processor Capacity
Year Design
1995 [ABI 95] 22 2.3×10⁻⁷ 1.6×10⁻⁵
Table 4.18: Vector Processor Capacity Summary
logic operation on every cycle. Limitations to achieving this peak include:
• Limited, local interconnect – PEs are typically connected only to a few neighbors. Every
communication operation occupies a PE without providing any gate-evaluation capacity. On
SIMD arrays, when data is moved into the array, around in the array, or out of the array, PEs
can be occupied for several cycles without performing any logical operations.
• Inhomogeneous operation – All PEs are only usefully employed when the same operation is
required on every data bit. When this is not the case, many PEs sit idle or perform no useful
work. PEs are often masked out of operation in order to perform computations selectively
on data bits.
Flynn [Fly72] summarizes some of the limitations associated with SIMD processing.
Example: Average Calculation Returning to our windowed average calculation:
$$y_i = \frac{1}{8}\left(x_{i-3} + x_{i-2} + x_{i-1} + x_i + x_{i+1} + x_{i+2} + x_{i+3} + x_{i+4}\right)$$
Here, we assume the data is resident in PE memory on the array. It could have been loaded via a
background load operation during a previous operation if it started off chip. Groups of 32 PEs are
assigned to each word. To perform the average we shift the target data across the array 8 times
and accumulate at each group of 32 PEs as shown in Figure 4.7. The average takes a total of 30
cycles on 32 PEs to perform what we determined earlier to be 128 gate evaluations:
$$\frac{128\ \text{gate evaluations}}{0.19\text{M}\lambda^2 \times 32 \times 30\ \text{cycles} \times 8\text{ns}} \approx 89\ \frac{\text{gate evaluations}}{\lambda^2 s}$$
Operation cycles
initialize result with local value 1 (assumed)
shift 1
accumulate 3
shift 1
accumulate 3
shift 1
accumulate 3
shift 1
accumulate 3
shift 1
accumulate 3
shift 1
accumulate 3
shift 1
accumulate 3
store result 1 (assumed)
Figure 4.7: Abacus (SIMD) Implementation of Windowed Average
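The schedule in Figure 4.7 can be mimicked in software to see where the 30 cycles come from. The sketch below is an illustration under stated assumptions: `simd_window_sum` is a hypothetical name, each list element stands for one 32-PE word group, positions shifted in from beyond the array are filled with zero, and the window is aligned to start at each element rather than centered as in the equation. The cycle costs (1 per shift, 3 per accumulate, 1 each for initialize and store) are those listed in the figure.

```python
def simd_window_sum(x, shifts=7, shift_cycles=1, acc_cycles=3):
    """Sum each element with its next `shifts` neighbours, SIMD style.

    Every position performs the same operation in lock step. Returns
    (sums, cycles), with cycles counted as in Figure 4.7.
    """
    acc = list(x)                      # initialize result with local value
    cycles = 1
    shifted = list(x)
    for _ in range(shifts):
        shifted = shifted[1:] + [0]    # shift data one position across the array
        cycles += shift_cycles
        acc = [a + s for a, s in zip(acc, shifted)]   # accumulate everywhere
        cycles += acc_cycles
    cycles += 1                        # store result
    return acc, cycles


if __name__ == "__main__":
    sums, cycles = simd_window_sum(list(range(16)))
    print(sums[0], cycles)             # window 0..7 sums to 28; 30 cycles
```

Dividing each sum by 8 is a 3-bit right shift on the word, so it adds no arithmetic to the schedule in this sketch.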
Example: Parity Calculation Consider also the Abacus implementation of the 32-bit parity
calculation.
$$p = b_{31} \oplus b_{30} \oplus \cdots \oplus b_0$$
The SIMD array can perform a series of 31 bitwise shift and xor operations to effect an XOR-scan.
The xor can be folded into the shift such that the scan operation only takes 31 cycles for the shift-XOR
plus one to configure the operation. At the end of the scan operation, the parity result is in the high
(or low) processor of each 32-bit word.
$$\frac{11\ \text{gate evaluations}}{0.19\text{M}\lambda^2 \times 32 \times 32\ \text{cycles} \times 8\text{ns}} \approx 14.5\ \frac{\text{gate evaluations}}{\lambda^2 s}$$
The data memory can be preloaded with a sequence of bypass operations to allow faster accumulation.
The scan can then be performed in $\log_2 32 = 5$ operations, where each operation is 3 cycles
long.
$$\frac{11\ \text{gate evaluations}}{0.19\text{M}\lambda^2 \times 32 \times 15\ \text{cycles} \times 8\text{ns}} \approx 31\ \frac{\text{gate evaluations}}{\lambda^2 s}$$
4.7 Multimedia Processors
Multimedia processors are a recent hybrid of microprocessors, DSP, and Vector/SIMD proces-
sors. Aimed at processing video, graphics, and sound, these processors support efﬁcient operation
on data of various grain sizes by segmenting up their wide-word ALUs to provide SIMD parallel
Year Design Organization Size λ area (λ²) cycle ALU bit ops/λ²s
1995 [Sla95] 128 bits 290mm² 0.25 4.6G 3.3 ns 8
1995 [Sla95]† 128 bits 100mm² 0.25 1.6G 1 ns 80
1996 [Eps95, TNH 96] 4×72 12.8mm × 14mm 0.25 2.9G 16 ns 6.3
† Experimental BiCMOS process
ALU Bit Ops/λ²s = $N_{PE} \cdot w / (A \cdot t_{cycle})$
Table 4.19: Multimedia Processor Capacity
Year Design
1995 [Sla95] 8 1.8×10⁻⁶ 5.6×10⁻⁵
1995 [Sla95] 80 5.1×10⁻⁶ 1.6×10⁻⁴
1996 [Eps95, TNH 96] 6.3 2.2-8.9×10⁻⁸ 0.96-1.2×10⁻⁵
Table 4.20: Summary of Multimedia Processor Capacity
operation on the bytes within the word. This segmentation combats the increasing inefﬁciency
associated with processing small data values on wide-word processors.
From Table 4.19, we see the CMOS multimedia processors have the same peak functional density
as processors. The major difference is that the segmentation allows these processors to operate on
16-bit and byte-wide data without discarding a factor of 4-8 in performance. Of course, this is
true only as long as these finer-grained operations can be performed efficiently in a SIMD manner.
The BiCMOS multimedia processor promised by MicroUnity would have a significantly higher
performance density by exploiting a novel process. The comparison between their architecture in
CMOS and BiCMOS makes it clear that this functional density advantage comes primarily from
the process and not from the architecture.
4.8 Multiple Context FPGAs
Like FPGAs, multicontext FPGAs are composed of a collection of programmable gates em-
bedded in a programmable interconnect. Unlike FPGAs, multicontext devices store several con-
ﬁgurations for the logic and the interconnect on the chip. The additional area for the extra contexts
decreases functional density, but it increases functional diversity by allowing each LUT element to
perform several different functions.
Table 4.21 summarizes the capacities of some experimental, multiple-context FPGAs. Like
FPGAs, these devices may suffer from limited interconnect or application pipelining limits. The
additional context memory makes them less susceptible to functionality limits than traditional
components. Chapter 10 details the usage of multicontext devices including their relative capacity
yield compared to single context devices.
Year  Design  Composition  Per Cycle  Size  λ  area (λ²)  cycle  gate-evals/λ²s
1995  VEGA [JL95]  2048 4-LUT (1 PE)  1  144mm²  0.6  400M  10 ns  0.25
1995  DPGA [TEC 95]  64 4-LUTs (subarray)  16  1500µm × 1750µm  0.5  10.5M  10 ns  150
1996  TSFPGA [CD96]  64 4-LUTs (subarray)  2-8  1.1mm × 1.2mm  0.25  21M  5 ns  19-76
gate-evals/λ²s = $N_{4LUT} / (A \cdot t_{cycle})$
Table 4.21: Survey of Multi-Context FPGA Capacity
Year Design
1995 VEGA [JL95] 0.25 5.1×10⁻⁶ 5.1×10⁻⁶
1995 DPGA [TEC 95] 150 6.1×10⁻⁶ 1.5×10⁻⁶
1996 TSFPGA 19-76 3.0×10⁻⁶ 3-12×10⁻⁶
Table 4.22: Multi-Context FPGA Capacity Summary
Year Reference Organization Size λ area (λ²) cycle ALU bit ops/λ²s
1983 [LRSS84] 1×16 6mm × 6mm 2.0 9M 4×140 ns 3
1991 [D 92] 1×32 150mm² 0.5 600M 62.5 ns 0.9
1991 [FKS91] 2×32 18.85mm × 9.85mm 0.5 740M 20 ns 4.3
1992 [Sei92] 1×32 9.25mm × 10.0mm 0.6 257M 33 ns 3.8
ALU Bit Ops/λ²s = $N_{ALU} \cdot w / (A \cdot t_{cycle})$ (4.3)
Table 4.23: Survey of MIMD Processor Capacity
4.9 MIMD Processors
Contemporary MIMD processors have largely been built from collections of microprocessors.
As such, the functional density of these multiprocessors is certainly no larger than that of the
microprocessors employed for the compute nodes. Since these machines typically require additional
components for routing between processors and to connect processors into the routing network, the
average functional density is actually much lower.
Table 4.23 samples a few processors which were designed explicitly for multiprocessor
implementation. These processors integrate the basic network interface and, in some cases, a portion
of the routing network, onto the device. While the sample size is too small to draw any strong
Year Design Organization Size λ area (λ²) cycle ALU bit ops/λ²s
1992 [CR92] 8×16 6.8mm × 6.7mm (core) 0.6 126M 40ns 25
1995 [YR95] 48×16 11.5mm × 11.2mm (core) 0.5 515M 20ns 75
1996 [MD96] 1×8 1.5mm × 1.2mm (PE) 0.25 29M 10ns 28
ALU Bit Ops/λ²s = $N_{ALU} \cdot w / (A \cdot t_{cycle})$
Table 4.24: Survey of Reconﬁgurable ALU Capacity
Year Design
1992 [CR92] 25 5.1×10⁻⁷ 1.2×10⁻⁵
1995 [YR95] 75 7.5×10⁻⁷ 8.9×10⁻⁶
1996 [MD96] 28 0.14-1.1×10⁻⁶ 7.1×10⁻⁵
Table 4.25: Survey of Reconﬁgurable ALU Capacity
conclusions, the highest capacity implementations show only about half the functional density of
the microprocessors we reviewed in Section 4.1.
4.10 Reconﬁgurable ALUs
Reconfigurable ALUs are composed of a collection of coarse-grain ALUs embedded in a
programmable interconnect. Their word orientation and limitation to ALU operations distinguish
them from FPGAs.
Model For pedagogical purposes, a reconfigurable ALU contains:
• $N_{ALU}$, $w$-bit ALUs
• “adequate” programmable interconnect to wire up functions of the ALUs
• a minimum operating cycle time, $t_{cycle}$, which accounts for the time to operate in one ALU
or traverse the interconnect between ALUs.
• optionally, a small instruction store associated with each ALU
Capacity Provided Running at full capacity and minimum operating cycle, the reconfigurable
ALU provides $N_{ALU} \cdot w$ ALU bit operations per cycle. Experimental reconfigurable ALUs achieve
roughly 50 ALU bit operations/λ²s.
Like a processor D-cache, the memory on MATRIX can be used as a large lookup table. Using
the MATRIX 256×8 memory for function lookup, MATRIX can achieve up to 440 4-LUT
gate-evaluations/λ²s.
Like processors, reconfigurable ALUs may suffer lower yield due to:
• Mismatched grain-size and limited ALU control – When fine-grain operations are required,
the word-wide interconnect limits which bits may interact with each other. ALU operations
are word-wide SIMD, making sub-word operations awkward and inefficient.
Unlike processors, the reconﬁgurable interconnect allows these architectures to avoid much of
the data movement overhead necessary on processors. Like FPGAs, pipelining, interconnect, and
functionality limits may prevent full utilization.
Example: Average Calculation Returning to our windowed average calculation:
$$y_i = \frac{1}{8}\left(x_{i-3} + x_{i-2} + x_{i-1} + x_i + x_{i+1} + x_{i+2} + x_{i+3} + x_{i+4}\right)$$
Figure 4.8 shows a pipelined datapath to compute this windowed average on MATRIX. In this
scheme, 4 BFUs (see Figure 1.4 and Chapter 13) are used to serve as an 8 value delay register, and
4 are used to perform the addition and subtraction. Two cycles are required for each result so that a
single datapath can be used for the add and subtract and so that the single memory can provide one
read and one write cycle. The implementation yields:
$$\frac{128\ \text{gate evaluations}}{28.8\text{M}\lambda^2 \times 9 \times 2\ \text{cycles} \times 10\text{ns}} \approx 25\ \frac{\text{gate evaluations}}{\lambda^2 s}$$
Example: Parity Calculation Consider also the MATRIX implementation of the 32-bit parity
calculation.
$$p = b_{31} \oplus b_{30} \oplus \cdots \oplus b_0$$
The most straightforward implementation uses the memory as an 8-LUT to calculate the parity of
8-bit data chunks. A total of 5 such chunks will perform the entire calculation (see Figure 4.9).
Assuming pipelined operation of the first four and final reductions:
$$\frac{11\ \text{gate evaluations}}{28.8\text{M}\lambda^2 \times 5 \times 10\text{ns}} \approx 7.6\ \frac{\text{gate evaluations}}{\lambda^2 s}$$
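The five-lookup scheme can be sketched in software by treating a 256-entry table as the 8-LUT. This is an illustration only: `PARITY8` and `parity32_via_8lut` are hypothetical names, and packing the four partial results into the low bits of the final lookup address is an assumption about how the last reduction is fed.

```python
# Contents of a 256 x 1 lookup: parity of each possible 8-bit address.
PARITY8 = [bin(v).count("1") & 1 for v in range(256)]


def parity32_via_8lut(word):
    """Compute 32-bit parity with four byte lookups plus one combining lookup.

    Returns (parity, lookups_used).
    """
    partial = [PARITY8[(word >> (8 * k)) & 0xFF] for k in range(4)]
    # Pack the four partial parities into one address for the final lookup.
    packed = sum(bit << k for k, bit in enumerate(partial))
    return PARITY8[packed], len(partial) + 1


if __name__ == "__main__":
    print(parity32_via_8lut(0x12345678))
```

The final lookup uses only 4 of its 8 address bits, one source of the gap between this mapping and the 440 gate-evaluations/λ²s peak quoted for the MATRIX memory.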
4.11 Summary
Table 4.26 summarizes the observed computational densities for the general-purpose architec-
ture classes reviewed in this section.
Memories provide the highest programmable capacity of any of the devices reviewed. However,
they only yield this capacity on the most complex functions – those whose complexity is, in fact,
exponential in the number of input bits. The capacity they provide is not robust in the face of less
complex tasks.
Reconfigurable devices provide the highest general-purpose capacity which can be deployed to
application needs. Unlike memories, capacity consumption scales along with problem complexity.
Their peak performance is 10× all non-reconfigurable architectures, with the exception of large,
Figure 4.8: Windowed Average – MATRIX Implementation
Figure 4.9: 32-bit Parity – MATRIX Implementation
Architecture gate-evals/λ²s Limitations
Memory 1500-15000 most complicated functions only
SIMD (1000’s of PEs) 60-1200 highly homogenous computations only
FPGA 100-300 regular, highly pipelined, computations only
RALUs 50-150 semi-regular, word-wide operations
Few context DPGAs 30-150 semi-regular operations
Vector/VLIW 20-50 coarse-grain, semi-regular operations
SIMD (100’s of PEs) 10-30 homogenous computations
Processors/Multimedia 4-20 word-wide operations
DSPs 2-20 word-wide operations
Highly multicontext FPGAs 0.25
Table 4.26: General-Purpose Computational Capacity Summary
well engineered SIMD arrays. Fine-grained devices, such as FPGAs, are robust to grain-size
variation, as well. Reconﬁgurable architectures are not, however, robust to tasks with functional
diversity larger than the aggregate device capacity. Multicontext devices, such as the DPGA,
sacriﬁce a portion of the peak FPGA capacity density to partially mitigate this problem – providing
support for much higher on chip functional diversity.
Large SIMD or vector arrays have high peak performance because they amortize a single
stream of instruction control, bandwidth, and memory among a large number of active computing
elements. They handle high diversity with the ability to issue a new instruction on each cycle.
However, they require very large granularity operations in order to efficiently use the computational
resources in the array.
Processors are robust to high functional diversity, but achieve this robustness at a large cost
in available capacity – 10× below reconfigurable devices. They also give up fine-grain control of
operations, creating a potential for another 10× loss in performance when irregular, fine-grained
operations are required. Vector and VLIW structures provide slightly higher capacity density for
very stylized usage patterns, but are less robust to tasks which deviate from their stylized control
paradigm.
Here we see distinctions in granularity, operation diversity, and yieldable capacity. The key
issue we used to classify architectures was the way the devices store and distribute instructions to
processing elements. Characterizing instructions and interconnect issues with a focus on RP-space
is the goal of Part III.
5. Case Study: Multiply
In this segment we review hardwired, programmable, and conﬁgurable multiply implementations.
The custom multiplier implementations show us the functional density achievable by custom
hardware on its intended task for comparison with the general-purpose structures reviewed in
Chapter 4.
We use the multiply operation for this comparison because it is relatively simple and important
to many computing tasks including signal processing. Because of its importance and regularity, it
has received much attention over the years including many, high quality, custom implementations.
Multiply is probably one of the ﬁrst computational operators to be implemented in most new
VLSI processes. Considering the amount of attention given to custom multiply implementations,
the comparison between custom multiplies and conﬁgurable implementations represents an upper
bound on the performance disparity between custom and conﬁgurable implementations. Few
functions, if any, should show a larger disparity, and most show a signiﬁcantly smaller disparity.
Multiply is also interesting since it is the ﬁrst piece of custom logic added to “general-purpose”
processors.
In this section we use a domain specific metric for functional capacity, the multiply bit operation
(MPY bit op). To allow us to compare multiplies of various sizes, we assume each $n \times m$ multiply
requires $n \cdot m$ MPY bit ops. As such, we measure multiply functional density in MPY Bit Ops/λ²s
and compute it as shown in Equation 5.1.
$$\text{MPY Bit Ops}/\lambda^2 s = \frac{n \cdot m}{A \cdot t_{cycle}} \quad (5.1)$$
An $n \times n$ multiply can be done in less than $n^2$ operations (see for example [Knu81]), but, for
the multiplies reviewed here, all of the circuits and algorithms do scale as $n^2$.
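As a worked instance of Equation 5.1, the sketch below computes the multiply functional density for two 8×8 rows of Table 5.1. The function name `mpy_density` is hypothetical; the areas and cycle times are the tabulated values for [LGC84] (0.56Mλ², 120 ns) and [NSLKE86] (2.4Mλ², 3 ns).

```python
def mpy_density(n, m, area_lambda2, t_cycle_s):
    """MPY bit ops per lambda^2-second for an n x m multiplier (Equation 5.1)."""
    return n * m / (area_lambda2 * t_cycle_s)


lgc84 = mpy_density(8, 8, 0.56e6, 120e-9)    # tabulated as 960
nslke86 = mpy_density(8, 8, 2.4e6, 3e-9)     # tabulated as 8900

if __name__ == "__main__":
    print(round(lgc84), round(nslke86))
```

The small differences from the tabulated values are consistent with rounding of the area figures in the table.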
5.1 Custom Multipliers
Table 5.1 summarizes the performance of numerous custom multipliers according to
Equation 5.1. Implementations range from sub 1000 to almost 9000 MPY Bit Ops/λ²s with 2000-
4000 MPY Bit Ops/λ²s representing the range of typical, high-performance, custom multipliers.
Like processors, there is no clear trend for improvement with time or decreasing feature size. The
latest designs, if anything, show a tendency to emphasize latency over throughput, resulting in lower
functional density.
5.2 Semicustom Multipliers
Table 5.2 shows a few sample semicustom multiplier implementations. At 330 and 560 MPY bit ops/λ²s, the gate array and standard cell implementations provide a factor of 5-10 less functional density than the custom implementations.
| Year | Design | Organization | Size | λ (µm) | Area (λ²) | Cycle | MPY bit ops/λ²s |
|------|--------|--------------|------|--------|-----------|-------|-----------------|
| 1984 | [LGC84] | 8×8 | 1.25mm² | 1.5 | 0.56M | 120 ns | 960 |
| | | 16×16 | 5mm² | 1.5 | 2.2M | 120 ns | 960 |
| 1984 | [UKY84] | 24×24 | 3.8mm×3.8mm | 1.0 | 14.4M | 71 ns | 560 |
| 1985 | [GGA+85] | 32×32 | 5.3mm×5.7mm | 1.0 | 30M | 56 ns | 600 |
| 1985 | [HFML85] | 16×16 | 1.7mm×1.7mm | 0.75 | 5.1M | 40 ns | 1250 |
| 1986 | [NSLKE86] | 8×8 | 1.5mm×0.4mm | 0.5 | 2.4M | 3 ns | 8900 |
| 1987 | [LGS87] | 8×8 | 0.61mm×0.58mm | 0.5 | 1.4M | 9.5 ns | 4800 |
| 1988 | [KKHY88] | 32×32 | 3.2mm×5.2mm | 1.0 | 17M | 59 ns | 1000 |
| 1988 | [SJ88] | 4×4 | 1.37mm² | 1.0 | 1.4M | 16 ns | 730 |
| 1989 | [SH89] | 64×64 | 3.8mm×6.5mm | 0.8 | 39M | 47 ns | 2300 |
| 1989 | [SLM+89] | 16×16 | 1.55mm×1.44mm | 0.25 | 36M | 6.75 ns | 1100 |
| 1990 | [ADD90] | 32×32 | 9880 mil² | 0.5 | 25.5M | 35 ns | 1150 |
| | | 24×16 | 3819 mil² | 0.5 | 9.9M | 28 ns | 1400 |
| | | 16×16 | 2888 mil² | 0.5 | 7.5M | 22 ns | 1600 |
| 1990 | [YYN+90] | 16×16 | 1.3mm×3.1mm | 0.25 | 64M | 3.8 ns | 1000 |
| 1990 | [SA90] | 56×56 | 3.4mm×6.5mm | 0.5 | 88M | 30 ns | 1200 |
| 1991 | [MNH+91] | 54×54 | 3.62mm×3.45mm | 0.25 | 200M | 10 ns | 1500 |
| 1992 | [FHT+92] | 24×24 | 3.42mm×4.5mm | 0.6 | 43M | 30 ns | 450 |
| 1992 | [GSNS92] | 54×54 | 3.36mm×3.85mm | 0.4 | 81M | 13 ns | 2800 |
| 1993 | [LS93] | 12×12 | 2.5mm×3.7mm | 0.5 | 37M | 5 ns | 780 |
| 1993 | [SV93] | 8×8 | 1.5mm×1.4mm | 0.8 | 3.3M | 4.3 ns | 4500 |
| 1994 | [KHANW94] | 11×11 | 1.53mm² | 1.0 | 1.5M | 22 ns | 3600 |
| | | 11×16 | 0.9mm² | 0.6 | 2.5M | 19 ns | 3700 |
| 1995 | [OSS+95] | 54×54 | 3.77mm×3.41mm | 0.125 | 823M | 4.4 ns | 810 |
| 1995 | [IIF+95] | 16×16 | 0.77mm×0.72mm | 0.125 | 35M | 10 ns | 730 |
| 1996 | [HKKM96] | 54×54 | 17mm² | 0.15 | 760M | 2.5 ns | 1500 |
| 1996 | [LE96] | 4×4 | 0.224mm² | 0.5 | 0.90M | 17 ns | 1100 |
| 1996 | [MNS+96] | 54×54 | 3.1mm×3.1mm | 0.25 | 150M | 8.8 ns | 2200 |
| 1996 | [MYO+96] | 32×32 | 2.35mm² | 0.2 | 59M | 18 ns | 980 |

Table 5.1: Survey of Multiplier Capacity
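| Style | Year | Design | Organization | Size | λ (µm) | Area (λ²) | Cycle | MPY bit ops/λ²s |
|-------|------|--------|--------------|------|--------|-----------|-------|-----------------|
| Gate Array | 1987 | [BMNW87] | 16×16 | 14.4mm² | 0.75 | 26M | 30 ns | 330 |
| Standard Cell | 1993 | [FA93] | 16×16 | 3mm² | 0.63 | 7.7M | 60 ns | 560 |
| Layout Generator | 1993 | [FA93] | 16×16 | 1mm² | 0.63 | 2.6M | 40 ns | 2500 |

Table 5.2: Sample Semi-Custom Multiplier Capacity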
| Architecture | Reference | Multiply | Op area and time | MPY bit ops/λ²s |
|--------------|-----------|----------|------------------|-----------------|
| Processor (basic ALU ops) | [SKPS84] | 8×8 | 41 instructions | 0.3 |
| | | 16×16 | 81 instructions | 0.7 |
| Processor (w/mstep) | [Cho89] | 8×8 | 10 instructions | 2 |
| | | 16×16 | 18 instructions | 4 |
| | | 32×32 | 34 instructions | 9 |
| Processor (w/ booth step) | [RPJ+84] | 16×16 | 9 instructions | 4 |
| Processor (w/ multiplier) | [BBB+95] | 64×64 | 2 per cycle | 250 |
| DSP (16×16 MAC) | [WDW+85] | 16×16 | 1 cycle | 165 |
| DSP (16×16 MAC) | [Gol87] | 16×16 | 1 cycle | 23 |
| DSP (16×16 MAC) | [PML+89] | 16×16 | 1 cycle | 13 |
| DSP (2×16×16 MAC) | [USO+93] | 16×16 | 0.5 cycles | 10 |
| DSP (32×32 MAC) | [NHK95] | 32×32 | 1 cycle | 89 |
| Memory | [SMK+94] | 8×8 | 1 64K×18 block | 10 |
| | [SKS+93] | 11×11 | 6 ICs | 0.3 |
| SIMD | [BSV+95] | 8×8 | 8 PEs, 66 cycles | 80 |
| | | 16×16 | 16 PEs, 126 cycles | 84 |
| | | 32×32 | 32 PEs, 235 cycles | 90 |
| SIMD (ALU only) | [YKF+94] | 8×8 | 1 PE, 40 cycles | 1.3 |
| SIMD (w/ lookup) | | 8×8 | 1 PE, 11 cycles | 4.8 |
| Vector (w/ 16×16 mpy) | [ABI+95] | 16×16 | 8 per cycle | 82 |
| FPGA | [ATT94] | 8×8 | 27 PLCs, 19 ns | 30 |
| | [Alt96] | 8×8 | 164 LEs, 49 ns | 8.6 |
| | [LE94] | 8×8 | 66 CLBs, 102 ns | 7.6 |
| | | 16×16 | 102 CLBs, 152 ns | 13 |
| | | 32×32 | 174 CLBs, 254 ns | 18.5 |
| | | 200×200 | 930 CLBs, 1320 ns | 26 |
| | [ID95] | 16×16 | 316 CLBs, 26 ns | 25 |
| | [ID95] | 16×16 | 88 CLBs, 120 ns | 19 |
| PADDI2 | [YR95] | 16×8 | 4 PEs, 50 MHz | 150 |
| MATRIX | | 8×8 | 1 BFU, 20 ns | 110 |
| | | 16×16 | 6 BFU, 20 ns | 74 |

Table 5.3: Survey of Programmable Multiply Capacity
5.3 General-Purpose Multiply Implementations
For comparison, Table 5.3 summarizes the capacity density of several configurable and programmable implementations. Processors without specialized multiply support show a factor of 10,000 lower performance density than hardwired multipliers. Processors with multiply or Booth step operations have only a factor of 1,000 lower performance density. FPGAs are a factor of 100-300 less dense than custom hardware. Processors, DSPs, and reconfigurable ALUs with integrated multipliers are only a factor of 10-20 lower in performance density. Figure 5.1 shows these basic relationships.
//R1,R2 hold inputs
ADD R0,R0,R3
//repeated for number of bits in R2 input
AND R1,#1,R4 //mask low bit
JUMP lequ,ZBITn //skip add if zero
SLL R1,#1,R1 //delayed branch slot
ADD R3,R1,R3 //add in scaled term
ZBITn: SRA R2,#1,R2 //scale for next add
//result in R3
Table 5.4: Multiply Using Standard ALU Operations
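The shift-and-add scheme that Table 5.4 expresses in ALU instructions can be sketched in a few lines of Python. This is an illustrative rendering of the general technique, not a transcription of the listing: the function name is hypothetical, and the sketch tests the low bit of the multiplier `b` while shifting the multiplicand `a`, which is the textbook arrangement and may differ from the register usage in the table.

```python
def shift_add_multiply(a, b, bits):
    """Multiply two unsigned `bits`-wide integers using only AND, ADD, and shifts."""
    result = 0
    for _ in range(bits):   # one pass per bit of the multiplier b
        if b & 1:           # mask low bit; skip the add when it is zero
            result += a     # add in the scaled multiplicand
        a <<= 1             # scale the multiplicand for the next bit
        b >>= 1             # expose the next multiplier bit
    return result

print(shift_add_multiply(13, 11, 8))
```

Each pass costs a handful of ALU operations, which is why the instruction counts in Table 5.3 for processors without multiply support grow linearly with operand width.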
[Figure: chart on a logarithmic axis, "Multiply Bit Op Density" (0.1 to 10,000), with entries for Custom, Semicustom, Processor (ALU Ops), Processor (w/ mstep), Processor (w/ MPY), RALU (w/ MPY), and FPGA.]
Figure 5.1: Comparison of Programmable and Custom Multiply Functional Densities
5.4 Hardwired Functional Units in “General-Purpose Devices”
One thing we note from Table 5.3 is that processors with integrated multipliers provide roughly 10% of the performance density of a custom multiplier. This comes about simply by dedicating 10% of the processor real-estate to hold a custom multiplier. Because of the importance of the multiply function in many applications and the 100-1,000× performance density differential achievable by setting aside this 10%, many processors and all DSPs augment the general-purpose core with a hardwired multiplier. Custom multiply and floating-point logic are the two main pieces of custom logic which have been regularly integrated onto conventional “general-purpose” computing devices for this reason.
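| Structure | Reference | 64×64 | 54×54 | 32×32 | 16×16 | 8×8 | 4×4 |
|-----------|-----------|-------|-------|-------|-------|-----|-----|
| Custom 64×64 | [SH89] | 2300 | 1600 | 560 | 140 | 35 | 9 |
| Custom 54×54 | [GSNS92] | | 2800 | 970 | 240 | 60 | 15 |
| Processor (w/multiplier) | [BBB+95] | 250 | 180 | 63 | 16 | 4 | 1 |
| Processor (w/mstep) | [Cho89] | | | 9 | 4 | 2 | 0.8 |
| Processor (ALU Ops) | [SKPS84] | | | | 0.7 | 0.3 | 0.2 |
| FPGA | [LE94] | 23 | 21 | 19 | 13 | 8 | 3.5 |

Table 5.5: Yielded Multiply Capacity as a Function of Granularity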
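| Architecture | Reference | Multiply | Op area and time | MPY bit ops/λ²s |
|--------------|-----------|----------|------------------|-----------------|
| Processor | [Cho89] | 8×8 | 8 instructions | 2 |
| | | 16×16 | 16 instructions | 5 |
| Memory | [SMK+94] | 16×16 | 2 64K×18 block | 19 |
| | [SKS+93] | 22×22 | 11 ICs | 0.7 |
| FPGA | [Cha93] | 8×8 | 22 CLBs, 25 ns | 93 |
| | | 16×16 | 84 CLBs, 40 ns | 61 |

Table 5.6: Survey of Specialized Programmable Multiply Capacity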
5.5 Multiplication Granularity
A custom multiplier is often called upon to perform multiplies for a variety of data sizes. When
multiplying operands smaller than the native multiply size, the custom multiplier yields lower
multiply functional density than indicated in Table 5.1. Table 5.5 compares the yielded capacity of
the various custom and programmable multipliers reviewed above.
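The custom rows of Table 5.5 are consistent with a simple scaling argument, sketched below under the assumption that an $n \times n$ hardwired multiplier spends its full area and cycle time on an $m \times m$ multiply while delivering only $m^2$ bit ops. The function name and this scaling assumption are illustrative, not taken from the report.

```python
def yielded_density(native_density, native_bits, operand_bits):
    """Density delivered when an n x n multiplier performs an m x m multiply (m <= n)."""
    if operand_bits > native_bits:
        raise ValueError("operand is wider than the native multiplier")
    # Same area and time, but only m^2 of the n^2 bit ops are useful.
    return native_density * (operand_bits / native_bits) ** 2

# 64x64 custom multiplier from Table 5.1 ([SH89], 2300 MPY bit ops / lambda^2 s)
for m in (64, 32, 16, 8, 4):
    print(m, round(yielded_density(2300, 64, m)))
```

The printed values track the [SH89] row of Table 5.5 to within rounding.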
5.6 Specialized Multiplication
In many applications, one of the operands in the multiply is a constant, or changes slowly. In these cases, the operation complexity is slightly reduced, in general, and may be greatly reduced in particular circumstances. Hardwired, 2-operand multipliers cannot take advantage of this reduced complexity whereas programmable and configurable devices can. Table 5.6 summarizes the multiply capacity provided on specialized multiplies. For comparison with the previous tables, the multiply capacity density is calculated as if it is performing a full multiply. It might be more accurate to say the complexity of the problem decreased rather than the density of multiply bit ops increased, but the ratio of the performance density numbers is the same whichever way we view it. Note that the densities shown in Table 5.6 apply for any constant operand. Particular operands may admit much tighter implementations.
5.7 Summary
In general, reconfigurable devices achieve 100-300× lower capacity density than their custom multiply counterparts. At the same time, they achieve 10-30× better performance than a processor building a multiply out of ALU operations. For this particular operation, most processors include a specialized multiply-step operation, which brings them closer to parity with the reconfigurable devices, or integrate a custom multiplier, which gives them a 10× advantage over the reconfigurable devices. Reconfigurable devices which also include custom multiply support achieve about the same multiply density as processors with integrated, custom multipliers. When large, custom multiplier arrays are used on small data, the gap between the custom devices and the reconfigurable devices narrows. Similarly, when a multiply operand is constant or slowly changing, reconfigurable devices may exploit the reduction in operation complexity to narrow the density gap.
6. High Diversity on Reconfigurables
We have already noted that conventional FPGAs are poor at handling a functional diversity which is larger than the aggregate functional capacity provided by a single device (Section 4.5). Handling larger diversity may require reloading the FPGA programming, a slow process for conventional FPGAs. During the reload time, the device goes largely unused. Alternately, a more generic processing unit can be built on top of the FPGA and microsequenced like a processor. In the most extreme case of spatial limitations, we might end up building a processor-like design on top of the FPGA. Table 6.1 summarizes the capacity density provided by several processors which have been built on top of FPGAs.
From Table 6.1, we see that such processors, when optimized for the FPGA, have a peak capacity of about 2 ALU bit operations/λ²s, or about one fourth the capacity of a custom processor. The architectures for R16 and jr16 are moderately straightforward RISC processor architectures, and are likely to yield about the same fraction of this capacity as most other RISC processors.
At a 4× penalty from custom processors, for high diversity operations, one would certainly be better off using, or building, a custom processor. As the commonality in the computational task increases and the area available to the FPGA increases, the FPGA can build more application specialized structures, realizing higher capacity density. This suggests there is a continuum between the most highly diverse functional operations, where FPGAs are 4× less dense than processors, to the most regular operations, where FPGAs provide 10-100× more performance density.
It is also interesting to note that the performance density penalty for handling these highly
diverse operations on an FPGA is much less than the performance density penalty associated with
implementing a multiplication on the FPGA.
With only a 4× performance density penalty, an FPGA processor is roughly equivalent to a 4× smaller processor. From Table 4.2, we have seen aggregate processor capacity increase from 15Mλ² in 1984 to 5Gλ² in 1995, or about 70% per year. The 4× capacity density thus puts a processor implemented on an FPGA in a modern process roughly equivalent to a 2.5-3 year old processor. As such, FPGA processors, which can ride the FPGA technology to track technology advances, may be an attractive option for running legacy assembly code.

| Year | Design | Organization | Design Size | Area (λ²) | Cycle | ALU bit ops/λ²s |
|------|--------|--------------|-------------|-----------|-------|-----------------|
| 1991 | Fliptronics R16 [Fre94] | 1×16 | 150 XC4K CLBs | 190M | 50 ns | 1.7 |
| 1994 | nP [WHG94] | 1×8 | 40 XC3K CLBs | 52M | 3×30 ns | 1.7 |
| 1994 | MacDLX [Dur94] | 1×32 | 1000 XC4K CLBs | 1.2G | 500 ns | 0.05 |
| 1994 | jr16 [Gra94] | 1×16 | 200 XC4K CLBs | 250M | 25 ns | 2.6 |
| 1996 | j32 [Gra96] | 1×32 | 250 XC4K CLBs | 310M | 63 ns | 1.6 |
| 1996 | Hokie [GHH+96] | 1×16 | 140 XC4K CLBs | 175M | 63 ns | 1.5 |

ALU bit ops/λ²s = (number of ALUs × ALU width) / (area in λ² × cycle time)

Table 6.1: Survey of FPGA-Implemented Processor Capacity
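The arithmetic behind the "70% per year" and "2.5-3 year old" figures can be reproduced directly from the endpoints quoted above. The sketch below assumes simple compound growth between 1984 and 1995; the variable names are illustrative.

```python
import math

start_year, start_capacity = 1984, 15e6   # 15M (from the text above)
end_year, end_capacity = 1995, 5e9        # 5G (from the text above)

# Compound annual growth factor over the 11-year span.
annual_growth = (end_capacity / start_capacity) ** (1 / (end_year - start_year))

# Years of growth needed to make up a 4x density penalty.
years_behind = math.log(4) / math.log(annual_growth)

print(f"annual growth: {100 * (annual_growth - 1):.0f}%")
print(f"4x penalty is worth about {years_behind:.1f} years")
```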
Part III
Structure and Composition of
Reconﬁgurable Computing Devices
7. Interconnect
Programmable interconnect is the dominant contributor to die area and cycle time in conﬁgurable
devices. To support their large, active functional density, the computational units must be richly
interconnected and support highly parallel data routing. FPGAs, more than other general-purpose
devices, place most of their area into interconnect.
We review interconnect issues in the context of on-chip networks for reconﬁgurable archi-
tectures. We establish typical size and delay contributions by analyzing conventional FPGA
implementations, then we look at how resource requirements grow with increasing array size.
Understanding conventional sizes and growth factors helps us characterize the design space. It also
serves as background context for the architectural developments described in Part IV.
In this chapter, we:
1. Decompose FPGA area into three component parts and establish the relative areas of each:
ﬁxed logic, conﬁguration memory, interconnect resources
2. Review issues in conﬁgurable network design
3. Establish growth rates for interconnect and description requirements as a function of network
size
4. Establish relationships between network size and richness of network interconnect
5. Examine the efﬁciency of device utilization when viewed in relation to network resource
utilization rather than programmable gate utilization
6. Examine the effects of multibit granularity on interconnect resource requirements
7.1 Dominant Area and Delay
7.1.1 Fixed Area
Reviewing LUT-based FPGA implementations from Table 4.13, and calculating the area per 4-LUT (Table 7.1), we see that each 4-LUT is roughly 600Kλ². The flip-flop and 16:1 LUT multiplexor make up very little of this area, easily less than 20Kλ². [BFRV92] estimates the area of the 4-LUT multiplexor with flip-flop as 13Kλ². In our own DPGA implementation these items occupied 15Kλ² (see Chapter 10). The majority of the area associated with each 4-LUT (97%) goes into programmable interconnect and configuration memory.
This breakdown, alone, shows us one reason why a full 4-input lookup table is often used as
the programmable logic element, rather than a more restricted gate. The area required for the full
LUT, including its conﬁguration memory, is less than 10% of the area of the 4-LUT cell, such that
there is little advantage to reducing the cell’s functional size.
| Year | Design | λ² area / 4-LUT |
|------|--------|-----------------|
| 1986 | Xilinx 2K [CDF+86] | 500K |
| 1988 | Xilinx 3K [Xil89, HDJ+88] | 650K |
| 1991 | UTFPGA [CSA+91] | 670K |
| 1992 | Xilinx 4K [Xil94b] | 630K |
| 1994 | LEGO [Seo94] | 1020K |
| 1995 | DPGA [TEC+95] | 660K |
| 1995 | Xilinx 5K [Xil91] | 560K |
| 1995 | Altera 8K [Alt95] | 930K |
| 1995 | Orca 2C [ATT95] | 1060K |

Table 7.1: FPGA 4-LUT Size
| Part | Approximate Bits/4-LUT |
|------|------------------------|
| Xilinx xc2k | 160 |
| Xilinx xc3k | 100 |
| Xilinx xc4k | 200 |
| Xilinx xc5k | 120 |
| UTFPGA | 48 |
| LEGO | 120 |
| DPGA | 4×40 |
| Altera 8k | 190 |
| Orca 2C | 120 |

Table 7.2: Bits per 4-LUT
7.1.2 Interconnect and Conﬁguration Area
The number of programming bits per 4-LUT for these devices is summarized in Table 7.2. Using a rather large memory cell (roughly 4.5Kλ²/bit), the memory accounted for 35% of the area on UTFPGA. With 4 contexts and 600λ² 3T-DRAM memory cells, memory only occupied 33% of the area on the DPGA. If we assume 1000λ² static memory cells for the Xilinx parts, memory accounts for about 15-30% of that area (160Kλ²/500Kλ² (32%), 100Kλ²/650Kλ² (15%), 200Kλ²/630Kλ² (32%), 120Kλ²/560Kλ² (21%)). Making similar assumptions, memory accounts for 21% of an Altera 8K part (190Kλ²/930Kλ²) and 11% (120Kλ²/1060Kλ²) of an Orca 2C part. Interconnect and routing occupies the balance of the area (70-90%).
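The memory fractions quoted above follow from combining the bits per 4-LUT of Table 7.2 with the area per 4-LUT of Table 7.1. The sketch below repeats that calculation under the same 1000λ² static memory cell assumption; the dictionary layout and names are illustrative.

```python
CELL_AREA = 1000  # assumed static memory cell area in lambda^2, as in the text

# name: (configuration bits per 4-LUT, area per 4-LUT in lambda^2)
parts = {
    "Xilinx 2K": (160, 500e3),
    "Xilinx 3K": (100, 650e3),
    "Xilinx 4K": (200, 630e3),
    "Xilinx 5K": (120, 560e3),
    "Altera 8K": (190, 930e3),
    "Orca 2C": (120, 1060e3),
}

# Fraction of each 4-LUT's area taken by configuration memory.
memory_fraction = {name: bits * CELL_AREA / area for name, (bits, area) in parts.items()}
for name, frac in memory_fraction.items():
    print(f"{name}: {100 * frac:.0f}% memory")
```

Whatever is left after subtracting this fraction and the few percent of fixed logic is the interconnect share.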
7.1.3 Delay
Most vendors lump interconnect timing in with lookup table evaluation, making it difficult to distinguish the components of delay. Table 7.3 summarizes interconnect and LUT logic delay for Altera’s 8K series [Alt95] and our own experience with the DPGA (Chapter 10). From here, we see that interconnect typically accounts for 80% of the path delay.

| Design | Path | Total Delay | LUT delay | Interconnect |
|--------|------|-------------|-----------|--------------|
| Altera 8K [Alt95] | LUT-local-LUT | 2.5 ns | 2 ns | 20% |
| | LUT-row-local-LUT | 7.5 ns | 2 ns | 73% |
| | LUT-row-column-local-LUT | 10.5 ns | 2 ns | 81% |
| MIT DPGA [TEC+95] | LUT-LUT (in subarray) | 3.5 ns | 1.5 ns | 60% |
| | LUT-xbar-LUT | 7 ns | 1.5 ns | 80% |

Table 7.3: FPGA Delay Breakdown
7.2 Problems with “Simple” Networks
FPGA networks, which already need to interconnect thousands of independent processing elements, do not, typically, look like conventional multiprocessor networks. In particular, a number of conceptually “simple” network structures commonly used as the basis for multiprocessor networks do not scale properly for use in FPGAs. In this section, we review three typical organizations and highlight their shortcomings on the scale required for FPGA networks.
1. crossbars
2. multistage networks
3. mesh networks
This review helps identify and motivate important design issues for reconﬁgurable interconnect
which we will address in the following section.
7.2.1 Crossbars
To guarantee arbitrary, full connectivity among elements, we could build a full crossbar for the interconnection network. In such a scheme we would not have to worry about whether or not a given network could be mapped onto the programmable interconnect nor would we have to worry about where logic elements were placed. Unfortunately, the cost for this full interconnect is prohibitively high.
For an $N$ element array where each element is a $k$-input function (e.g. $k$-LUT), the crossbar would be an $N \times kN$ crossbar. Arranged in a roughly square array, each input and output must travel $O(\sqrt{N})$ distance, before we account for saturated wire density. Since interconnect delay is proportional to interconnect distance, this implies the interconnect delay grows at least as $O(\sqrt{N})$. However, the bisection bandwidth for any full crossbar is $O(N)$. For sufficiently large $N$, this bisection bandwidth requires that the side of an array be $O(N)$ to accommodate the wires across the bisection. In turn, the bisection bandwidth dictates an area $O(N^2)$. This also dictates input and output wires of length $O(N)$. For large crossbars, wire size dominates the area. These growth rates are not acceptable even at the level of thousands of LUTs. If we were to build devices using a single monolithic crossbar for interconnect:
• area growth would be as the square of the number of LUTs supported
• cycle time would slow down linearly with the number of LUTs in the network
Consider, for the sake of illustration, the size of a crossbar required to interconnect a 2,500 4-LUT device. We will assume the minimum wire pitch is 8λ and the crossbar is implemented with two layers of dense metal routed at this minimum wire pitch. The area of such an array, as dictated simply by the wiring, would be:

$$(8\lambda \times 4 \times 2500) \times (8\lambda \times 2500) = 1.6\mathrm{G}\lambda^2$$

making for an area of 1.6Gλ²/2500 = 640Kλ² per 4-LUT just to handle the requisite wiring. Conventional FPGAs use a single SRAM cell to configure each of the crosspoints in the crossbar. If this were done, the area would be memory bit dominated rather than wire dominated and take up:

$$1000\lambda^2 \times (4 \times 2500) \times 2500 = 25\mathrm{G}\lambda^2$$

which results in 10Mλ² per 4-LUT just to hold the configuration memory. The area per LUT, of course, continues to grow linearly in the number of LUTs for larger networks.
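The crossbar example can be parameterized to show how the per-LUT cost scales with array size. The sketch below uses the same assumptions as the illustration (8λ wire pitch, 1000λ² per configuration bit, one bit per crosspoint); the function name and default arguments are illustrative.

```python
def crossbar_area_per_lut(n_luts, k=4, wire_pitch=8, bit_area=1000):
    """Return (wiring area, configuration memory area) per LUT, in lambda^2."""
    input_wires = k * n_luts    # one wire per LUT input
    output_wires = n_luts       # one wire per LUT output
    wire_area = (wire_pitch * input_wires) * (wire_pitch * output_wires)
    memory_area = bit_area * input_wires * output_wires  # one bit per crosspoint
    return wire_area / n_luts, memory_area / n_luts

for n in (100, 1000, 2500, 10000):
    wires, memory = crossbar_area_per_lut(n)
    print(f"{n:6d} LUTs: {wires:12.0f} wiring, {memory:12.0f} memory (lambda^2 per LUT)")
```

Both per-LUT figures grow linearly with the number of LUTs, so the total area grows quadratically.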
7.2.2 Multistage Networks
Multistage interconnection networks (e.g. butterfly, omega, CLOS, Benes) can reduce the total number of switches required from $O(N^2)$ to $O(N \log N)$, but have the same bisection bandwidth problem. Between any two pair of stages in a butterfly network, the total bisection bandwidth is $O(N)$, such that the wiring requirements dictate that area grows at $O(N^2)$.
7.2.3 Mesh Interconnect
At the opposite interconnect extreme, we can use only local connections within the array between adjacent, or close, array elements. By limiting all the connections to fixed distances, the link delay does not grow as the array grows. Further, the bisection bandwidth in a mesh configuration is $O(\sqrt{N})$ and hence, never dominates the logical array element size. However, communicating a piece of data between two points in the array requires switching delay proportional to the distance between the source and the destination. Since switching delay through programmable interconnect is generally much larger than fanout or wire propagation delay along a fixed wire, this makes distant communication slow and expensive. For example, in a topology where direct connections are only made between an array element and its north, east, south, and west neighbors (typically called a NEWS network), a signal must traverse a number of programmable switching elements proportional to the Manhattan distance between the source and the destination ($|\Delta x| + |\Delta y|$). For the interconnect network topologies typically encountered in logic circuits, this can make interconnect delay quite high, easily dominating the delay through the array element logic.
7.3 Issues in Reconfigurable Network Design
With this background, we can begin to formulate the design requirements for programmable
interconnect:
1. Provide adequate flexibility – The network must be capable of implementing the interconnection topology required by the programmed logic design with acceptable interconnect delays.
2. Use configuration memory efficiently – Space required for configuration memory can account for a reasonable fraction of the array real-estate, as we saw in Section 7.2.1. However, as we will see in Section 7.8, configuration encodings can be tight and do not have to take up substantial area relative to that required for wires and switches.
3. Balance bisection bandwidth – As discussed above, interconnect wiring takes space and can, in some topologies, dominate the array size. The wiring topology should be chosen to balance interconnect bandwidth with array size and expected design interconnect requirements.
4. Minimize delays – The delay through the routing network can easily be the dominant delay in a programmable technology (see Section 7.1.3). Care is required to minimize interconnect delays. Two significant factors of delay are:
(a) Propagation and fanout delay – Interconnect delay on a wire is proportional to distance and capacitive loading (fanout). This makes interconnect delay roughly proportional to distance run, especially when there are regular taps into the signal run. Consequently, small/short signal runs are faster than long signal runs.
(b) Switched element delay – Each programmable switching element in a path (e.g. crossbar, multiplexor) adds delay. This delay is generally much larger than the propagation or fanout delay associated with covering the same physical distance. Consequently, one generally wants to minimize the number of switch elements in a path, even if this means using some longer signal runs.
Switching can be used to reduce fanout on a line by segmenting tracks, and large fanout can be used to reduce switching by making a signal always available in several places. Minimizing the interconnect delay, therefore, always requires technology dependent tradeoffs between the amount of switching and the length of wire runs.
[Figure: a LUT tile with its surrounding interconnect block.]
Figure 7.1: Conventional FPGA Interconnect Topology
7.4 Conventional Interconnect
Conventional FPGA interconnect takes a hybrid approach with a mix of short, neighbor connections and longer connections. Figure 7.1 shows a canonical FPGA LUT tile. Full connectivity is not supported even within the interconnect of a single tile. Typically, the interconnect block includes:
• A hierarchy of line lengths – some interconnect lines span a single cell, some a small number of cells, and some an entire row or column
• Limited, but not complete, opportunity for corner turns
• Limited opportunity to link together shorter segments for longer routes
• Options for the value generated by the LUT to connect to some lines of each hierarchical length in each direction – perhaps including some local interconnect lines dedicated to the local LUT output
• Opportunity to select the $k$-LUT inputs from most of the lines converging in the interconnect block
The amount of interconnect in each of the two dimensions is not necessarily the same. Figure 7.2 shows these features in a caricature of conventional FPGA interconnect.
The University of Toronto has performed a number of empirical interconnect studies aimed at establishing basic FPGA interconnect characteristics, including:
• How densely to populate the interconnect with switches and the number of routing tracks required to route representative circuits [RB91]
• The merits of hierarchical interconnect [AL94]
• The distribution of line lengths [Seo94]

[Figure: array of LUT tiles with routing channels; legend: switch between orthogonal lines, switch between segments.]
Figure 7.2: FPGA Interconnect Caricature
One of the key differences between FPGAs and traditional “multiprocessor” networks is that FPGA interconnect paths are locked down serving a single function. The FPGA must be able to simultaneously route all source-sink connections using unique resources to realize the connectivity required by the FPGA. Another key difference is that the interconnection pattern is known prior to execution, so offline partitioning and placement can be used to exploit locality and thereby reduce the interconnect requirements.
7.5 Switch Requirements for FPGAs with 100-1000 LUTs
Before we examine how network requirements scale with connectivity and network size, in
this section, we brieﬂy review the number of switches conventionally employed by networks
supporting 100 to 1000 4-LUTs. Brown and Rose [RB91, BFRV92] suggest each 4-LUT in a
moderate sized FPGA with 100’s of 4-LUTs will require 200-400 switches. Agarwal and Lewis
suggest approximately 100 switches per LUT for hierarchical FPGAs [AL94] with some reduction
in logic utilization. Conventional, commercial FPGAs do little or no encoding on their interconnect
bit streams – that is, each interconnect switch is controlled by a single conﬁguration bit. From the
conﬁguration bit summary in Table 7.2, we see that commercial devices also exhibit on the order
of 200 switches per 4-LUT. The fact that conventional FPGAs can, with difﬁculty, route most all
designs using less than 80-90% of the device LUTs, suggests that they chose a number of switches
which provides reasonably “adequate” interconnect for the current device sizes – hundreds to a
couple of thousand 4-LUTs.
7.6 Channel and Wire Growth
In Sections 7.1 and 7.5, we have empirically established the size of conventional interconnect.
However, as we glimpsed in Section 7.2, the area which these resources occupy is not necessarily
independent of the number of LUTs interconnected. In this section we look at how interconnect
requirements will grow with the number of LUTs supported.
The best characterization to date which empirically measures interconnect requirements is Rent’s Rule [LR71, Vil82]:

$$N_{io} = c \cdot N^{p} \qquad (7.1)$$

$N_{io}$ is the number of interconnections in/out of a region containing $N$ gates. $c$ and $p$ are empirical constants. For logic functions $0.5 \le p \le 0.7$, typically.
El Gamal used a stochastic model to estimate the interconnection requirements for channeled gate arrays [Gam81]. He found that each routing channel requires $O(\bar{R})$ tracks if the average wire length, $\bar{R}$, grows faster than $\log N$. $N$ here is the total number of circuits in the array, generally arranged in a $\sqrt{N} \times \sqrt{N}$ array. Brown used El Gamal’s routing model for FPGAs and found good correspondence between it and FPGA interconnect requirements [Bro92]. For large numbers of gates ($N$) and $p > 0.5$, Donath finds that $\bar{R} \propto N^{p-0.5}$ [Don79]. Together this means the channel width grows as $O(N^{p-0.5})$. From which we can derive the interconnect requirements growth:

$$A_{interconnect} \propto \left(\sqrt{N} \cdot N^{p-0.5}\right)^2 = N^{2p} \qquad (7.2)$$

$\frac{2}{3}$ is often considered a good, conservative value for $p$ to handle most interconnect requirements.
For $p \le 0.5$, Donath finds that $\bar{R}$ grows as $O(\log N)$ or smaller. For $\bar{R}$ growing no faster than $\ln N$, El Gamal’s model suggests the track width grows as $O(\ln N)$. In this case, total interconnect requirements grow as $O(N \log^2 N)$.
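To get a feel for these growth rates, the sketch below tabulates the Rent's Rule i/o count for a region and the implied growth in wiring area per gate, which is the total interconnect area ($N^{2p}$) divided by $N$. The constant `c = 4` and the sample exponents are illustrative choices, not values fixed by the text, and the area figure is relative (proportionality constants are dropped).

```python
def rent_io(n, c, p):
    """Rent's Rule: number of signals crossing the boundary of an n-gate region."""
    return c * n**p

def relative_area_per_gate(n, p):
    """Wiring-limited area per gate, up to a constant: N^(2p) / N = N^(2p - 1)."""
    return n ** (2 * p - 1)

for p in (0.5, 0.67, 0.75):
    for n in (256, 1024, 4096):
        print(f"p={p:4.2f} N={n:5d} io={rent_io(n, 4, p):8.1f} "
              f"area/gate ~ {relative_area_per_gate(n, p):6.2f}")
```

At $p = 0.5$ the wiring-limited area per gate stays flat, while for larger exponents each gate gets more expensive as the array grows.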
7.6.1 Rent’s Rule Based Hierarchical Interconnect Model
To make this size estimate more concrete, let us consider a speciﬁc structure built according
to Rent’s Rule. We build a fully hierarchical interconnect with inter-level signaling bandwidth
growing according to Rent’s Rule. To simplify analysis, we consider only unidirectional signal
wires.
The $N$ gates are recursively partitioned into $b$ equally sized sets at each level of the hierarchy. The principal interconnect occurs at each node of convergence in the hierarchy (see Figure 7.3). At a level $l$ in the hierarchy, each node has a fan-in from below of $b \cdot n_{out}(l-1)$ signals and a fan-in from above of $n_{in}(l)$. Similarly, it has a fan-out of $b \cdot n_{in}(l-1)$ toward the leaves and $n_{out}(l)$ towards the root. At each level $l$, we have $N_l$ LUTs, $n_{in}(l)$ external inputs, and $n_{out}(l)$ external outputs. According to the hierarchical combining and Rent’s Rule growth, we have:

$$N_l = b^l, \qquad n_{in}(l) = n_{out}(l) = c \cdot N_l^{\,p} = c \cdot b^{pl} \qquad (7.3)$$

[Figure: a node of convergence with $n_{in}$ and $n_{out}$ signals on the link toward the root and on each link toward the leaves.]
Figure 7.3: Logical Structure of Hierarchical Interconnect

We take $c = k$, the number of LUT inputs. When $N_l < n_{out}(l)$, which will be true for small $l$, we take $n_{out}(l) = N_l$ – that is, all outputs are passed out of the region when this Rent bandwidth permits.
Logically, we have $b + 1$ distinct output directions from each node of convergence in the interconnect – $b$ for the leaves, plus one for the root. Allowing full connectivity within each tree node, each of the $b$ leaves picks its $n_{in}(l-1)$ inputs from the $(b-1) \cdot n_{out}(l-1)$ outputs from its siblings and from the $n_{in}(l)$ inputs from the parent node. The $n_{out}(l)$ outputs of this node are selected from the $b \cdot n_{out}(l-1)$ outputs from all subtrees converging at this point. Figure 7.4 shows this basic arrangement for $b = 2$.

[Figure: one "Up XBAR" driving $n_{out}$ toward the root and two "Down XBAR" blocks driving $n_{in}$ toward each child.]
Figure 7.4: Switching node in 2-ary Hierarchical Interconnect
7.6.2 Wire Growth in Rent Hierarchy Model
First, let us consider how wiring resources grow in this structure. At each stage of the hierarchy, there are $n_{in}(l) + n_{out}(l) = 2c \cdot N_l^{\,p}$ wires coming and leaving each subarray. This makes the bisection width of each subarray $O(N_l^{\,p})$. For a two-dimensional network layout, this bisection width must cross out of the subarray through the perimeter. Thus the perimeter of each subarray is $O(N_l^{\,p})$. The area of the subarrays will be proportional to the square of its perimeter, making:

$$A_{subarray} \propto N_l^{\,2p}$$

The area required for each LUT based on wiring constraints, then, goes as:

$$A_{LUT} \propto \frac{N^{2p}}{N} = N^{(2p-1)} \qquad (7.4)$$

Not unsurprisingly, this matches the interconnect growth we derived in Equation 7.2. Of course, if $p < 0.5$, wiring is not the dominant resource constraining LUT area. $A_{LUT}$ may be $O(1)$ for $p < 0.5$ as far as strict wiring requirements are concerned.
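7.6.3 Switch Growth in Rent Hierarchy Model
We can also look at the number of switches required if each of the logical switching units is a fully-populated crossbar. At each level, $l$, the total number of switches is:

$$
\begin{aligned}
N_{sw}(l) &= b \cdot \underbrace{\underbrace{\left((b-1)\,n_{out}(l-1) + n_{in}(l)\right)}_{\text{inputs to down xbar}} \cdot\, n_{in}(l-1)}_{\text{each down xbar}} + \underbrace{b \cdot n_{out}(l-1) \cdot n_{out}(l)}_{\text{up xbar}} \\
&= b \left((b-1)\,c\,b^{p(l-1)} + c\,b^{pl}\right) c\,b^{p(l-1)} + b\,c\,b^{p(l-1)}\,c\,b^{pl} \\
&= c^2\, b^{2p(l-1)}\, b \left(b - 1 + b^p\right) + c^2\, b^{2p(l-1)}\, b\, b^p \\
&= c^2\, b^{2p(l-1)}\, b \left(b - 1 + 2b^p\right) \qquad (7.5)
\end{aligned}
$$

Amortizing across the number of LUTs supported at level $l$, we can count the number of switches per LUT at each level:

$$
\begin{aligned}
\frac{N_{sw}(l)}{N_l} &= \frac{c^2\, b^{2p(l-1)}\, b \left(b - 1 + 2b^p\right)}{b^l} \\
&= c^2\, b^{(2p-1)(l-1)} \left(b - 1 + 2b^p\right) \qquad (7.6)
\end{aligned}
$$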
Summing across all levels, we can thus calculate the number of switches per LUT as a function of the size of the network:

$$N_{sw/LUT}(N) = \sum_{l=1}^{\log_b N} c^2\, b^{(2p-1)(l-1)} \left(b - 1 + 2b^p\right)$$

Expanding the sum as a geometric series in $b^{(2p-1)}$:

$$N_{sw/LUT}(N) = c^2 \left(b - 1 + 2b^p\right) \sum_{l=0}^{\log_b(N) - 1} \left(b^{(2p-1)}\right)^{l} \qquad (7.7)$$

For $p > 0.5$, this gives us:

$$N_{sw/LUT}(N) = c^2 \left(b - 1 + 2b^p\right) \frac{N^{(2p-1)} - 1}{b^{(2p-1)} - 1} \le N^{(2p-1)}\, c^2 \left(\frac{1}{b^{(2p-1)} - 1}\right) \left(b - 1 + 2b^p\right) \qquad (7.8)$$

For $p = 0.5$, each sum term in Equation 7.7 goes to one:

$$N_{sw/LUT}(N) = \log_b(N)\; c^2 \left(b - 1 + 2b^p\right) \qquad (7.9)$$

For $p < 0.5$,

$$N_{sw/LUT}(N) = c^2 \left(b - 1 + 2b^p\right) \frac{1 - N^{-(1-2p)}}{1 - b^{-(1-2p)}} \le c^2 \left(b - 1 + 2b^p\right) \frac{b^{(1-2p)}}{b^{(1-2p)} - 1} \qquad (7.10)$$

Putting these cases together:

$$
N_{sw/LUT}(N) \le
\begin{cases}
c^2 \left(b - 1 + 2b^p\right) \dfrac{b^{(1-2p)}}{b^{(1-2p)} - 1} & p < 0.5 \\[2ex]
\log_b(N)\; c^2 \left(b - 1 + 2b^p\right) & p = 0.5 \\[2ex]
N^{(2p-1)}\, c^2 \left(\dfrac{1}{b^{(2p-1)} - 1}\right) \left(b - 1 + 2b^p\right) & p > 0.5
\end{cases}
\qquad (7.11)
$$

Here, we see switching area per LUT grows as $O(1)$ for $p < 0.5$, and $O(N^{(2p-1)})$ for $p > 0.5$. Again, this matches our wiring growth expectations (Equation 7.2).
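The per-level switch count can be summed numerically and compared against its geometric-series closed form. The sketch below assumes the model described above: Rent i/o of $c \cdot b^{pl}$ at level $l$, fully populated crossbars, and a per-level cost of $c^2\, b^{(2p-1)(l-1)}(b - 1 + 2b^p)$ switches per LUT. The function names and the default values `c=4`, `b=2` are illustrative, and the sketch ignores the two corrections used by the direct calculation.

```python
def level_switches_per_lut(level, c, b, p):
    """Switches per LUT contributed by one hierarchy level."""
    return c**2 * b ** ((2 * p - 1) * (level - 1)) * (b - 1 + 2 * b**p)

def switches_per_lut(levels, c=4, b=2, p=0.5):
    """Sum the per-level cost over levels 1..levels; the network holds b**levels LUTs."""
    return sum(level_switches_per_lut(l, c, b, p) for l in range(1, levels + 1))

def closed_form(levels, c=4, b=2, p=0.5):
    """Exact geometric-series sum of the per-level costs."""
    scale = c**2 * (b - 1 + 2 * b**p)
    if p == 0.5:
        return levels * scale          # every term of the series equals one
    r = b ** (2 * p - 1)
    return scale * (r**levels - 1) / (r - 1)

for p in (0.5, 0.67):
    for levels in (8, 10, 14):         # 256, 1024, and 16384 LUTs when b = 2
        print(f"p={p} N={2**levels:6d}: {switches_per_lut(levels, p=p):8.1f} switches/LUT")
```

With these parameters the sum reaches roughly 860 switches per LUT at 16384 LUTs for $p = 0.5$ and several thousand for $p = 0.67$.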
While Equation 7.11 gives the correct growth rates, it overestimates the required number of switches on two accounts:
1. It does not take into account the limited number of distinct outputs at the lowest stages of the network, i.e. when there are fewer outputs than the Rent i/o suggests.
2. It approximates each crossbar as requiring $n \times m$ switches. However, since each crossbar is performing an $n$ choose $m$ operation, only $m(n - m + 1)$ crosspoints are actually required to provide full connectivity at each tree interconnect node.
Figures 7.5 and 7.6 show the difference between Equation 7.11 and a direct calculation which includes the above two effects. Asymptotically, the difference is in the constant factor. Note that for $p = 0.5$, the number of switches per LUT computed by the direct calculation in the 256–1024 LUT range is 190-250, which is on par with contemporary interconnects (Section 7.5).

[Figure: switches per LUT (0 to 900) versus $N_{LUT}$ (1 to 16384, logarithmic axis); curves "Rent p=0.50 equation" and "Rent p=0.50 direct"; parameters $c = 4$, $b = 2$, $p = 0.5$.]
Figure 7.5: Switches per LUT – Equation versus Direct Calculation
[Figure: switches per LUT (0 to 7000) versus $N_{LUT}$ (1 to 16384, logarithmic axis); curves "Rent p=0.67 equation" and "Rent p=0.67 direct"; parameters $c = 4$, $b = 2$, $p = 0.67$.]
Figure 7.6: Switches per LUT – Equation versus Direct Calculation
787.7 Network Utilization Efﬁciency
In the previous section, we saw that the amount of interconnect we need to provide depends
upon the connectivity of the network. This makes it difﬁcult to design a single network which will
efﬁciently accommodate arbitrary designs. If the design has limited connectivity, but the network
provides a large amount of connectivity, the network is over designed relative to the design and
provides less functional density than achievable. If the design has considerable connectivity, but
the network provides less, the design must be routed sparsely on the interconnect, leaving many of
the device LUTs unusable.
Using the switching models derived in the previous section, we can examine the relative
inefficiencies of using a design with Rent exponent $p_{des}$ on a network with Rent exponent $p_{net}$.
We do this by looking at the ratio of the area occupied by a design with $N_{LUT}$ LUTs and $p_{des}$ on top of
a network built using Rent exponent $p_{net}$. If $p_{des} \le p_{net}$, then the ratio is simply the ratio of the
area per LUT of a $p_{net}$ interconnect of $N_{LUT}$ LUTs to the area per LUT of a $p_{des}$ interconnect
of $N_{LUT}$ LUTs. However, if $p_{des} > p_{net}$, we cannot simply map the design netlist on top of the
device LUTs. Here, we have to figure out how much larger the network must be than the number of
LUTs in the design in order to accommodate the highly connected design. Let us call this scaling
factor $s$. In order for the network to accommodate the design, it must have enough i/o bandwidth
into each subregion. Starting at the top level in the design, this means:
\[ \text{(network i/o at the top level)} \ge \text{(design i/o at the top level)} \]
The only way to accommodate this requirement with a fixed $p_{net}$ is to scale up the network used.
Applying Rent's Rule (Equation 7.1), this means:
\[ (s \, N_{LUT})^{p_{net}} \ge N_{LUT}^{p_{des}} \]
Solving this relation for equality:
\[ (s \, N_{LUT})^{p_{net}} = N_{LUT}^{p_{des}} \]
\[ s = N_{LUT}^{(p_{des}/p_{net} - 1)} \qquad (7.12) \]
Note that once we accommodate the top level of the design, all other levels are accommodated
as well. That is, once we have chosen $s$ as above, at the top level:
\[ (s \, N_{LUT})^{p_{net}} = N_{LUT}^{p_{des}} \qquad (7.13) \]
Since $p_{des} > p_{net}$, at the next lower level, the connectivity required for the design will shrink
faster than the network connectivity, so lower levels are satisfied by the same scale up factor which
satisfies the top level in the design. The overhead ratio for the $p_{des} > p_{net}$ case, then, is the ratio
of the size of an $s \cdot N_{LUT}$ interconnect with Rent exponent $p_{net}$ compared to the size of an
$N_{LUT}$ interconnect with Rent exponent $p_{des}$.
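The scale-up step above can be sketched numerically. The following is a minimal illustration, assuming the network and the design share the same Rent constant so that it cancels; the function names and arguments (`scale_up_factor`, `network_luts_required`, `n_lut`, `p_des`, `p_net`) are ours and do not appear in the text.

```python
def scale_up_factor(n_lut: int, p_des: float, p_net: float) -> float:
    """Scale-up factor of Equation 7.12 (hypothetical helper name).

    Solves (s * n_lut) ** p_net == n_lut ** p_des for s. When the design is
    no more connected than the network, no scale-up is needed.
    """
    if p_des <= p_net:
        return 1.0
    return n_lut ** (p_des / p_net - 1.0)


def network_luts_required(n_lut: int, p_des: float, p_net: float) -> float:
    """Number of network LUT sites needed to host an n_lut-LUT design."""
    return scale_up_factor(n_lut, p_des, p_net) * n_lut


if __name__ == "__main__":
    # A 4096-LUT design with p_des = 0.8 on a p_net = 0.5 network.
    s = scale_up_factor(4096, 0.8, 0.5)
    print(f"scale-up factor: {s:.1f}")
    print(f"network LUT sites: {network_luts_required(4096, 0.8, 0.5):.0f}")
```

The sketch only captures the continuous approximation; it does not model the discretization of network size and levels.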
In making this area comparison, we assume that switching area dominates non-switching area,
and we approximate LUT area as proportional to the number of switches. From Section 7.1, we saw
that this is true of conventional devices. In the previous section, we saw that switching requirements
grow at least as fast as wires, and generally faster than non-switching resources. This suggests that
switching area will continue to dominate non-switching area as device capacities grow.
If we solve for the scaling factor strictly according to Equation 7.13, the ratios are continuous and
do not take into account the discretization effects associated with network size and levels. The
continuous approximation gives us a smooth way to compare general overhead growth trends.
Figure 7.7 shows both the discrete and continuous comparisons for various $p_{des}$ implemented on a
network with $p_{net} = 0.5$ as a function of $N_{LUT}$. Figure 7.8 similarly shows the relative overheads
for implementing designs with $N_{LUT} = 4096$ and various $p_{des}$ on networks with various $p_{net}$.
Figure 7.9 plots the same data as the continuous case from Figure 7.8 on three axes. Figure 7.10
plots the continuous efficiency, the inverse of overhead, and Figure 7.11 plots the continuous
efficiency on a logarithmic scale.
[Figure 7.7: Overhead Growth versus $N_{LUT}$ for various $p_{des}$. Top: discretized ratios; bottom:
continuous ratios. Both panels plot Overhead against NLUT (1 to 131072) with one curve for each of
pdes = 0.30, 0.40, 0.50, 0.60, 0.70, 0.80; parameter line: 4, 2, $p_{net}$ = 0.5.]

[Figure 7.8: Overhead for $p_{des}$ versus $p_{net}$. Top: discretized ratios; bottom: continuous ratios.
Both panels plot Overhead Ratio (0 to 20) against pdes (0.00 to 1.00) with one curve for each of
pnet = 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70, 0.80; parameter line: 4, 2, $N_{LUT}$ = 4096.]

[Figure 7.9: Continuous Overhead for $p_{des}$ versus $p_{net}$. Three-axis plots of overhead (2 to 10)
against pnet and pdes (each 0.10 to 0.90); parameter line: 4, 2, $N_{LUT}$ = 4096.]

[Figure 7.10: Continuous Efficiency for $p_{des}$ versus $p_{net}$. Three-axis plots of efficiency
(0.2 to 1.0) against pnet and pdes (each 0.10 to 0.90); parameter line: 4, 2, $N_{LUT}$ = 4096.]

[Figure 7.11: Continuous Efficiency for $p_{des}$ versus $p_{net}$ (Log Scale). Three-axis plots of
efficiency on a logarithmic axis (1/1024 to 1.0) against pnet and pdes (each 0.10 to 0.90);
parameter line: 4, 2, $N_{LUT}$ = 4096.]
Ideally, we would like to match the programmable network connectivity to the design connectivity.
Unfortunately, we do not generally get that choice. Figures 7.7 through 7.11 show us that
it is just as inefficient to provide too much interconnect for a design as it is to provide too little.
This is important to notice, since there is a tendency to demand rich interconnect that provides high
gate utilization across all designs. However, since the non-interconnect area is trivial compared to
network area in FPGA devices, optimizing for gate utilization is often shortsighted.
As a final, illustrative example, let us consider the task of picking the network connectivity,
$p_{net}$, assuming that we know typical designs will have a $p_{des}$ between 0.4 and 0.8. Figure 7.12
shows the overheads for $p_{net}$ values of 0.4, 0.58, and 0.8 as a function of $p_{des}$, respectively.
If we further assume that the design Rent exponents are evenly distributed in this range, we can
calculate an expected overhead:
\[ E(\mathrm{overhead}) = \frac{1}{0.8 - 0.4} \int_{0.4}^{0.8} \mathrm{overhead}(p_{net}, p_{des}) \, dp_{des} \]
Figure 7.13 plots this expected overhead for the identified range. We see that the expected overhead
is quite flat between $p_{net}$ = 0.5 and 0.6 with an expected overhead of just over 2×. At the ends of the
spectrum, the expected overhead is 8× worse. Note, in particular, if we chose to build $p_{net} = 0.8$
in order to guarantee full utilization of every LUT, we would pay a 16× overhead on average, and
a 56× overhead in the worst case. In contrast, choosing $p_{net} = 0.58$ has a worst-case overhead of
4.2× and an average overhead of 2.2×.
[Figure 7.12: Sample $p_{des}$ versus $p_{net}$ Overheads. Three panels plot Overhead Ratio against
pdes (0.40 to 0.80): pnet=0.40 (axis 0 to 90), pnet=0.58 (axis 0 to 5), and pnet=0.80 (axis 0 to 60);
parameter line: 4, 2, $N_{LUT}$ = 4096.]

[Figure 7.13: E(overhead) versus $p_{net}$ for Uniform Distribution. Plot of E(Overhead Ratio)
(0 to 26) against pnet (0.40 to 0.80); parameter line: 4, 2, $N_{LUT}$ = 4096.]

7.8 Interconnect Description
We can also ask how the requirements for interconnect description will grow. Trivially, we
know that it will grow no faster than the number of switches composing the interconnect. However,
it can actually grow much slower. We start (Section 7.8.1) by using the full-connectivity model
of the crossbar to establish an upper bound on the necessary interconnect description length. We
then continue (Section 7.8.2) using the Rent’s rule based hierarchical interconnect, as in previous
sections, to derive a tighter approximation. By either metric, we see that the instruction sizes for
conventional FPGAs are signiﬁcantly larger than necessary. This observation suggests that context
memory area and reload instruction bandwidth can be signiﬁcantly reduced by judicious coding
(Section 7.8.3).
7.8.1 Weak Upper Bound
Assuming that the network may be arbitrarily connected, we can count the number of possible
interconnection patterns to get an upper bound on the number of interconnection bits which can
be usefully employed describing the input to each LUT. We start by assuming we have a device
composed of:
  $N_{LUT}$ $K$-input lookup tables
  $N_{in}$ inputs to the network (from the chip i/o)
  $N_{out}$ outputs from the network (chip outputs and enables)
Each LUT input can come from any of the other LUT outputs ($N_{LUT}$) or any of the $N_{in}$ inputs. We
can encode the source selection for a single input:
\[ N_{bits/input} = \log_2 (N_{LUT} + N_{in}) \qquad (7.14) \]
Since a LUT has $K$ inputs, the total number of interconnect bits needed is simply:
\[ N_{bits/LUT} = K \log_2 (N_{LUT} + N_{in}) \qquad (7.15) \]
E.g., a 1000 4-LUT device with 200 inputs would require only 41 bits (44 if we encode each input
separately) to specify each LUT's interconnect. A 9000 4-LUT device with 600 inputs requires
only 53 bits (56 for separate input encodings).
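The worked figures above can be reproduced with a short calculation. This is a minimal sketch of the full-crossbar bound, assuming the bound is rounded up to a whole number of bits per LUT; the function names are illustrative and not from the text.

```python
import math


def crossbar_bits_per_lut(n_lut: int, n_in: int, k: int = 4) -> int:
    """Bits per LUT when all K source selections are encoded jointly.

    Computes K * log2(n_lut + n_in), rounded up to a whole bit.
    """
    return math.ceil(k * math.log2(n_lut + n_in))


def crossbar_bits_separate(n_lut: int, n_in: int, k: int = 4) -> int:
    """Bits per LUT when each of the K inputs gets its own whole-bit field."""
    return k * math.ceil(math.log2(n_lut + n_in))


if __name__ == "__main__":
    for n_lut, n_in in [(1000, 200), (9000, 600)]:
        joint = crossbar_bits_per_lut(n_lut, n_in)
        separate = crossbar_bits_separate(n_lut, n_in)
        print(f"{n_lut} 4-LUTs, {n_in} inputs: {joint} bits ({separate} separate)")
```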
If our functional elements are truly 4-LUTs, then this upper bound can be tightened by noticing
that we gain no additional functionality by being able to route a particular source into the LUT
multiple times and the assignment of the sources to LUT inputs is inconsequential. With this
observation, we really only need to choose $K$ items from $N_{LUT} + N_{in}$ when specifying the LUT
interconnect. This gives:
\[ N_{bits/LUT} = \log_2 \binom{N_{LUT} + N_{in}}{K} \qquad (7.16) \]
E.g., our 1000 4-LUT device with 200 inputs requires only 37 bits and our 9000 4-LUT device with
600 inputs requires only 49 bits. Here we save an additional 4 bits per 4-LUT. Asymptotically,
writing $N = N_{LUT} + N_{in}$ for the number of sources:
\[ \lim_{N \to \infty} \left( K \log_2 N - \log_2 \binom{N}{K} \right)
   = \lim_{N \to \infty} \log_2 \left( \frac{N^K \, K!}{N (N-1)(N-2)(N-3) \cdots (N-K+1)} \right)
   = \lim_{N \to \infty} \log_2 \left( K! \right)
   = \log_2 \left( K! \right) \qquad (7.17) \]
So, we expect exploiting the equivalence of the inputs on a $K$-LUT to save us $\log_2 (K!)$ bits
from the number of bits required for full interconnect. For $K = 4$, this amounts to a savings of
4-5 bits per LUT.
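A short calculation confirms the input-equivalence figures and the asymptotic savings. This sketch assumes the bound is the base-2 logarithm of the number of ways to choose K distinct sources, rounded up to a whole bit; the function names are illustrative only.

```python
import math


def equivalence_bits_per_lut(n_lut: int, n_in: int, k: int = 4) -> int:
    """Bits per LUT when only the unordered set of K sources is encoded."""
    choices = math.comb(n_lut + n_in, k)  # exact integer, Python 3.8+
    return math.ceil(math.log2(choices))


def asymptotic_savings_bits(k: int) -> float:
    """Limit of the savings over the full crossbar bound: log2(K!)."""
    return math.log2(math.factorial(k))


if __name__ == "__main__":
    print(equivalence_bits_per_lut(1000, 200))
    print(equivalence_bits_per_lut(9000, 600))
    print(round(asymptotic_savings_bits(4), 2))
```

For $K = 4$ the limit is $\log_2 24$, between 4 and 5 bits, which is consistent with the 4-bit savings seen in both device examples.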
Commercial devices are not composed purely of LUTs, but we can draw a box around their
basic programming elements and use the above counting arguments to get a loose upper bound
on the number of interconnect programming bits they could require. Table 7.4 shows parameters
for each of several commercial device families along with a pedagogical reference. Using the
parameters given in Table 7.4, we can use the full connectivity assumption to compute an upper
bound on the network description length:
\[ N_{net\_bits} = N_{i} \log_2 ( N_{b} N_{o} + N_{io} N_{io\_in} ) \qquad (7.18) \]
where $N_{i}$ and $N_{o}$ are the inputs and outputs per basic logic element, $N_{b}$ is the number of
blocks, $N_{io}$ is the number of i/o elements, and $N_{io\_in}$ is the number of inputs from each i/o element.
Table 7.5 calculates this bound for each of the device families from Table 7.4 and contrasts
these numbers with the number of actual device bits per basic element. The comparison is
necessarily crude since vendors do not provide detailed information on their configuration streams.
However, we expect the unaccounted control bits in Table 7.5 to be no more than 10% of the total
bits per block. With this expectation, we see that the commercial devices exhibit a factor of two
to three more interconnect configuration bits than would be required to provide full, placement-
independent, interconnect of the logic blocks.
7.8.2 Structure-Based Estimates
The upper bound derived in the previous section assumed full connectivity of the network.
However, the network is generally much more restricted. The restrictions imply a smaller class of
realizable connection patterns and fewer requisite interconnect bits. In this section we return to
our pedagogical, hierarchical interconnect from Section 7.6. For small Rent exponents, $p$, we can
derive tighter bounds.
Family      Element  In  Out  IOout  IOin  Logic  Blocks  IOs
Xilinx 2K   CLB       4    2      2     1     16     100   74
Xilinx 3K   CLB       9    2      2     2     32     484  176
Xilinx 4K   CLB      13    4      4     2     40    1024  256
Xilinx 5K   CLB      16    8      2     1     64     484  244
Altera 8K   LE        4    1      1     1     16    1296  208
Orca 2C     PFU      19    6      3     1     64     900  480
UTFPGA      Tile     11    4      2     1     48     700  256
LEGO        Tile     15    4      2     1     64     500  256
DPGA        4-LUT     4    1      2     1     16     144   48
Reference   4-LUT     4    1      2     1     16    2000  256
- assumed value
In      number of inputs per basic logic element
Out     number of outputs per basic logic element
Logic   number of bits specifying the logic function per basic element
IOin    number of inputs from each i/o element
IOout   number of outputs to each i/o element
Blocks  number of blocks in largest member of family
IOs     number of i/o elements in largest member of family
Table 7.4: Parameters for a Sampling of Contemporary Programmable Devices

Part                   Logic Bits  Net Bits  Actual Bits (approximate)
Xilinx 2K                      16        33        160
Xilinx 3K                      16        94        190
Xilinx 4K                      40       159        420
Xilinx 5K                      64       193        510
Altera 8K                      16        43        190
Orca 2C                        64       247        480
UTFPGA                         48       128        146
LEGO                           64       167        492
DPGA                           16        31         40
Pedagogical Reference          16        45          –
Table 7.5: Configuration Bits – Requirement Upper Bound v/s Actual
[Figure 7.14: Network Bits per LUT v/s Rent Exponent for $N_{LUT}$ = 4096 (K=4). Plot of Network
Bits/LUT (0 to 100) against Rent Exponent (p) (0.00 to 1.00) with curves "Rent n=2", "Rent n=4",
"Rent n=8", "Full Crossbar", and "K-LUT Equivalence".]
Reconsidering the hierarchical interconnect structure from Figures 7.3 and 7.4, we can calculate
the number of bits required per level of the hierarchy.
log2
( 1) 1
1
1
log2 !
1
1
(7.19)
log2
1 1
(7.20)
Table 7.6 summarizes these values by level for $p = 2/3$, along with the number of bits according to
the earlier crossbar and K-LUT equivalence calculations. This scheme also gives only an upper
bound since the individual treatment of the permutations within each level counts more distinct
combinations than actually exist. For moderate values of $p$, though, this will give a tighter bound
than the crossbar bound derived in the previous section.
The number of bits required will vary with the Rent exponent $p$. Figure 7.14 shows this
variation. It also shows the relationship among choices of the arity of the hierarchy, $n$, and the
crossbar and K-LUT equivalence bounds. Figure 7.15 shows the growth rate versus the number of
LUTs for several Rent exponents and the two crossbar bounds.
Note that Donath performs a similar calculation in [Don74]. He uses a more restrictive
interconnect model. Using 2- to 3-input, single-function gates, he calculates 7-10 bits of memory
per gate for $p$ = 0.5 to 0.8. In Donath's model, the required description bits do not grow with
network size.
7.8.3 Signiﬁcance and Impact
Resources for instruction storage and distribution, as the next chapter (Chapter 8) will address,
can take up significant area and play a big role in the characteristics of an architecture. Notably, the
size of the instruction determines the size of the instruction store on and off chip and the bandwidth
required to load new instructions.
LUTs    In     Out    Xbar  Equiv  Up   Down  Level      Cum.       Total
                                              (integer)  (integer)
1       4      1      0     0      0.0  0.0   0          0          0.00
2       7      2      12    6      0.0  5.6   6          6          5.63
4       11     4      15    10     0.0  5.1   6          12         10.75
8       17     8      18    13     0.0  4.5   5          17         15.23
16      26     16     21    16     0.0  3.8   4          21         19.06
32      41     32     24    20     0.0  3.3   4          25         22.37
64      65     64     28    23     0.0  2.9   3          28         25.22
128     102    102    38    27     0.7  2.4   4          32         28.36
256     162    162    40    31     0.6  1.9   3          35         30.87
512     257    257    43    34     0.5  1.6   3          38         32.89
1024    407    407    46    38     0.4  1.2   2          40         34.49
2048    646    646    49    41     0.3  1.0   2          42         35.76
4096    1025   1025   53    44     0.2  0.8   2          44         36.78
8192    1626   1626   56    48     0.2  0.6   1          45         37.58
16384   2581   2581   59    52     0.1  0.5   1          46         38.22
$p$ = 0.67 (Rent Parameter); $K$ = 4 (K-LUT); $n$ = 2 (2-ary hierarchy)
LUTs   Total number of LUTs in level
In     number of inputs to level
Out    number of outputs from level
Xbar   number of bits per LUT for full crossbar interconnect
Equiv  number of bits per LUT exploiting K-LUT input equivalence
Up     number of bits per LUT to describe up connections this level
Down   number of bits per LUT to describe down connections this level
Level (integer)  integer number of bits per LUT to describe this level interconnect
Cum. (integer)   bits per LUT to describe interconnect to this level with integer rounding
Total  total number of bits per LUT to describe interconnect to this level
Table 7.6: 4-LUT in 2-ary Hierarchical Interconnect with $p = 2/3$
The bounds we derived in the previous sections show that the instruction sizes in traditional
FPGAs are higher than necessary, at least by a factor of 2-4×. For single context devices, as we have
seen, instruction memory makes up only a small fraction of the area on a conventional FPGA. For
this reason, these bloated instructions do not adversely affect FPGA cell area (See Figure 7.16). In
fact, in wire limited regimes, they may help by localizing instruction bits to the values they control.
The most significant impact is on reconfiguration time. Smaller instructions mean we can reload
instructions in less time, given the same bandwidth for instruction reload. Alternately, it means
that correspondingly fewer resources can be dedicated to instruction distribution in order to achieve
the same instruction reload time as the larger instructions.
[Figure 7.15: Network Bits per LUT v/s Number of LUTs for $n$ = 2 (K=4). Plot of Network Bits/LUT
(0 to 70) against NLUT (1 to 16384) with curves "Rent p=0.50", "Rent p=0.67", "Rent p=0.75",
"Full Crossbar (p=0.67)", and "K-LUT Equivalence (p=0.67)".]
As we begin to make heavy use of the reconfigurable aspects of programmable devices, device
reconfiguration time becomes an important factor determining the performance provided by the
part. In these rapid reuse scenarios, instruction size can play a significant role in determining device
area and performance.
1. Off-chip context reloads for single- or multi-context devices are slow because a large amount
of configuration data (typically, $10^5$ bits) must be transferred across a limited bandwidth
i/o path. Reducing the size of the instructions transmitted across the i/o will improve reload
performance.
2. One technique for reducing the reconfiguration time is to store multiple, on-chip contexts.
When we start replicating the instructions associated with each LUT, the relative area
consumed by instruction memory increases, making economy of instruction encoding more
important (See Figure 7.17).
7.8.4 Instruction Growth versus Interconnect Growth
From the previous sections, we have seen that interconnect requirements grow faster than
interconnect description requirements. Specifically:
  For $p > 0.5$, the number of switches and the amount of wiring grow as
  $N_{LUT}^{(2p-1)}$ per LUT.
  The number of interconnect configuration bits grows at most as $\log N_{LUT}$.
We already see that the switch and wire resources occupy a significantly larger fraction of the area
per LUT than the interconnect description (Section 7.1.2). As $N_{LUT}$ gets large, the size disparity
will grow.
This is one reason that single context devices can afford to use sparse interconnect encodings.
Since the wires and switches are the dominant and limiting resource, additional conﬁguration bits
are not costly. In the wire limited case, we may have free area under long routing channels for
memory cells. In fact, dense encoding of the configuration space has the negative effect that control
signals must be routed from the configuration memory cells to the switching points. The closer
we try to squeeze the bit stream encoding to its minimum, the less locality we have available
between configuration bits and controlled switches. These control lines compete with network
wiring, exacerbating the routing problems on a wire dominated layout.

[Figure 7.16: Single Context FPGA Area. Area diagram with regions labeled Interconnect,
Interconnect Configuration, and Logic Configuration. For single context devices, the savings
potential from denser interconnect description encodings is small – maybe 5-10%.]

[Figure 7.17: Multicontext FPGA Area. Area diagram with regions labeled Interconnect,
Interconnect Configurations, and Logic Configuration. For multicontext devices, the savings
potential from denser interconnect description encodings can be large – up to 50-75% as the
number of configurations gets large.]

7.9 Effects of Interconnect Granularity
So far, we have looked entirely at single-bit level granularity networks and designs. In this
section, we look at how multi-bit designs and networks affect the relations we have already
developed.
In general, we will assume a $w$-bit datapath with $N$ total bit processing elements. Groups
of $w$ bit processing elements will act as a single compute node. We thus have $N/w$ such
compute nodes.
7.9.1 Wiring
We look at the wiring requirements, as we did before, by looking at the bisection bandwidth
implied by the network. Assuming Rent's rule based hierarchical interconnect, at the top level we
have on the order of $(N/w)^p$ i/o busses of width $w$. This makes for a total bisection bandwidth:
\[ BW_{bisect} \propto w \, (N/w)^p = N^p \, w^{(1-p)} \]
This makes the wire dictated area growth go as:
\[ A_{wire} \propto BW_{bisect}^2 = N^{2p} \, w^{2(1-p)} \]
Per LUT this makes:
\[ A_{wire}/N \propto N^{(2p-1)} \, w^{2(1-p)} \qquad (7.21) \]
Notice that this is actually larger than the interconnect wiring area required for single-bit
interconnects (Equation 7.4).
This result makes the assumption that the $w$ bits composing a node are tightly interconnected
or otherwise coupled such that the minimum bisection occurs between tree levels as before. If the
$w$ bits in a node were not interconnected in any way, the network could be decomposed into $w$
single bit networks. In such a case, the size would simply be $w$ times the size of an $N/w$ single
bit network, making:
\[ A_{wire} \propto w \, (N/w)^{2p} \]
\[ A_{wire}/N \propto N^{(2p-1)} \, w^{(1-2p)} \qquad (7.22) \]
Equation 7.22 implies the area is actually smaller than the single bit network for $p > 0.5$. In
practice the earlier result (Equation 7.21) is most realistic.
One issue this raises is that a $w$-bit design with Rent exponent $p$ implemented on top of a single
bit network will require more interconnect per level than a single bit design with Rent exponent $p$.
Using the same technique as in Section 7.7, we can solve for the required scale up factor, $s$:
\[ (s N)^p = w^{(1-p)} \, N^p \]
\[ s^p = w^{(1-p)} \]
\[ s = w^{(1/p - 1)} \qquad (7.23) \]
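The word-width relations above can be compared numerically. The sketch below is a rough illustration that assumes all proportionality constants are one and reads the growth terms as $N^{(2p-1)} w^{2(1-p)}$ for tightly coupled bits, $N^{(2p-1)} w^{(1-2p)}$ for fully decoupled bits, and $w^{(1/p-1)}$ for the scale-up factor; the function names are ours, not the text's.

```python
def wiring_per_lut_coupled(n: int, w: int, p: float) -> float:
    """Per-LUT wiring growth term when the w bits of a node are coupled."""
    return n ** (2 * p - 1) * w ** (2 * (1 - p))


def wiring_per_lut_decoupled(n: int, w: int, p: float) -> float:
    """Per-LUT wiring growth term for w independent single-bit networks."""
    return n ** (2 * p - 1) * w ** (1 - 2 * p)


def word_design_scale_up(w: int, p: float) -> float:
    """Scale-up factor for a w-bit design mapped onto a single-bit network."""
    return w ** (1.0 / p - 1.0)


if __name__ == "__main__":
    n, p = 4096, 0.75
    for w in (1, 8, 32):
        print(w,
              round(wiring_per_lut_coupled(n, w, p), 1),
              round(wiring_per_lut_decoupled(n, w, p), 1),
              round(word_design_scale_up(w, p), 2))
```

With $w = 1$ both wiring terms reduce to the single-bit case, which is a quick sanity check on the exponents.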
7.9.2 Switches
Switching requirements, in contrast, diminish with increasing $w$ since less flexibility is required
of the network with a given number of bit processing elements. We can derive the switching
requirements by substituting $N/w$ in for $N_{LUT}$ in Equation 7.11 then multiplying by the datapath
width:
2 ( 1 2 )
(1 2 )
(1 2 ) 1 0 5
log 2 1 2 0 5
(2 1) (1 2 ) 2 1
2 1 1 ( 1 2 ) 0 5
(7.24)
For large $w$, wiring requirements will asymptotically dominate switching requirements in a bussed
interconnect scheme.
7.10 Summary
This section focussed on interconnect. We established via an empirical review that interconnect
makes up the dominant area and delay in conventional FPGAs. We then went on to look at network
design issues. We established basic relations governing the interconnect requirements in terms of
network size and wiring complexity. In the process we showed that it is not always best to build
the network with sufficient interconnect to accommodate the most heavily interconnected designs
at full gate utilization; rather, since interconnect is the dominant area contributor, more efficient
area utilization can be achieved with networks with lower interconnect complexity. We also looked
at interconnect description requirements, noting that interconnect descriptions grow more slowly
than wire and switching requirements. Here, we pointed out that conventional devices use sparse
interconnect description encodings, using at least a factor of 2-4× more configuration bits than
necessary; this observation suggests that we have an opportunity to reduce the area required to
hold descriptions in multicontext devices and the bandwidth required for configuration reload in
single or multicontext devices. Finally, we looked at how the interconnect size relationships change
with wider word operations and saw that greater word widths increase wiring requirements while
decreasing switching requirements.
8. Instructions
The need for instructions to control device operation is one distinguishing feature of general-
purpose computing devices. These instructions give general-purpose devices their flexibility to
solve a variety of problems. At the same time, instructions require dedicated resources for storage
and delivery.
General-purpose computing architectures must address a number of important questions:
1. How are general-purpose processing resources controlled?
2. How much area is dedicated to holding the instructions which control these resources?
3. How many resources are controlled with each instruction?
4. How much bandwidth is provided for instruction distribution?
5. How frequently can instructions change?
There are many, different possible answers to these questions and the answers, in large part,
distinguish the various general-purpose architecture categories which we reviewed in Chapter 4
(e.g. word-wide uniprocessor, SIMD, MIMD, VLIW, FPGA, reconﬁgurable ALU). The answers
also play a large role in determining the efﬁciency with which the architecture can handle various
applications.
In this chapter, we look at the problem of instruction control and the resources involved. We
start by looking at the extreme case where every bit operation is given a unique instruction on every
cycle. This example illustrates that instruction distribution resource requirements can be quite
large – dominating other areas in a device. To combat these requirements, traditional architectures
have placed various, stylized restrictions on instruction distribution in order to contain its resource
requirements. Each of these restrictions also limits the realm of efficiency of the architecture.
We review these restrictions and their effects on device utilization efficiency in Section 8.3. Of
course, the opportunity to compress instruction distribution requirements depends on the inherent
compressibility of the instruction stream, suggesting that some computations will remain description
limited while others are compute limited (Section 8.4). We also look at the issue of instruction
stream control (Section 8.5). Finally, Section 8.6 organizes the architectural parameters reviewed
in this chapter into an expanded taxonomy for multiple data processing architectures.
8.1 General Case Example
Consider that we have $N$ bit processing elements. Each of these elements may be a 4-LUT
as in the previous section or a one bit ALU. We want to provide a different instruction to each
processing element on every cycle of operation. From Section 4, we see that 100 MHz+ operating
frequencies are readily achievable today, with many devices already achieving higher frequencies.
We will consider a 200 MHz operating frequency as one that will be easily achievable in the very
near future. We saw from Section 7.8 that each 4-LUT needs 40-50 bits to describe its network
conﬁguration and 16 bits to control the logic function. We will thus assume each LUT requires a
64-bit instruction to control it.
Let us further assume that we distribute the instructions, one per clock cycle, from all four
sides of the array of processing elements densely using 2 layers of metal with an 8λ wire
pitch. Of course, the assumption that we dedicate two full metal layers to instruction distribution
is extreme, but even making this best case assumption, we will see that the resources required for
full instruction control can dominate all other concerns. For the sake of easy comparison, we will
target a 1024λ × 1024λ = 1Mλ² array element, which is on par with large, conventional 4-LUT
FPGA implementations (Table 7.1).
Across the width of one processing element, we can run instruction distribution busses to control
two such elements:
\[ 2 \times 64 \text{ bits/LUT} \times 8\lambda \text{ wire pitch} = 1024\lambda \]
This means we can support an array with 2× as many processing elements ($N$) as it has edge
widths. That is:
\[ N = 2 \cdot 4 \sqrt{N} \]
From which we conclude $N = 64$.
At this point, we have fully saturated the i/o bandwidth into the compute array. Any further
increase in the number of elements supported in the array must be accompanied by a corresponding
increase in LUT size. That is:
\[ \sqrt{A_{LUT}} = \frac{64 \cdot 8\lambda \cdot N}{4 \sqrt{N}} = 128 \lambda \sqrt{N} \]
\[ A_{LUT} = 16384 \, N \lambda^2 \qquad (8.1) \]
Consequently, LUT area increases linearly with the number of processing elements supported in the
array. The instruction distribution bandwidth requirement is the dominant size effect determining
the density with which computational elements can be built. Notice that the LUT area growth
rate dictated by instruction bandwidth is larger than the interconnect growth rate for any value of
$p < 1$ and, ultimately, both interconnect and instruction distribution compete for the same limited
resource – wire bandwidth.
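The saturation argument can be checked with integer arithmetic. This is a minimal sketch, assuming 64-bit instructions, an 8λ wire pitch, and a 1024λ element side as stated above, and assuming the element side must widen to 64 · 8λ · N / (4√N) once the edge is saturated; the constant and function names are illustrative.

```python
INSTR_BITS = 64      # bits per LUT instruction
WIRE_PITCH = 8       # wire pitch, in lambda
ELEMENT_SIDE = 1024  # side of one processing element, in lambda


def instructions_per_edge_width() -> int:
    """Instruction busses that fit across one element width."""
    return ELEMENT_SIDE // (INSTR_BITS * WIRE_PITCH)


def saturation_elements() -> int:
    """Solve N = k * 4 * sqrt(N), with k instructions per edge width."""
    k = instructions_per_edge_width()
    return (4 * k) ** 2


def lut_area(n: int) -> int:
    """LUT area in lambda^2: fixed up to saturation, wire limited beyond it."""
    base_area = ELEMENT_SIDE ** 2
    # Squaring the side 64 * 8 * n / (4 * sqrt(n)) gives (64 * 8)^2 * n / 16.
    wire_limited_area = (INSTR_BITS * WIRE_PITCH) ** 2 * n // 16
    return max(base_area, wire_limited_area)


if __name__ == "__main__":
    print("saturation at N =", saturation_elements())
    for n in (16, 64, 256, 1024):
        print(n, lut_area(n))
```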
Further, we can calculate the actual instruction bandwidth requirements. At 200 MHz and
64-bit instructions, each LUT requires 1.6 GBytes/s of instruction distribution. For the $N = 64$
case above, this amounts to over 100 GBytes/s.
This kind of bandwidth could not, reasonably, be supported from off chip with any contemporary
technology, necessitating on-chip instruction memory. At 1000λ² per SRAM cell, 16 64-bit
instructions will occupy the same space as each 1Mλ² processing element. If more than 16 unique
instruction sets are required, instruction memory will occupy more area than the processing elements
and interconnect.
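The bandwidth and memory figures follow from simple arithmetic, sketched below. The constants come from the text (200 MHz, 64-bit instructions, 1000λ² per SRAM cell, 1Mλ² per processing element); the function names are illustrative.

```python
CLOCK_HZ = 200_000_000        # operating frequency
INSTR_BITS = 64               # bits per LUT instruction
SRAM_CELL_AREA = 1000         # lambda^2 per SRAM bit cell
ELEMENT_AREA = 1024 * 1024    # lambda^2 per processing element


def instruction_bandwidth_bytes_per_s(n_elements: int) -> int:
    """Bytes per second needed to give every element a new instruction each cycle."""
    return n_elements * CLOCK_HZ * INSTR_BITS // 8


def instruction_memory_area(n_instructions: int) -> int:
    """Area in lambda^2 of on-chip storage for n_instructions per element."""
    return n_instructions * INSTR_BITS * SRAM_CELL_AREA


if __name__ == "__main__":
    print(instruction_bandwidth_bytes_per_s(1) / 1e9, "GBytes/s per LUT")
    print(instruction_bandwidth_bytes_per_s(64) / 1e9, "GBytes/s for N = 64")
    print(instruction_memory_area(16), "lambda^2 for 16 instructions")
```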
8.2 Bits per Instruction
In the previous section we assumed 64 bits per instruction. This seems to be a reasonable,
ballpark estimate of the number of bits required to describe a single bit operation, including
interconnect.
Processors Modern processors generally employ 32 bits per instruction. However, as we saw in
Section 4.1, about half of the instructions issued by a microprocessor are interconnect operations.
Particularly, when we looked at gate evaluations, we saw that each processor instruction describes,
on average, about 0.5-0.6 gate evaluations.
FPGAs Modern FPGAs use 120-200 bits per 4-LUT. We pointed out in the previous chapter that
this was due to sparse encoding, and much denser encodings were possible. Simply using the
crossbar bound from Section 7.8.1, we see that we can handle a 4000 4-LUT device with 48 bits of
interconnect description and 16 bits of logic description. VEGA, a heavily multicontexted FPGA,
with 85 bits per instruction, comes closer to this range [JL95].
8.3 Compressing Instruction Stream Requirements
Section 8.1 showed us that we cannot afford to have full, independent, cycle-by-cycle control
of every bit operation without instruction storage and distribution requirements dominating all
other resource requirements. Consequently, we generally search for application characteristics
which allow us to describe the computation more compactly. In this section, we review the most
common techniques generally employed to reduce instruction size and bandwidth. We see that
every architecture reviewed in Chapter 4 exploits one or more of these compression techniques.
8.3.1 Wide Word Architectures
Processors do not, commonly, operate on single bit data items. Rather, sets of $w$ bit elements
($w$ = 8, 16, 32, 64, ...) are grouped together and controlled by a single instruction in SIMD style. This
has the effect of reducing instruction bandwidth requirements and instruction storage requirements
by a factor of $w$. This compression scheme takes advantage of the fact that we commonly do want
to operate uniformly on multibit quantities. We can, therefore, effectively amortize instruction
resources across multiple bit processing elements.
Returning to our opening example, we can support $w^2$× more processing elements before reaching
the same point of wire saturation:
\[ N / w = 2 \cdot 4 \sqrt{N} \]
\[ \sqrt{N} = 8 w \]
\[ N = 64 \, w^2 \]
The utilization efficiency of the resulting architecture depends on the extent to which all
operations are $w$-bit operations, or even multiples thereof. When smaller operations, $w_{op} < w$, are
required, $w - w_{op}$ bit processing units will sit idle while only $w_{op}$ units provide useful work.
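The effect of word width on the saturation point and on utilization can be sketched as follows. This assumes the same two-instructions-per-edge-width budget as the opening example, and it treats operations wider than the word as occupying a whole number of words; the names `saturation_elements` and `utilization` are illustrative.

```python
import math


def saturation_elements(w: int, instr_per_edge_width: int = 2) -> int:
    """Bit processing elements supported at wire saturation for width w.

    Solves N / w = k * 4 * sqrt(N) for N, with k instructions per edge width.
    """
    return (4 * instr_per_edge_width * w) ** 2


def utilization(op_width: int, w: int) -> float:
    """Fraction of bit processing units doing useful work for one operation."""
    words_used = math.ceil(op_width / w)
    return op_width / (words_used * w)


if __name__ == "__main__":
    for w in (1, 8, 32):
        print(w, saturation_elements(w))
    print(utilization(8, 32))   # an 8-bit operation on a 32-bit datapath
    print(utilization(64, 32))  # a 64-bit operation uses two full words
```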
8.3.2 Broadcast Single Instruction to Multiple Compute Units
SIMD and vector machines take this instruction sharing one step further. They arrange so that
multiple functional units operating on nominally different words share the same instruction. This
allows them to scale up the number of bit operators without increasing the word granularity or
instruction bandwidth. It does, however, increase the operation granularity. To remain efficient
now, the application requires $(m \cdot w)$-bit operations, where $m$ is the number of word-wide datapaths
controlled by each instruction.
8.3.3 Locally Conﬁgure Instruction
Reconﬁgurable architectures, such as FPGAs, take advantage of the fact that little instruction
bandwidth is needed if the instructions do not change on every cycle. Each bit processing element
gets its own, unique, instruction which is stored locally. However, this instruction cannot change
from cycle to cycle. A limited bandwidth path is used to change array instructions when necessary.
There are two viewpoints from which to approach the efﬁciency of this restriction:
1. Frequency of Instruction Change – Given a lower bandwidth path into the array and a set of compute
elements with long instructions, each context reload takes a reload time, $T_{reload}$, equal to the total
number of instruction bits divided by the path bandwidth. If the array computes for $T_{compute}$ between
reloads, the efficiency of operations is equal to the fraction of
time spent computing versus total compute and reload time:

$$\text{Efficiency} = \frac{T_{compute}}{T_{compute} + T_{reload}}$$

This, of course, assumes that every processor is doing useful work on each cycle.
2. Task Critical Path Length – Alternately, if we assume the computing array is sufficiently
large to perform the task, the efficiency is the fraction of compute elements performing useful
work on each cycle. If the device has $N_p$ processors and must perform a task with $N_{ops}$ bit operations
which has a critical path of length $l_{path}$, then the efficiency is the number of useful bit
operations divided by the product of the total number of processors and the critical path length:

$$\text{Efficiency} = \frac{N_{ops}}{N_p \cdot l_{path}}$$
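The two viewpoints above can be sketched numerically. This is a minimal illustration, not the report's own formulation: the function names, the parameter names, and the example array (1024 elements with 64-bit instructions loaded over a 64-bit path) are assumptions chosen only to make the ratios concrete.

```python
def reload_efficiency(compute_cycles, instruction_bits, path_bits_per_cycle):
    """Fraction of time spent computing when each context needs a full reload.

    Assumes reload time is simply total instruction bits divided by the
    bandwidth of the (hypothetical) configuration path.
    """
    reload_cycles = instruction_bits / path_bits_per_cycle
    return compute_cycles / (compute_cycles + reload_cycles)


def critical_path_efficiency(useful_bit_ops, processors, path_length):
    """Useful bit operations divided by (processors x critical path length)."""
    return useful_bit_ops / (processors * path_length)


# Hypothetical array: 1024 elements x 64 instruction bits, 64-bit load path,
# computing for 1024 cycles between reloads.
print(reload_efficiency(1024, 1024 * 64, 64))

# Hypothetical task: 2048 bit operations, critical path of 8, on 1024 processors.
print(critical_path_efficiency(2048, 1024, 8))
```

In the first case the reload takes as long as the computation, so half the time is spent computing; in the second, each processor holds one instruction for all eight steps of the path, so only a quarter of the processor-cycles do useful work.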
For the frequency of change case, partial reconfiguration can reduce the reload time by allowing
individual processing units to change instructions without requiring an entire reload of all instructions
in the array. This can increase reload efficiency if a large fraction of the instructions does
not change. Partial reconﬁguration can also allow portions of the array to change their instructions
while other portions of the array continue to operate. Modern FPGAs from Plessey [Ple90], Atmel
[Atm94], and Xilinx [Xil96] support partial reconﬁguration for these reasons.
8.3.4 Broadcast Instruction Identiﬁer, Lookup in Local Store
A hybrid form of instruction compression is to broadcast a single instruction identiﬁer and
look up its meaning locally. This allows us to use a moderately short, single “instruction” across the
entire array in a manner similar to SIMD instruction broadcast. Each processing element performs
a local lookup from the broadcast instruction identifier to generate a full length instruction. DPGAs
(Chapter 10), PADDI [CR92], and VLIW machines with an independent cache for each functional
unit, exhibit this kind of hybrid control.
This technique is similar to a dictionary compression scheme where the set of entries at each
“instruction” address makes up one element in the dictionary. The instruction address is the encoded
symbol which can now be transmitted into the array with minimal bandwidth. The key benefit here
is that the parallel instruction sets can be tailored to each application in the same way the dictionary
can be tailored to a message or message type.
Efficiency in this scheme is similar to the task critical path length case above. The difference
is that each single processor need not be dedicated to a single instruction. With local instruction
stores each holding $c$ instructions, an array of $N_p$ processing nodes can perform up to $c \cdot N_p$
different bit operations. In the single instruction configuration case, a critical path of $l_{path} > 1$
implied a peak, achievable efficiency of $1/l_{path}$. In this case, the peak, achievable efficiency is
$\min(1, c/l_{path})$.
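As a small worked illustration of this bound, the sketch below evaluates the peak efficiency for a few context depths and path lengths. The function and argument names are assumptions introduced here, and the bound is the idealized peak only; it ignores packing and interconnect limits.

```python
def peak_efficiency(contexts, path_length):
    """Peak achievable efficiency with `contexts` local instructions per element.

    With one context, an element is busy for 1 of `path_length` cycles.
    With more contexts it can be reused for up to `contexts` steps of the path.
    """
    return min(1.0, contexts / path_length)


print(peak_efficiency(contexts=1, path_length=8))    # single-context case
print(peak_efficiency(contexts=4, path_length=16))   # partially covered path
print(peak_efficiency(contexts=16, path_length=8))   # more contexts than needed
```

Once the number of local contexts reaches the path length, extra contexts no longer raise the peak; they only add instruction memory.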
Viewed in terms of instruction change frequency, the additional local configurations can serve
as a cache, diminishing the need to fetch array instructions from outside the array. Like a cache,
if each instruction set required is unique, it will provide no benefit. However, when an array
instruction can be kept in the array and used several times before being replaced, we reduce the
required instruction bandwidth. Use of a loaded instruction can occur at the operational cycle rate
rather than at the bandwidth limited reload rate.
With multiple configurations it is possible to arrange for instruction reload to occur as a
background task operating in parallel with operation. If the reload time is less than
the run time within the balance of the loaded instruction memory, and the next
instruction can be predicted sufficiently in advance, reload time can be completely overlapped with
computation.
8.3.5 Encode Length by Likelihood
Since it is unlikely that all instructions will be used with equal frequency, one can break
instructions into a series of smaller words, giving common instructions short encodings. If we
Huffman encode our instruction into a series of short, fixed-width words, we can, potentially, reduce the instruction
distribution bandwidth by a factor of $\log_2(\text{instructions})$; that is, 64 if we assume 64 bit instructions.
The efficiency now depends on the expected number of these short words required to construct a
single, logical instruction. If the instruction stream entropy is low, this kind of encoding can
be very efficient, asymptotically approaching one symbol per instruction. Conversely, if the
instruction stream entropy is high, or even flat, it may take a full
$\log_2(\text{instructions})$ cycles to
build up a single instruction. Worse, if the instruction frequency is substantially different from the
instruction frequency for which the encoding was optimized, it can actually take more cycles, on
average, to build an instruction. Of course, the Huffman encodings could be variable, as well, to
avoid this mismatch, but the space required to handle programmable Huffman encodings will likely
exceed the area of several computational units.
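The dependence on instruction stream entropy can be illustrated with a short sketch that builds a Huffman code and reports the expected number of transmitted words per instruction. It assumes one-bit words and small, made-up instruction frequency tables; the function names are hypothetical and not taken from the report.

```python
import heapq


def huffman_code_lengths(probs):
    """Return the Huffman code length (in bits) for each symbol probability."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (probability, unique tiebreak, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every symbol under the merged node gets one bit deeper
        heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def expected_words(probs):
    """Expected one-bit words per instruction under the Huffman code."""
    return sum(p * l for p, l in zip(probs, huffman_code_lengths(probs)))


skewed = [0.5, 0.25, 0.125, 0.125]   # low-entropy stream (assumed frequencies)
flat = [1 / 16] * 16                 # flat stream over 16 instructions
print(expected_words(skewed))
print(expected_words(flat))
```

For the flat table every instruction needs the full log2(16) = 4 one-bit words, while the skewed table averages fewer than two, which matches the qualitative claim above.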
8.3.6 Mode Bits for Early Bound information
All of the bits in an instruction do not always need to change at once — or portions of an
instruction may change at different rates. Rather than include the infrequently changing portions
of the instruction in the word which is broadcast from cycle to cycle, these portions can be
factored out of the broadcast instruction and explicitly loaded with new values only when they
need to change. This allows us to describe richer instructions with less bandwidth. These locally
configured instructions can be seen as a special case of the likelihood encoding in the previous section,
but in this case we exploit the low frequency of change rather than simply the low frequency of
occurrence of some instructions.
Mode bits such as these are used to deﬁne operational modes in several architectures. Floating
point coprocessors often use mode bits to deﬁne rounding modes. Segmented SIMD architectures
such as Abacus [BSV+95] and the dynamic computer groups of [KK79] use mode bits to define
segmentation of the SIMD datapaths.
Bandwidth savings depend on the number of bits factored out of the broadcast instruction
stream. Efficiency depends on the frequency with which non-broadcast instruction values need to
change. Typically, it takes an instruction cycle to load each mode value, which is an instruction
cycle that does not contribute to execution.
8.3.7 Themes
Two major themes emerge from the techniques listed here:
1. Granularity – How many resources are controlled by each instruction? From a resource cost
standpoint, this is the motivation behind word-wide datapaths, SIMD, and vector processors.
2. Local Configuration Memory – How many instructions are stored locally per active computing
element? Similarly, this is the motivation behind configurable architectures and local
memories.
In the next chapter, we will look at the effects which these techniques have both on resource requirements
and on utilization efficiency.
8.4 Compressibility
Of course, we can only succeed in compressing the instruction bandwidth when there is structure
to the task description for us to exploit. If the task descriptive complexity really is as large as implied
in Section 8.1, we are instruction bandwidth limited, and instruction distribution does determine
achievable computational density.
This suggests we have two extremes in the characterization of computing tasks:
1. Descriptive Complexity Limited – the instruction bandwidth to describe the computation
limits the rate of execution.
2. Compute Limited – the active computing elements performing the required computation
limit the rate of execution.
Regular tasks such as signal and stream processing, systolic computations, and computational inner
loops are typically compute limited. Irregular, run once, tasks such as initialization, cleanup, and
exception handling are typically descriptive complexity limited. Of course, applications tend to
have a mix of both elements. It has long been observed that only a small fraction of the code in a
typical application accounts for most of the computational time [Knu71]. The regions composing
this small fraction are heavily reused, allowing the computation to be described compactly. The
code outside of the heavily used fraction does not benefit from the heavy reuse amortization and
will tend to be more description limited.
As with most compression schemes, the amount of compression achievable, in practice, also
depends heavily on the frequency of repetition and storage space available. For example, if a task
performs a sequence of one million, unique operations, then restarts the sequence, the stream is
very repetitive, and an infinite sequence of such repetitions contains a constant amount of
information. However, unless we have space to hold all one million instructions on chip, we will
not be able to take advantage of this regularity and low information content in order to compress
instruction bandwidth requirements. Further, holding one million instructions on chip is a large
cost to pay for instruction storage, even by today’s standards.
8.5 Control Streams
In Sections 8.1 and 8.3, we viewed the set of processing elements as having a single, large,
array-wide, instruction. In general, the array-wide instruction context may be decomposed into
a number of independent instruction streams. This decomposition does not change the aggregate
instruction bandwidth which may be required into the array, but it may change the number of distinct
contexts used by the array and hence the requirements for instruction distribution and storage.
Let us assume, as in the case of Section 8.3.4, that each processing element has a local store
of $c$ instructions. Let us also assume we have a series of $n$ independent tasks, each composed
of at most $c$ instructions. The total number of distinct, array-wide contexts may be as large as
$c^n$, since the tasks are independent and any combination of instructions is possible. If each
of the $n$ tasks is controlled separately, we need only $c$ instructions for each task to describe and control
the tasks. If we must control the tasks with a single instruction stream, that stream requires all
$c^n$ contexts and hence a larger number of instructions, $c^n$, are required.
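The gap between separate and merged control can be made concrete by counting. The sketch below is illustrative only: it assumes a hypothetical number of tasks and instructions per task, and enumerates the combined contexts that a single control stream would have to distinguish.

```python
from itertools import product


def separate_control_instructions(num_tasks, instrs_per_task):
    """Total instructions stored when each task has its own control stream."""
    return num_tasks * instrs_per_task


def merged_control_contexts(num_tasks, instrs_per_task):
    """Distinct array-wide contexts when one stream must cover every combination."""
    per_task = range(instrs_per_task)
    return sum(1 for _ in product(per_task, repeat=num_tasks))


# Hypothetical case: 3 independent tasks, each with 4 instructions.
print(separate_control_instructions(3, 4))
print(merged_control_contexts(3, 4))
```

Even in this tiny case the single stream needs 64 contexts where separate streams hold 12 instructions in total, and the gap grows exponentially with the number of independent tasks.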
This example demonstrates that there is a control granularity which is a distinct entity from
the operation granularity introduced in Sections 8.3.1 and 8.3.2. As with operation granularity, we
can compress instruction control requirements by sharing the control among a number of operating
units. However, if we control too many units by the same control stream, we are forced to use the
device inefficiently. In the worst case, we may pay an efficiency or compaction penalty in proportion
to the product of the instruction sets of the independent operations which must be combined into a
single control stream.
The separate streams of control are, of course, what distinguishes MIMD architectures (Sec-
tion 4.9), as well as MSIMD (e.g. [Bri90, Nut77]) or MIMD multigauge [Sny85] architectures.
Control Threads (PCs) | Instructions per Control Thread | Instruction Depth | Granularity | Architecture/Examples
0 0 n/a Hardwired Functional Unit
0 (e.g. ECC/EDC Unit, FP MPY, Hardware Systolic)
1 FPGA, Programmable Cellular Automata
1 Reconﬁgurable ALUs
Programmable Systolic Datapath Arrays
1 Bitwise SIMD
1 Traditional Processors
Vector Processors
1 1 DPGA
8 16 PADDI
VLIW
1 MSIMD
1 VEGA
1 8 16 PADDI-2
MIMD (traditional)
Table 8.1: Instruction Control Taxonomy
8.6 Instruction Stream Taxonomy
Table 8.1 categorizes the various architectures we have reviewed in Chapter 4 according to the
granularity, local instruction storage depth ($c$), number of distinct instructions per control
thread, and number of control threads supported. This taxonomy elaborates the multiple
data portion of Flynn’s classic architecture taxonomy [Fly66] by segregating instructions from
control threads and adding granularity.
8.7 Summary
In this chapter, we have seen that the requirements for instruction distribution and storage can
dominate all other resources on general-purpose computing devices, dictating the size and density of
computing elements. A major distinguishing feature of modern, general-purpose architectures is the
way in which they compress the requirements for instruction control. Traditional microprocessors,
SIMD, and vector machines reduce the requirements by sharing a single instruction across many
bits or words. FPGAs and programmable systolic arrays reduce requirements by maintaining the
same instruction from cycle to cycle. VLIW-like architectures use small, local instruction stores
addressed by short addresses so that limited instruction distribution bandwidth can effect cycle-
by-cycle changes in non-uniform instructions. Each of the techniques used to reduce instruction
control resources comes with its own limitations on achievable efﬁciency should the needs of the
application not meet the stylized form in which the instruction bandwidth reduction is performed.
Some instruction sequences are more compressible than others, suggesting we have a continuum
of task descriptive complexities such that some tasks are, by nature, instruction bandwidth limited
while others are parallel computing resource limited. In this chapter we reviewed both the nature of
the resource reductions and the efficiency limits which arise from these techniques. In the following
chapter, we will combine these effects with our size and growth observations from Chapters 4 and 7
to model the size and efﬁciency of reconﬁgurable computing architectures.
9. RP-space Area Model
In this chapter, we put together the sizings from Chapters 4 and 7, the growth rates from Chapter 7,
and the instruction requirements from Chapter 8 to form a unified area model for RP-space, a
large class of reconfigurable processing architectures. The area model gives us a first order size
estimate for reconfigurable computing devices based on the key parameters identified in the previous
chapters. We use this model to estimate peak computational density as a function of granularity
and on-chip instruction store sizes. We also use it to characterize the way computational efﬁciency
decreases as application granularity and path lengths differ from the architecture’s optimal points.
9.1 Model and Assumptions
We assume an array of homogeneous, general-purpose processing elements. For pedagogical
purposes, no special-purpose processing units are included. The area for each bit processing
element is taken to include:
Fixed area for the computational function
Amortized storage space for instructions
Storage space for data
Space for interconnect resources
Amortized space for control
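We compute the area per bit processing element as:

$$A_{bit\_elm} = A_{fixed} + \underbrace{N_{sw} \cdot A_{sw}}_{\text{interconnect}} + \underbrace{\frac{n_{ibits} \cdot c}{w} \cdot A_{mem\_cell}}_{\text{instruction memory}} + \underbrace{d \cdot A_{mem\_cell}}_{\text{data memory}} + \underbrace{\frac{n_{sc} \cdot A_{ctrl}}{N_p}}_{\text{control area}} \qquad (9.1)$$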
Table 9.1 summarizes the parameters used in Equation 9.1.
$A_{mem\_cell}$ = 1200λ² is typical of static memory, which we will assume here. Memory cells
packed into large arrays are likely to be denser, on average, than small arrays or isolated memory
cells. Dynamic memory cells may be a factor of four smaller in large arrays, where appropriate.
Equation 9.1 assumes that interconnect area is proportional to the number of switches. In
Sections 7.6 and 7.9, we saw that switch growth rates match or determine interconnect growth rate.
In Section 7.9, we did see that wiring can come to dominate switch growth, which is not
accounted for by Equation 9.1. $A_{sw}$ = 2500λ² is a constant of proportionality intended to match the
number of switches to the empirical interconnect areas typically seen rather than a model of any
particular interconnect geometry. Table 9.2 summarizes the number of switches as a function of
array size for a Rent parameter of 0.5, as will be used here. This is the same data which was plotted in Figure 7.5
for a Rent parameter of 0.5; the only difference is the network size used when determining the number of switches (See
Equation 7.24).

Parameter | Role | Assumed Value
$A_{bit\_elm}$ | Area per bit processing element |
$A_{fixed}$ | Fixed area per compute element (LUT mux, output flip-flop, buffers) | 20Kλ²
$w$ | Datapath width – number of bit elements controlled by one instruction |
$c$ | Contexts – number of instructions stored per group of processing elements |
$c_{max}$ | Total number of instruction or data contexts addressed by controller |
$n_{ibits}$ | Number of bits in each instruction | 64
$A_{mem\_cell}$ | Area of a configuration or data memory cell | 1200λ²
$N_p$ | Number of bit processing elements in the array |
$N_{sw}$ | Number of switches per bit processing element | [Eq. 7.24]
$a$ | Tree arity in modeled hierarchical interconnect | 2
$k$ | Number of LUT inputs | 4
$p$ | Rent parameter for network | 0.5
$A_{sw}$ | Amortized area of each switch | 2500λ²
$d$ | Number of data bits per bit processing element |
$n_{sc}$ | Number of independent stream controllers |
$A_{ctrl}$ | Area of instruction stream controller | 0.3Mλ² · $\log_2(c_{max})$
Table 9.1: Summary of Area Model Parameters
For devices with multiple contexts, a controller manages the selection and sequencing of
instructions in the array. The area we use for the controller is a rough estimate based on a sampling of
processor implementations (See Table 9.3). We assume that the area in the controller is proportional
to the number of instruction address bits, that is, $\log_2$ of the number of contexts addressed by the controller. FPGAs traditionally have a single context,
making the controller area zero, while processors have controllers composing the program counter and branching
logic.
FPGA Example: Traditional FPGAs have $c$ = 1 and $w$ = 1. Equation 9.1, for an array of 4096 bit processing elements,
computes an area of 870Kλ² per bit processing element. Comparing with Table 7.1, we see this is in the range of conventional
devices.
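The FPGA figure can be checked with a few lines of arithmetic. The sketch below is an assumed reading of the area model, not a transcription of it: it sums a fixed area, switch area, instruction memory amortized over the datapath width, data memory, and amortized controller area, using the assumed values from Table 9.1 and the 311 switches per element that Table 9.2 lists for a 4096-element array. All function and parameter names are introduced here for illustration.

```python
def bit_element_area(w, c, n_sw, d=0, n_sc=0, a_ctrl=0.0, n_p=1,
                     a_fixed=20_000, a_sw=2_500, a_mem_cell=1_200, n_ibits=64):
    """Area per bit processing element, in units of lambda^2 (assumed model form)."""
    interconnect = n_sw * a_sw                       # switches per element x switch area
    instruction_memory = (n_ibits * c / w) * a_mem_cell  # c instructions shared by w bits
    data_memory = d * a_mem_cell                     # d data bits per element
    control = n_sc * a_ctrl / n_p                    # controllers amortized over the array
    return a_fixed + interconnect + instruction_memory + data_memory + control


# Conventional FPGA point: single context, single-bit granularity, 4096 elements.
fpga_area = bit_element_area(w=1, c=1, n_sw=311, n_p=4096)
print(fpga_area)
```

Under these assumptions the sum comes to about 874Kλ², within half a percent of the 870Kλ² quoted in the example, with the switch term accounting for most of the area.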
PADDI-2 Example: PADDI-2 is made from 48, 16-bit units. Each has an 8 instruction memory
($c$ = 8) and effectively 6 words of data per compute element ($d$ = 6). PADDI-2 has
3 inputs per EXU and an initial convergence of 4. Equation 9.1 predicts 370Kλ² per
bit operation or 284Mλ² for the entire array, which is about half the size of the prototype PADDI-2
die which is 576Mλ².

Elements | Switches | Elements | Switches | Elements | Switches
1 | 0 | 32 | 100 | 1024 | 252
2 | 16 | 64 | 131 | 2048 | 281
4 | 31 | 128 | 162 | 4096 | 311
8 | 49 | 256 | 192 | 8192 | 340
16 | 69 | 512 | 222 | 16384 | 370
Table 9.2: Switches per bit processing element versus array size, for Rent parameter 0.5, 4 LUT inputs, and tree arity 2

Design | Controller Area
MIPS-X [HHC+87, Cho89] | 8Mλ²
PA-RISC [YFJ+87] | 12Mλ²
VIPER [GNAB93] | 12Mλ²
Table 9.3: Area for Instruction Control Sampling
9.2 Peak Performance Density
Using the model, we can examine the peak computational densities from various architectural
configurations in RP-space. Figure 9.1 plots computational density against datapath width, $w$, and
the number of instructions per function group, $c$. As $w$ increases there is more sharing of instruction
memories and fewer switches required in the interconnect, resulting in smaller bit processing element
cell sizes or higher densities. As $c$ increases, there are more instructions per compute element
resulting in lower densities. The effect of more instructions is more severe for smaller datapath
widths, $w$, since there are fewer processing elements against which to amortize instruction overhead.
For single context designs, there is only a factor of 2.5 difference in density between single bit
granularity and 128-bit granularity. At this size, network effects dominate instruction effects, and
the factor of 2.5 difference comes almost entirely from the difference in switching requirements. For
heavily multicontext devices at the same number of instruction contexts, the difference between fine
and coarse granularity is greater since the instruction memory area dominates (See also Figure 9.2).
At 1024 contexts, the 128 bit datapath is 36× denser than an array with bit-level granularity.
As the number of contexts, $c$, increases, the device is supporting more loaded instructions; that
is, a larger on chip instruction diversity. Figure 9.2 shows how instruction density increases with
increasing numbers of contexts alongside the decrease in peak computational density.
These same density trends hold if we set aside a fixed amount of data memory. The area outside
of the data memory will follow the same density curves shown here.
[Figure 9.1 plot: peak computational density (0.2 to 1.0) versus number of contexts $c$ (1 to 1024) and datapath width $w$ (1 to 128). Model parameters: 4 LUT inputs, tree arity 2, Rent parameter 0.5, 16384 bit processing elements. Reference density of 1.0 corresponds to $w$ = 128, $c$ = 1.]
Figure 9.1: Peak Computational Density Versus Contexts and Datapath Width
[Figure 9.2 plots: Left – Computational Density; Right – Instruction Density; each versus number of contexts $c$ (1 to 1024) and datapath width $w$ (1 to 128). Model parameters: 4 LUT inputs, tree arity 2, Rent parameter 0.5, 16384 bit processing elements.]
Figure 9.2: Compute and Instruction Densities Versus Contexts and Datapath Width
9.3 Granularity
As noted in the previous chapter, we can use larger granularity datapaths to reduce instruction
overheads. The utility of this optimization depends heavily on the granularity of the data which
needs to be processed. As noted in the previous section, the coarser the granularity the higher the
peak performance. However, if the architectural granularity is larger than the task data granularity,
portions of the device’s computational power will go to waste.
We can model the effects of pure granularity mismatches using the area model developed above.
First, we note that the optimal configuration for a given word size will always be the architecture
which has the same word size as the task. We can then determine the efficiency associated with
running tasks with word size $w_{design}$ on an architecture with word size $w_{arch}$ by dividing the area
required to support the task on a $w_{design}$ architecture by the area required on a $w_{arch}$ architecture.
For $w_{design} = n \cdot w_{arch}$, for some integer $n$, the efficiency is simply the ratio of the bit processing
element areas. For $w_{design} < w_{arch}$, the task can run on top of the low $w_{design}$ bit processing elements in
the $w_{arch}$ architecture datapath, leaving the remaining processing elements unused. The efficiency here
is the ratio of the area of $w_{design}$ bit processing elements from a $w_{design}$ architecture versus $w_{arch}$ bit
processing elements from a $w_{arch}$ architecture.

$$\text{Efficiency} = \frac{w_{design} \cdot A_{bit\_elm}(w_{design})}{\left\lceil w_{design} / w_{arch} \right\rceil \cdot w_{arch} \cdot A_{bit\_elm}(w_{arch})} \qquad (9.2)$$
Note that a single-chip implementation is assumed for comparison so that there are no boundary
effects between components.
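The granularity mismatch calculation can be sketched as a ratio of areas. This is an illustration under stated assumptions: the area function below is a toy stand-in (a constant 400Kλ² of fixed and interconnect area plus 64-bit instructions amortized over the datapath width), not the full model, and all names are hypothetical. In the real model the switch count also falls as the datapath widens.

```python
import math


def granularity_efficiency(w_design, w_arch, area_per_bit):
    """Area on a matched architecture divided by area on the actual architecture."""
    groups = math.ceil(w_design / w_arch)            # datapath groups the task occupies
    ideal_area = w_design * area_per_bit(w_design)   # task on a w_design-wide architecture
    actual_area = groups * w_arch * area_per_bit(w_arch)
    return ideal_area / actual_area


def toy_area(w, c=1):
    """Toy per-bit area (lambda^2): constant active area plus amortized instructions."""
    return 400_000 + 64 * c * 1_200 / w


print(granularity_efficiency(8, 8, toy_area))    # matched widths
print(granularity_efficiency(1, 8, toy_area))    # fine-grain task on a wide datapath
print(granularity_efficiency(64, 8, toy_area))   # wide task on a narrower datapath
```

With this toy area function, a single-bit task on an 8-bit datapath wastes most of the datapath, while a 64-bit task on the same datapath loses only the small extra instruction overhead, which mirrors the single-context behavior described for Figure 9.3.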
Figure 9.3 shows the efficiency for various architecture and task granularities. At $c$ = 1, the
active switching area dominates. The fine granularity ($w$ = 1) has the most robust efficiency across
task granularities. The efficiency drops off quickly for large grain architectures supporting fine
grain tasks.
Figure 9.4 shows that the robustness shifts as the number of contexts increases. For $c$ = 1024,
the instruction memory space dominates the area. Consequently, the redundancy which arises
when fine-grained architectures run coarse-grain tasks is quite large, leading to rapidly decreasing
efficiency with increasing task grain size. In this regime, the coarse-grain architectures are more
robust, since the extra datapath and networking elements are moderately inexpensive compared to
the large area dedicated to instruction memory. For $c$ = 1024, $w$ = 32 is the most robust datapath
width, as shown in Figure 9.5.
[Figure 9.3 plot: efficiency (0.2 to 1.0) versus design (task) $w$ (1 to 64) and architecture $w$ (1 to 128). Model parameters: 4 LUT inputs, tree arity 2, Rent parameter 0.5, $c$ = 1, 16384 bit processing elements.]
Figure 9.3: Efficiency as a Function of Architectural and Task Granularity for Single Context Architectures
These robust points correspond to the mix where the context memory makes up roughly half
the area of the device.

$$\text{instruction memory area} \approx \frac{1}{2} \cdot \text{bit processing element area} \qquad (9.3)$$

At this point:
Finer grain devices running coarser granularity tasks waste, at most, a little over half of their
area – the memory area plus the switching overhead associated with finer granularity.
Coarser grain devices running fine-grain tasks waste at most half of their area – the unused
datapath area.
[Figure 9.4 plots: four panels of efficiency (0.2 to 1.0) versus design (task) $w$ (1 to 64) and architecture $w$ (1 to 128). Top: Left – $c$ = 1; Right – $c$ = 16. Bottom: Left – $c$ = 256; Right – $c$ = 1024. Model parameters: 4 LUT inputs, tree arity 2, Rent parameter 0.5, 16384 bit processing elements.]
Figure 9.4: Efficiency as a Function of Architectural and Task Granularity
[Figure 9.5 plot: efficiency (0.0 to 1.0) versus task data width (1 to 128).]
Figure 9.5: Efficiency versus Task Data Width for a 1024-context, 32-bit Granularity Device
9.4 Contexts
We saw in Section 9.2 that the computational density is heavily dependent on the number
of instruction contexts supported. Architectures which support substantially more contexts than
required by the application allow a large amount of silicon area dedicated to instruction memory to
go unused. Architectures which support too few contexts will leave active computing and switching
resources idle waiting for the time when they are needed.
We can model the effects of varying application requirements and architectural support in an
ideal setting using the area model. We assume we have a repetitive task requiring $N_{ops}$ operations
which has a path length $l_{path}$. In an ideal packing, an architecture with $N_{ops}/l_{path}$ processing
units and $c = l_{path}$ instruction contexts can support the task optimally. If $c > l_{path}$, the area per
processing element is larger than necessary to support the application. If $c < l_{path}$, it will be
necessary to use more processing elements simply to hold the total set of instructions.

$$\text{Efficiency} = \frac{(N_{ops}/l_{path}) \cdot A_{bit\_elm}(c = l_{path})}{\max\left(N_{ops}/l_{path},\; N_{ops}/c\right) \cdot A_{bit\_elm}(c)} \qquad (9.4)$$
This relation is shown for several datapath widths, $w$, in Figure 9.6. Again, single chip implementations
are assumed for comparison.
The efficiency dropoff for $c > l_{path}$ is less severe for large datapaths, large $w$, than for small
datapaths. Similarly, the dropoff for $c < l_{path}$ is less severe for small datapaths than for large
datapaths. This effect is due to the relative area contributed by instructions. In the small $w$ case,
the instruction area takes up relatively more area than in the large $w$ case, so the cost of extra active
area is relatively smaller than in the large $w$ case. In the large datapath case, the instructions make
up a lower percentage of the area so the overhead for extra instructions is relatively smaller.
The 16 instruction context case is the most robust across this range for single bit datapaths (See
Figure 9.7). Similarly, 256 instruction contexts is the most robust for $w$ = 128 (See Figure 9.8).
Neither of these cases drops much below 50% efficiency at either the $c > l_{path}$ or $c < l_{path}$
extremes. These “robust” cases correspond to the points where the instruction memory area is
roughly equal to the active network and computing area. In either extreme, at most half of the
resources are being underutilized. Our robust context selection can be defined as:

$$\text{instruction memory area} \approx \frac{1}{2} \cdot \text{bit processing element area} \qquad (9.5)$$

Remember that the network resource requirements grow with array size. In the $c < l_{path}$ case,
where we must deploy more processing elements to handle the task, the total number of processing
elements increases, causing the switching area per processing element to increase as well. This
effect accounts for the fact that the efficiency can drop below 50% and for the approximate relation in
Equation 9.5.
[Figure 9.6 plots: four panels of efficiency (0.2 to 1.0) versus task path length (1 to 1024) and architectural contexts $c$ (1 to 1024). Top: Left – $w$ = 1; Right – $w$ = 8. Bottom: Left – $w$ = 64; Right – $w$ = 128. Model parameters: 4 LUT inputs, tree arity 2, Rent parameter 0.5, 16384 bit processing elements.]
Figure 9.6: Efficiency as a Function of Task Path Length and Architectural Contexts
[Figure 9.7 plot: efficiency (0.0 to 1.0) versus task path length (1 to 1024).]
Figure 9.7: Efficiency versus Task Path Length for a 16-context, Single-bit Granularity Device
[Figure 9.8 plot: efficiency (0.0 to 1.0) versus task path length (1 to 1024).]
Figure 9.8: Efficiency versus Task Path Length for a 256-context, 128-bit Granularity Device
9.5 Composition
In general, we see cumulative effects of the grain size and context depth mismatches between
architecture and task requirements. Figure 9.9 shows the yielded efficiency versus both application
path length and grain size for the conventional FPGA design point of a single context and a single bit
datapath. The FPGA drops to 1% efficiency for large datapaths with long path lengths. Similarly,
Figure 9.10 shows the efficiency of a wide word ($w$ = 64), deep memory ($c$ = 1024) design point.
While this does well for large path lengths and wide data, its efficiency at a path length and data
size of one is 0.5%. Notice here that the wide, coarse-grain design point is over 100× less efficient
than the FPGA when running tasks whose requirements match the FPGA, and the FPGA is 100×
less efficient than said point when running tasks with coarse-grain data and deep path lengths.
In the previous sections we saw that it was possible to select reasonably robust choices for
datapath width or number of instruction contexts given that the other parameter was ﬁxed. We
also saw that the robustness criterion followed the same form; that is, the inefﬁciency overhead
can be bounded near 50% if half of the area is dedicated to instruction memory and half to active
computing resources. This does not, however, yield a single point optimum since the partitioning
of the instructions between more contexts and ﬁner-grain control is handled distinctly in the two
cases.
Figure 9.11, for instance, shows the yield for a single design point, $w$ = 8, $c$ = 64, across
varying task path lengths and datapath requirements. While the task width 8 and path length 64 cross-sections
are moderately robust, the efficiencies at the extremes are low. At a path length of 1 and a task width of 1, the efficiency
is just under 8%, and at a path length of 1024 and a task width of 128, the efficiency is just over 8%. This design
point is, nonetheless, more robust across the whole space than either of the architectures shown in
Figures 9.9 and 9.10.
[Figure 9.9 plot: efficiency (0.2 to 1.0) versus task path length (1 to 1024) and design (task) $w$ (1 to 128). Model parameters: 4 LUT inputs, tree arity 2, Rent parameter 0.5, $c$ = 1, $w$ = 1, 16384 bit processing elements.]
Figure 9.9: Efficiency for Conventional FPGA Design Point ($w$ = 1, $c$ = 1)
[Figure 9.10 plot: efficiency (0.2 to 1.0) versus task path length (1 to 1024) and design (task) $w$ (1 to 128). Model parameters: 4 LUT inputs, tree arity 2, Rent parameter 0.5, $c$ = 1024, $w$ = 64, 16384 bit processing elements.]
Figure 9.10: Efficiency for Coarse-Grain, Deep Memory Design Point ($w$ = 64, $c$ = 1024)
[Figure 9.11 plot: efficiency (0.2 to 1.0) versus task path length (1 to 1024) and design (task) $w$ (1 to 128). Model parameters: 4 LUT inputs, tree arity 2, Rent parameter 0.5, $c$ = 64, $w$ = 8, 16384 bit processing elements.]
Figure 9.11: Efficiency for Fixed $w$ = 8, $c$ = 64
9.6 Summary
The area model shows us how peak capacity depends on granularity organization and instruction
support. We see that the penalty for fine-granularity is moderate, 2.5× difference between $w$ = 1
and $w$ = 128, in the configurable domain where there is only instruction memory for a single
context. The penalty is large, 36×, in the heavy multicontext domain. We also looked at the effects
of application granularity and path length. In both cases, we found that, given a priori knowledge
of either the task granularity or context requirements, we could set the other parameter such that
the efficiency did not drop significantly below 50% for any choice of the unknown parameter. This
is significant since the peak performance densities across the range explored differed by roughly a
factor of 200×. For both of these cases, the robust selection criterion is to choose the free parameter
such that instruction memory accounts for one half of the processing cell area. We saw that the
effects of granularity and path length mismatches were cumulative and that FPGAs running tasks
suited for deep memory, coarse-grained architectures can be only 1% efficient. If we must select
both the datapath granularity and the number of contexts obliviously, we cannot obtain a single
design point with as robust a behavior as when we only had one free parameter. A good design
point across this region of the RP-space suffers a 13× worst-case overhead.
Part IV
New Architectures
10. Dynamically Programmable Gate Arrays
In Chapter 9 we demonstrated that if we settle on a single word width, $w$, we can select a robust
context depth, $c$, such that the area required to implement any task on the architecture with fixed
$c$ is at most $2\times$ the area of using an architecture with optimal $c$. Further, for single bit
granularities, $w = 1$, the model predicted a robust context depth $c = 16$. In contrast, the primary,
conventional, general-purpose devices with independent, bit-level control over each bit-processing
unit are Field-Programmable Gate Arrays (FPGAs), which have $c = 1$. Our analysis from Chapter 9
suggests that we can often realize more compact designs with multicontext devices. Figure 10.1
shows the yielded efficiency of a 16-context, single-bit granularity device for comparison with
Figure 9.9, emphasizing the broader range of efficiency for these multicontext devices.
In this chapter, we introduce Dynamically Programmable Gate Arrays (DPGAs), ﬁne-grained,
multicontext devices which are often more area efﬁcient than FPGAs. The chapter features:
• a characterization of where DPGAs are most area efficient and why
• a detailed prototype DPGA implementation
• design automation for two realms of DPGA application: (1) levelized circuit evaluation and
  (2) Finite-State Machine mapping
• an identification of major, pragmatic limitations to achieving the full benefits which look
  possible in theory
Figure 10.1: Efficiency for DPGA Design Point ($w = 1$, $c = 16$) [plots of efficiency versus path length and design $w$]
10.1 DPGA Introduction
The DPGA is a multicontext ($c > 1$), fine-grained ($w = 1$), computing device. Initially, we
assume a single control stream ( 1). Each compute and interconnect resource has its own,
small, memory for describing its behavior (See Figure 10.2). These instruction memories are read
in parallel whenever a context (instruction) switch is indicated.
The DPGA exploits two facts:
1. The description of an operation is much smaller than the active area necessary to perform the
operation.
2. It is seldom necessary to evaluate every gate or bit computation in a design simultaneously
in order to achieve the desired task latency or throughput.
To illustrate the issue, consider the task of converting an ASCII Hex digit into binary. Figure 10.3
describes the basic computation required. Assuming we care about the latency of this operation, a
mapping which minimizes the critical path length using SIS [SSL+92] and Chortle [Fra92] has a
Figure 10.2: LUT and Interconnect Primitives for Multicontext FPGA [diagram: context memory with context ID decode, logic and interconnect descriptions, fixed logic, and a representative area breakdown]
    if (c >= 0x30 && c <= 0x39)         /* '0'..'9' */
        res = c - 0x30;
    else if (c >= 0x41 && c <= 0x46)    /* 'A'..'F' */
        res = c - 0x41 + 10;
    else if (c >= 0x61 && c <= 0x66)    /* 'a'..'f' */
        res = c - 0x61 + 10;
    else
        res = 0;
Figure 10.3: ASCII Hex → Binary Task Description
INORDER = C[7] C[6] C[5] C[4] C[3] C[2] C[1] C[0] ;
OUTORDER = O[3] O[2] O[1] O[0] ;
# stage 1 – 8 LUTs [C[3:0] pass through]
i0 = !C[1] * !C[2] ;
i1 = C[4] * C[5] * !C[6] * !C[7] ;
i3 = C[0] * C[1] * !C[2] ;
i4 = !C[3] * !C[4] * C[6] * !C[7] ;
i6 = !C[0] * C[2] ;
i7 = !C[0] * C[1] ;
i8 = C[0] * !C[1] ;
i11 = !C[7] * C[6] * !C[4] * !C[3] ;
# stage 2 – 9 LUTs [i1,C[3],C[1] pass through]
i5 = i0 * i1 + i3 * i4 ;
i9 = i6 * i4 + i7 * i4 + i8 * i4 ;
i10 = C[3] + i3 * i4 ;
i12 = i3 * i4 + i6 * i4 ;
i13 = i1 * !C[3] * C[2] ;
i14 = C[2] * !C[1] * i11 ;
i15 = i8 * i4 + i7 * i4 ;
i16 = i7 * i4 + i6 * i4 ;
i17 = i1 * !C[3] * C[0] + C[0] * i0 * i1 ;
# stage 3 – 4 LUTs
O[3] = (i10+i9)*(i5+i9) ;
O[2] = i12 + i13 + i14 ;
O[1] = i1 * !C[3] * C[1] + i15 ;
O[0] = i16 + i17 ;
Figure 10.4: 4-LUT Mapping of ASCII Hex → Binary [equations above; circuit topology diagram with inputs c0–c7, intermediate nodes i0–i17, and outputs o0–o3]
path length of 3 and requires 21 4-LUTs. Figure 10.4 shows the LUT mapping both in equations
and circuit topology.
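As a cross-check on the mapping, the following Python sketch evaluates the three LUT stages of Figure 10.4 bit by bit and compares the result against a behavioral reference. It is an illustrative sketch, not part of the original tool flow: the function names are hypothetical, and the reference assumes the letters occupy codes 0x41–0x46 and 0x61–0x66, which is the behavior the LUT equations themselves implement.

```python
def hex_to_binary_reference(c: int) -> int:
    """Behavioral reference: value of an ASCII hex digit, 0 for any other code."""
    if 0x30 <= c <= 0x39:                        # '0'..'9'
        return c - 0x30
    if 0x41 <= c <= 0x46 or 0x61 <= c <= 0x66:   # 'A'..'F' or 'a'..'f'
        return (c & 0x0F) + 9
    return 0


def hex_to_binary_luts(c: int) -> int:
    """Evaluate the 21 4-LUT equations of the mapping on one 8-bit input."""
    C = [(c >> k) & 1 for k in range(8)]   # C[0] is the least significant bit

    def n(bit: int) -> int:                # logical NOT on a 0/1 value
        return 1 - bit

    # stage 1 (8 LUTs)
    i0 = n(C[1]) & n(C[2])
    i1 = C[4] & C[5] & n(C[6]) & n(C[7])
    i3 = C[0] & C[1] & n(C[2])
    i4 = n(C[3]) & n(C[4]) & C[6] & n(C[7])
    i6 = n(C[0]) & C[2]
    i7 = n(C[0]) & C[1]
    i8 = C[0] & n(C[1])
    i11 = n(C[7]) & C[6] & n(C[4]) & n(C[3])

    # stage 2 (9 LUTs)
    i5 = (i0 & i1) | (i3 & i4)
    i9 = (i6 & i4) | (i7 & i4) | (i8 & i4)
    i10 = C[3] | (i3 & i4)
    i12 = (i3 & i4) | (i6 & i4)
    i13 = i1 & n(C[3]) & C[2]
    i14 = C[2] & n(C[1]) & i11
    i15 = (i8 & i4) | (i7 & i4)
    i16 = (i7 & i4) | (i6 & i4)
    i17 = (i1 & n(C[3]) & C[0]) | (C[0] & i0 & i1)

    # stage 3 (4 LUTs)
    o3 = (i10 | i9) & (i5 | i9)
    o2 = i12 | i13 | i14
    o1 = (i1 & n(C[3]) & C[1]) | i15
    o0 = i16 | i17
    return (o3 << 3) | (o2 << 2) | (o1 << 1) | o0


# Compare the LUT network against the reference on every 8-bit input.
mismatches = [c for c in range(256)
              if hex_to_binary_luts(c) != hex_to_binary_reference(c)]
```

Each assignment corresponds to one 4-LUT, so the three groups also make the stage sizes (8, 9, and 4 LUTs) and the three-level critical path easy to see.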
Traditional Pipelining for Throughput If we cared only about achieving the highest throughput,
we would fully pipeline this implementation such that it took in a new character on each cycle and
output its encoding three cycles later. This pipelining would require an additional 7 LUTs to pipeline
data which is needed more than one pipeline stage after being generated (i.e. 4 to retime c[3:0]
for presentation to the second stage and 3 to retime c[3], c[1], and i1 for presentation to the
final stage – See Figure 10.5). Consequently, we effectively evaluate a design with 21
4-LUTs with 28 physical 4-LUTs. Typical LUT delay, including a moderate amount of local
interconnect traversal, is 7 ns (See Table 4.13). Assuming this is the only limit to cycle time, the
implementation could achieve 140 MHz operation. Notice that the only reason we had to have any
more LUTs or LUT descriptions than strictly required by the task description was in order to perform
signal retiming based on the dependency structure of the computation. Using the FPGA area
model from the previous chapter, an FPGA LUT in a large array occupies $880\mathrm{K}\lambda^2$.
Consequently, this implementation requires:
$28 \times 880\mathrm{K}\lambda^2 \approx 24.6\mathrm{M}\lambda^2$
Multicontext Implementation – Temporal Pipelining If, instead, we cared about the latency,
but did not need 140 MHz operation, we could use a multicontext device with 3 LUT descriptions
per active element ($c = 3$). To achieve the target latency of 3 LUT delays, we need to have
enough active LUTs to implement the largest stage – the middle one. If the inputs are arriving
from some other circuit which is also operating in multicontext fashion, we must retime them as
before (Figure 10.5). Consequently, we require 3 extra LUTs in the largest stage, making for a total
of 12 active LUTs ($N_{active} = 12$). Note that the 4 retiming LUTs added to stage 1 also bring its
total LUT usage up to 12 LUTs. We end up implementing $N_{total} = 21$ LUTs of logic with
$N_{active} = 12$ and $c = 3$. If c[7:0] were inputs which did not change during these three cycles,
we would only need one extra retiming LUT in stage 2 for i1, allowing us to use $N_{active} = 10$.
The multicontext LUT is slightly larger due to the extra contexts. Two additional contexts
add $160\mathrm{K}\lambda^2$ to the LUT area, making for $1.04\mathrm{M}\lambda^2$ per LUT. The multicontext
implementation requires:
$12 \times 1.04\mathrm{M}\lambda^2 \approx 12.5\mathrm{M}\lambda^2$
In contrast, a non-pipelined, single-context implementation would require 21 LUTs, for an
area of:
$21 \times 880\mathrm{K}\lambda^2 \approx 18.5\mathrm{M}\lambda^2$
If we assume that we can pipeline the conﬁguration read, the multicontext device can achieve
comparable delay per LUT evaluation to the single context device. The total latency then is 21 ns,
as before. The throughput at the 7 ns clock rate is 48 MHz. If we do not pipeline the conﬁguration
read, as was the case for the DPGA prototype (Section 10.4), the conﬁguration read adds another
2.5 ns to the LUT delay, making for a total latency of 28.5 ns and a throughput of 35 MHz.
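The area and timing figures quoted for the three implementations can be reproduced with a few lines of arithmetic. The Python sketch below is only an illustration of that bookkeeping: the constant and function names are hypothetical, and the per-context increment of 80K λ² is an assumption derived from the statement that two additional contexts add 160K λ².

```python
FPGA_LUT_AREA = 880_000           # lambda^2 for a single-context 4-LUT (from the text)
AREA_PER_EXTRA_CONTEXT = 80_000   # lambda^2; assumed as 160K / 2 extra contexts
LUT_DELAY_NS = 7.0                # typical LUT plus local interconnect delay
CONFIG_READ_NS = 2.5              # unpipelined configuration read (prototype)
PATH_LENGTH = 3                   # LUT levels in the mapped converter


def lut_area(contexts: int) -> int:
    """Area of one LUT that holds `contexts` on-chip configurations."""
    return FPGA_LUT_AREA + (contexts - 1) * AREA_PER_EXTRA_CONTEXT


def total_area(physical_luts: int, contexts: int) -> int:
    """Area of an implementation using `physical_luts` active LUTs."""
    return physical_luts * lut_area(contexts)


def rate_mhz(period_ns: float) -> float:
    """Results per second, in MHz, when one result completes every period_ns."""
    return 1e3 / period_ns


pipelined_area = total_area(28, contexts=1)      # fully pipelined FPGA
multicontext_area = total_area(12, contexts=3)   # 3-context device
flat_area = total_area(21, contexts=1)           # non-pipelined FPGA

pipelined_rate = rate_mhz(LUT_DELAY_NS)
multicontext_rate = rate_mhz(PATH_LENGTH * LUT_DELAY_NS)
slow_read_latency = PATH_LENGTH * (LUT_DELAY_NS + CONFIG_READ_NS)
slow_read_rate = rate_mhz(slow_read_latency)

for name, area in [("pipelined", pipelined_area),
                   ("multicontext", multicontext_area),
                   ("non-pipelined", flat_area)]:
    print(f"{name:>14}: {area / 1e6:.2f}M lambda^2")
```

Under these assumptions the multicontext implementation occupies about half the area of the fully pipelined one, while delivering roughly one third of its throughput.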
General Observations We were able to realize this area savings because the single context device
had to deploy active compute and interconnect area for each portion of the task even though the
task only required a smaller number of active elements at any point in time. In general, we have
two components which combine to define the requisite area for a computational device:
1. $N_{total}$ – the total number of 4-LUTs in the design – the descriptive complexity
2. $N_{active}$ – the total number of 4-LUTs which must be evaluated simultaneously in order to achieve
the desired task time or computational throughput – the parallelism required to achieve the
temporal requirements
In an ideal packing, a computation requiring $N_{active}$ active compute elements and $N_{total}$ total 4-LUTs
can be implemented in area:
$A = N_{active} \cdot A_{base} + N_{total} \cdot A_{ctx}$    (10.1)
Figure 10.5: ASCII Hex → Binary Circuit Retimed for Full Pipelining [circuit diagram with retiming LUTs added for c0–c3 ahead of stage 2 and for c3, c1, and i1 ahead of stage 3]
Equation 10.1 is a simplification of our area model (Equation 9.1). Using the typical values
suggested in the previous chapter:
$A_{base} = 800\mathrm{K}\lambda^2$    (10.2)
$A_{ctx} = 78\mathrm{K}\lambda^2$    (10.3)
In practice, a perfect packing is difficult to achieve due to connectivity and dependency
requirements such that $c \cdot N_{active} > N_{total}$ configuration memories are required. In the previous example, we
saw $c \cdot N_{active} = 3 \times 12 = 36$ for $N_{total} = 21$ due to retiming and packing constraints. In fact, with the
model described so far, retiming requirements prevent us from implementing this task on any fewer
than 12 active LUTs. Retiming requirements are one of the main obstacles to realizing the full,
ideal benefits. We will see retiming effects more clearly when we look at circuit benchmarks in
Section 10.5.
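To make the gap between ideal and realized packing concrete, the sketch below evaluates the simplified area model for the converter example. It is a hedged illustration: the names `A_BASE`, `A_CTX`, `ideal_area`, and `realized_area` are hypothetical labels for the 800K λ² active-element area and the 78K λ² per-context memory area quoted above, and the realized form simply assumes every active LUT carries a full set of context memories.

```python
A_BASE = 800_000   # lambda^2 per active LUT and its interconnect (value from the text)
A_CTX = 78_000     # lambda^2 per stored LUT configuration (value from the text)


def ideal_area(n_active: int, n_total: int) -> int:
    """Ideal packing: one configuration memory per LUT in the design."""
    return n_active * A_BASE + n_total * A_CTX


def realized_area(n_active: int, contexts: int) -> int:
    """Realized device: every active LUT holds `contexts` configurations."""
    return n_active * (A_BASE + contexts * A_CTX)


# Converter example: 21 LUTs of logic on 12 active LUTs with 3 contexts.
ideal = ideal_area(n_active=12, n_total=21)
realized = realized_area(n_active=12, contexts=3)
unused_memories = 3 * 12 - 21          # context slots holding no task logic
packing_overhead = realized - ideal    # area spent on those unused slots
```

In this example the 15 context slots that hold retiming logic or nothing at all account for the entire difference between the two figures, which is modest next to the area of the active LUTs themselves.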
10.2 Related Architectures
Several hardware logic simulators have been built which share a similar execution model to the
DPGA. These designs were generally motivated to reduce the area required to emulate complex
designs and, consequently, took advantage of the fact that task descriptions are small compared
to their physical realizations in order to increase logic density.
The Logic Simulation Machine [BLMR83], and later, the Yorktown Simulation Engine (YSE)
[Den82] were the earliest such hardware emulators. The YSE was built out of discrete TTL and
MOS memories, requiring hundreds of components for each logic processor. Processors had an
8K deep instruction memory ($c = 8192$), 128 bit instructions ( 128, 136 once
processor-to-processor interconnect is included) and produced two results per cycle ( 2).
The YSE design supported arrays of up to 256 processors ( 256), with a single controller
( 1) running the logic processors in lock step, and a full $256 \times 256$, 2-bit wide crossbar
( 1).
The Hydra processor which Arkos Design developed for their Pegasus hardware emulator is
a closer cousin to the DPGA [Mal94]. They integrate 32, 16-context, bit processors on each Hydra
chip ( 32, 16, 1). The logic function is an 8-input NAND with programmable input
inversions.
VEGA uses 1K-2K context memories to achieve a $7\times$ logic description density improvement
over single context FPGAs. At $c = 1024$, VEGA is optimized to be efficient for very large ratios,
: , and can be quite inefficient for regular, high-throughput tasks. With 86, and a
separate controller per processor ( ), Equation 9.1 predicts a $c = 2048$ VEGA
processing element will have an area of $218\mathrm{M}\lambda^2$, which is about $8.5\times$ smaller than the 2048 single
context processing elements which it emulates – so the area savings realized by VEGA is quite
consistent with our area model developed in Chapter 9.
Hydra and VEGA were developed independently and concurrently to the DPGA, which was
ﬁrst described in [BDK94].
Dharma [BCK93, Bha93] was designed to solve the FPGA routing problem. Logic is evaluated
in strict levels similar to the scheme used for circuit evaluation in Section 10.5 with one gate-delay
evaluation per cycle. Dharma is based on a few, monolithic crossbars ( 1) which are reused
at each level of logic. Once gates have been assigned to evaluation levels, the full crossbar makes
placement and routing trivial. While this arrangement is quite beneﬁcial for small arrays, the
scaling rate of the full crossbar makes this scheme less attractive for large arrays, , as we saw in
Section 7.2.1.
10.3 Realm of Application
DPGAs, as with any general-purpose computing device supporting the rapid selection among
instructions, are beneficial in cases where only a limited amount of functionality is needed at any
point in time, and where it is necessary to rapidly switch among the possible functions needed. That
is, if we need all the throughput we can possibly get out of a single function, as in the fully-pipelined
ASCII Hex → Binary converter in Section 10.1, then an FPGA, or other purely spatial reconfigurable
architecture will handle the task efficiently. However, when the throughput requirements from a
function are limited or the function is needed only intermittently, a multicontext device can provide
a more efficient implementation. In this section, we look at several, commonly arising situations
where multicontext devices are preferable to single-context devices, including:
• Tasks with limited throughput requirements
• Latency limited tasks
• Time or data varying logical functions
We also brieﬂy revisit instruction bandwidth to see why partial reconﬁguration, alone, is not an
adequate substitute for many of these tasks.
10.3.1 Limited Throughput Requirements
Often the system environment places limits on the useful throughput for a subtask. As we
saw in the introduction to this chapter, when the raw device supports a higher throughput than that
required from the task, we can share the active resources in time among tasks or among different
portions of the same task.
Relative Processing Speeds Most designs are composed of several sub-components or sub-tasks,
each performing a task necessary to complete the entire application (See Figure 10.6). The overall
performance of the design is limited by the processing throughput of the slowest device. If the
performance of the slowest device is ﬁxed, there is no need for the other devices in the system to
process at substantially higher throughputs.
In these situations, reuse of the active silicon area on the non-bottleneck components can
improve performance or lower costs. If we are getting sufﬁcient performance out of the bottleneck
resource, then we may be able to reduce cost by sharing the gates on the non-bottleneck resources
between multiple “components” of the original design (See Figure 10.7). If we are not getting
sufﬁcient performance on the bottleneck resource and its task is parallelizable, we may be able to
employ underused resources on the non-bottleneck components to improve system performance
without increasing system cost (See Figure 10.8).
Fixed Functional Requirements Many applications have ﬁxed functional requirements. Input
processing on sensor data, display processing, or video processing all have task deﬁned processing
rates which are ﬁxed. In many applications, processing faster than the sample or display rate is not
necessary or useful. Once we achieve the desired rate, the rest of the “capacity” of the device is
Figure 10.6: Typical Multicomponent System [diagram: components A, B, C, D; system throughput 25M Ops/s]
Figure 10.7: Multifunction Component in System [diagram: components C, D and a shared A/B component; system throughput 25M Ops/s]
not required for the function. With reuse of active silicon, the residual processing capacity can be
employed on other computations.
I/O Latency and Bandwidth Device I/O bandwidth often acts as a system bottleneck, limiting
the rate at which data can be delivered to a part. This, in turn, limits the useful throughput we
can extract from the internal logic. Even when the I/O pins are heavily reused (e.g. [BTA93]),
components often have less I/O throughput than they have computational throughput. Reviewing
technology costs, we expect this bottleneck to only get worse over time.
Figure 10.8: Function Distribution in System [diagram: function C distributed across components alongside A, B, and D; system throughput 30M Ops/s]
• Since I/O’s must drive off-chip capacitances, the inherent bandwidth through each pin is often
  lower than the rate implied by the logic cycle time. With on-chip logic speeds scaling faster
  than I/O speeds, this bandwidth gap will only increase as technology advances.
• Handling signals above 30 MHz becomes difficult at the PCB level, requiring more
  expensive packaging and more complex design. On-chip handling of high speed clocks is
  much more manageable.
• With conventional perimeter I/O pads, the number of I/O’s scales as the square root of the
  internal logic area. As device capacity continues to increase, the disparity between internal
  logic real estate and I/O’s provided grows larger.
When data throughput is limited by I/O bandwidth, we can reuse the internal resources to
provide a larger, effective, internal gate capacity. This reuse decreases the total number of devices
required in the system. It may also help lower the I/O bandwidth requirements by grouping larger
sets of interacting functions on each IC.
10.3.2 Latency Limited Designs
Some designs are limited by latency not throughput. Here, high throughput may be unimportant.
Often it is irrelevant how quickly we can begin processing the next datum if that time is shorter than
the latency through the design. This is particularly true of applications which must be serialized
for correctness (e.g. atomic actions, database updates, resource allocation/deallocation, adaptive
feedback control).
By reusing gates and wires, we can use device capacity to implement these latency limited
operations with fewer resources than would be required without reuse. This will allow us to use
smaller devices to implement a function or to place more functionality onto each device.
Cyclic dependencies Some computations have cyclic dependencies such that they cannot
continue until the result of the previous computation is known. For example, we cannot reuse a
multiplier when performing exponentiation until the previous multiply result is known. Finite state
machines (FSMs) also have the requirement that they cannot begin to calculate their behavior in the
next state, until that state is known. In a purely spatial implementation, each gate or wire performs
its function during one gate delay time and sits idle the rest of the cycle. Active resource reuse is
the most beneficial way to increase utilization in cases such as these.
10.3.3 Temporally Varying or Data Dependent Functional Requirements
Another characteristic of ﬁnite state machines is that the computational task varies over time
and as a function of the input data. At any single point in time, only a small subset of the
total computational graph is needed. In a spatial implementation, all of the functionality must
be implemented simultaneously, even though only small subsets are ever used at once. This is a
general property held by many computational tasks.
Many tasks may perform quite different computations based on the kind of data they receive. A
network interface may handle packets differently based on packet type. A computational function
may handle new data differently based on its range. Data objects of different types may require
widely different handling. Rather than providing separate, active resources for each of these
mutually exclusive cases, a multicontext device can use a minimum amount of active resources,
selecting the proper operational behavior as needed.
10.3.4 Multicontext versus Monolithic and Partial Reconﬁguration
Multicontext devices are speciﬁcally tailored to the cases where we need a limited amount of
active functionality at any point in time, but we need to be able to select or change that functionality
rapidly. This rapid switching is necessary to obtain reasonable performance for the kinds of
applications described in this section. This requirement makes reconﬁgurations from a central
memory pool, on or off chip, inadequate.
In this section, we draw out this point, reviewing the application domains identiﬁed in the
previous section. We also look at cases where one can get away without multicontext devices. At
the end of this section, we articulate a reconﬁguration rate taxonomy which allows us to categorize
both device architectures and applications.
Tasks with limited throughput requirements As we discussed in Section 10.3.1, tasks with
limited throughput requirements can be implemented in less area using multicontext devices. If we
placed the configuration contexts off-chip, the context-switch rate would be paced by the limited
bandwidth into configuration memory. Returning to our ASCII Hex → Binary converter, in the three
context case, we would have to reload 12 LUT instructions between contexts 1 and 2, 4 between
contexts 2 and 3, and 12 between contexts 3 and 1. If we assume a 500MB/s RAMBUS I/O port
[Ram93] operating at peak burst performance, we can load one byte/2 ns. The evaluation time
would be:
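$t_{eval} = \left(\frac{12 \cdot n_{ibits}}{8\,\mathrm{b}/2\,\mathrm{ns}} + t_{LUT}\right) + \left(\frac{12 \cdot n_{ibits}}{8\,\mathrm{b}/2\,\mathrm{ns}} + t_{LUT}\right) + \left(\frac{4 \cdot n_{ibits}}{8\,\mathrm{b}/2\,\mathrm{ns}} + t_{LUT}\right)$
Assuming $n_{ibits} = 64$, as in Section 8.2 and Chapter 9, and $t_{LUT} = 7$ ns:
$t_{eval} = \left(\frac{12 \cdot 64}{8\,\mathrm{b}/2\,\mathrm{ns}} + 7\,\mathrm{ns}\right) + \left(\frac{12 \cdot 64}{8\,\mathrm{b}/2\,\mathrm{ns}} + 7\,\mathrm{ns}\right) + \left(\frac{4 \cdot 64}{8\,\mathrm{b}/2\,\mathrm{ns}} + 7\,\mathrm{ns}\right)$
$= \underbrace{(192\,\mathrm{ns} + 192\,\mathrm{ns} + 64\,\mathrm{ns})}_{\text{instruction load time}} + \underbrace{(21\,\mathrm{ns})}_{\text{operation time}}$
$= 448\,\mathrm{ns} + 21\,\mathrm{ns}$
$= 469\,\mathrm{ns}$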
Such a solution is simultaneously: (1) over an order of magnitude slower than the multicontext
implementation, which operated at 21-28 ns, and (2) over two orders of magnitude larger when
you consider the 500-700M$\lambda^2$ occupied by a 4Mb RAMBUS DRAM. Arguably, the DRAM could
be smaller than 4Mb, but it is not economical to build, package, and sell such small memories.
Further, notice that this is a tiny subtask with 12 active LUTs, while reasonably sized FPGAs contain
hundreds to thousands of LUTs, making the reconfiguration time orders of magnitude slower. As
noted in Chapter 8, reconfiguration bandwidth limitations will dictate the rate of operation rather
than the circuit path length.
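The off-chip reload calculation above can be sketched in a few lines of Python. This is an illustrative sketch under the stated assumptions (64 configuration bits per 4-LUT, one byte every 2 ns, 7 ns per LUT evaluation); the function and variable names are hypothetical.

```python
BITS_PER_LUT_INSTRUCTION = 64   # configuration bits per 4-LUT, as assumed in the text
NS_PER_BYTE = 2.0               # 500 MB/s port at peak burst rate
LUT_DELAY_NS = 7.0              # one LUT evaluation


def reload_ns(lut_instructions: int) -> float:
    """Time to stream the given number of LUT instructions from off-chip."""
    return lut_instructions * BITS_PER_LUT_INSTRUCTION / 8 * NS_PER_BYTE


# LUT instructions loaded before each of the three evaluation steps.
loads_per_step = [12, 12, 4]

load_time_ns = sum(reload_ns(n) for n in loads_per_step)
operation_time_ns = LUT_DELAY_NS * len(loads_per_step)
evaluation_time_ns = load_time_ns + operation_time_ns

# Slowdown relative to the on-chip multicontext latency of 21 ns.
slowdown = evaluation_time_ns / operation_time_ns
```

With these numbers the instruction traffic, not the logic, accounts for more than 95% of the evaluation time, which is the point the comparison is making.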
Latency limited tasks The same effect described above occurs in latency limited designs. If we
want to save real-estate by reusing active area, the time to load in the next instruction may pace
operation. Off-chip memory, or an on-chip central memory pool, will suffer from the memory
bandwidth bottleneck just noted.
Data varying logical functions In finite-state machines, or other tasks which may change the
function they perform at each point in time based on the data arriving, this reconfiguration latency
also determines cycle time. Many tasks will exhibit the characteristics identified here – in response
to a new data item, hundreds of LUT instructions must be loaded before the actual task, which may
take only a few LUT delays to evaluate, can be performed.
Infrequent temporal change Of course, if the distinct pieces of functionality required change
only infrequently, and can operationally tolerate long reload latencies, then off-chip reconﬁgura-
tions may be acceptable and efﬁcient. For example, the UCLA conﬁgurable computing system
for automatic target recognition [VSCZ96] takes advantage of the fact that a loaded correlation
conﬁguration can be used against an entire image segment before a new correlation is required.
With $128 \times 128$ pixel images, a complete filter match of a $16 \times 16$ correlation template across the
full image requires roughly $128^2$ correlations amounting to 16K clock cycles on the correlator.
Operating at a 60 ns clock rate, this full correlation takes roughly 1 ms. The conventional FPGA
actually used for the UCLA implementation, a Xilinx XC4010, takes 10 ms to reload its
configuration [Xil94b]. However, as we noted in Section 7.8, the sparse encoding used by conventional
devices makes them excessively slow at reconfiguration. Assuming a RAMBUS reconfiguration
port and 64-bits/4-LUT, the 1600 4-LUTs on the XC4010 can be reloaded in roughly:
$\frac{1600 \times 64}{8\,\mathrm{b}/2\,\mathrm{ns}} = 25.6\,\mu\mathrm{s}$
Here, the reload time is small compared to the loaded context operating time ($25.6\,\mu\mathrm{s} \ll 1$ ms),
such that reload has a small effect on the rate of operation. In fact, as the UCLA paper notes,
when the next context is predictable in advance and the reload time is no longer than the operating
time, a two context FPGA would be able to completely overlap the loading of the next instruction
with the operation in the current configuration.
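A small sketch shows how the reload time is amortized in this quasistatic case. It assumes the figures quoted above (a 60 ns clock, 128² correlations per image, 1600 4-LUTs at 64 bits each, one byte per 2 ns, and the 10 ms reload of the conventional part); the names are hypothetical, and the overhead metric, reload time as a fraction of total time per image, is an illustrative choice rather than one taken from the UCLA work.

```python
CLOCK_NS = 60.0
CORRELATIONS_PER_IMAGE = 128 ** 2      # roughly one correlation per pixel position


def dense_reload_ns(n_luts: int, bits_per_lut: int = 64,
                    ns_per_byte: float = 2.0) -> float:
    """Reload time with a dense encoding streamed at one byte per 2 ns."""
    return n_luts * bits_per_lut / 8 * ns_per_byte


def reload_overhead(reload_ns: float, operate_ns: float) -> float:
    """Fraction of total time spent reloading when reload is not overlapped."""
    return reload_ns / (reload_ns + operate_ns)


operate_ns = CORRELATIONS_PER_IMAGE * CLOCK_NS    # about 1 ms per template
fast_reload_ns = dense_reload_ns(1600)            # 25.6 microseconds
slow_reload_ns = 10e6                             # 10 ms conventional reload

fast_overhead = reload_overhead(fast_reload_ns, operate_ns)
slow_overhead = reload_overhead(slow_reload_ns, operate_ns)
```

Under these assumptions the dense reload costs under 3% of the time per template, whereas the 10 ms reload would dominate at roughly 90%.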
Large-grain, data-dependent blocks Similarly, when performing data dependent computations
and the type of data changes slowly compared to the processing rate, long reconﬁguration times
might be acceptable. For example, a video display which can handle different video data formats
(e.g. PAL, NTSC, MPEG-1, MPEG-2, HDTV), will only have to process and display one kind of
video stream at a time. For human consumption, it will typically display the same data stream for
a long time and the 10’s of milliseconds of latency it may take to load the conﬁguration with the
appropriate display engine would not be noticeable to the human observer.
Minor conﬁguration edits Sometimes conﬁgurations need only minor edits in order to evolve
over time or be properly configured for different data types. For example, an $n$-character text
matching filter may only require the configuration of $n/4$ 4-LUTs to change to handle a different
$n$-character search target. If these only represent a small portion of the entire configuration, the
reconﬁguration can be described as an edit on the existing conﬁguration with less bandwidth than
a full context reload. In cases like this where the edits are small, partial reconﬁguration – the
ability to efﬁciently change small portions of the conﬁguration while leaving the rest in place –
may be adequate to reduce context switch bandwidths sufﬁciently to keep reload latency low. We
see partial reconﬁguration support in modern devices from Plessey [Ple90], Atmel [Atm94], and
Xilinx [Xil96] to support conﬁguration edits such as this.
Reconﬁguration Rate Taxonomy From the above, we see three cases for conﬁguration man-
agement based on the rate at which the task requires distinct pieces of functionality and the rate at
which it is efﬁcient to change the conﬁguration applied to the active processing elements:
1. Static – the configuration does not change within an operational epoch
   • Usage Scenario: Traditional ASIC and FPGA applications where all the functionality
     is needed all the time. Particularly appropriate for throughput limited cases where one
     wants all the throughput one can get out of a device for every function it provides.
   • Architectures: single-context FPGAs
2. Quasistatic – the configuration changes slowly compared to the rate of operation upon data
   • Usage Scenario: Context load time is amortized across long periods of processing
     with the loaded context (e.g. UCLA wireless video [JOSV95], UCLA ATR [VSCZ96],
     BYU DISC [WH95], BYU run-time reconfigurable neural networks [EH94]).
   • Architectures: FPGAs with rapid reconfiguration (e.g. Atmel [Atm94], Xilinx 6200
     [Xil96]) along with traditional, in-circuit reprogrammable FPGAs for very
     coarse-grained tasks
3. Dynamic – configuration changes at the same rate as data, potentially on a cycle-by-cycle
   basis
   • Usage Scenario: Limited active resources are shared among multiple operations to
     extract full usage of the active resources when the task throughput requirements are
     low compared to the potential device throughput (e.g. multicontext circuit evaluation
     introduced above and detailed in Section 10.5, finite-state machine evaluation
     (Section 10.6), interleaved, multifunction components (Section 10.7.1)).
   • Architectures: DPGAs, traditional processor architectures including DSPs and VLIW
     processors, SIMD and Vector array processors
We can further subdivide configuration management capabilities of architecture and application
requirements based on whether they can take advantage of limited bandwidth configuration edits:
1. Atomic – the vector of instructions across the array must change all at once
   • Architectures: Traditional FPGAs (e.g. Xilinx 2K, 3K, 4K, 5K [Xil94b], Altera FLEX
     8K [Alt94]), VLIW processors
2. Non-atomic – small subsets of the array instructions can be changed independently
   • Architectures: FPGAs supporting partial reconfiguration (e.g. Xilinx 6200 [Xil96],
     Atmel [Atm94])
Strictly speaking, the atomicity of configuration changes is orthogonal to the rate of reconfiguration.
For statically conﬁgured applications, the atomicity of reload is irrelevant since the context does
not change. The atomicity is most relevant for quasistatic conﬁguration changes since those are
the cases which beneﬁt from reduced bandwidth requirements. Dynamic architectures can change
their active instruction on a cycle-by-cycle basis so non-atomic changes do not allow an array-
wide context switch to occur any faster. However, edits to the non-active contexts on dynamic
architectures may still beneﬁt from the bandwidth reduction enabled by non-atomic updates.
10.4 A Prototype DPGA
Jeremy Brown, Derrick Chen, Ian Eslick, and Edward Tau started a first-generation
DPGA prototype while they were taking MIT’s introductory VLSI course (6.371) during the Fall
of 1994. The chip was completed during the Spring of 1995 with additional help from Ethan
Mirsky. André DeHon helped the group hash out the microarchitecture and oversaw the project.
The prototype was first presented publicly in [TEC+95]. A project report containing lower level
details is available as [BCE+94].
In this section, we describe this prototype DPGA implementation. The design represents a
ﬁrst generation effort and contains considerable room for optimization. Nonetheless, the design
demonstrates the viability of DPGAs, underscores the costs and beneﬁts of DPGAs as compared
to traditional FPGAs, and highlights many of the important issues in the design of programmable
arrays. The fabricated prototype did have one timing problem which prevented it from functioning
fully, but our post mortem analysis suggests that the problem is easily avoidable.
Our DPGA prototype features:
• 4 on-chip configuration contexts
• DRAM configuration cells
• non-intrusive background loading
• automatic refresh of dynamic memory elements
• wide bus architecture for high-speed context loading
• two-level routing architecture
We begin by detailing our basic DPGA architecture in Section 10.4.1. Section 10.4.2 pro-
vides highlights from our implementation including key details on our prototype DPGA IC. In
Section 10.4.3, we describe several aspects of the prototype’s operation. Section 10.4.4 extracts a
DPGA area model based on the prototype implementation. Section 10.4.5 closes out this section
on the DPGA prototype by summarizing the major lessons from the effort.
10.4.1 Architecture
Figure 10.9 depicts the basic architecture for this DPGA. Each array element is a conventional
4-input lookup table (4-LUT). Small collections of array elements, in this case $4 \times 4$ arrays, are
grouped together into subarrays. These subarrays are then tiled to compose the entire array.
Crossbars between subarrays serve to route inter-subarray connections. A single, 2-bit, global
context identifier is distributed throughout the array to select the configuration for use. Additionally,
programming lines are distributed to read and write configuration memories.
DRAM Memory The basic memory primitive is a $4 \times 32$ bit DRAM array which provides four
context conﬁgurations for both the LUT and interconnection network (See Figure 10.10). The
memory cell is a standard three transistor DRAM cell. Notably, the context memory cells are
built entirely out of N-well devices, allowing the memory array to be packed densely, avoiding the
large cost for N-well to P-well separation. The active context data is read onto a row of standard,
complementary CMOS inverters which drive LUT programming and selection logic.
Figure 10.9: Architecture and Composition of DPGA
Array Element The array element is a 4-LUT which includes an optional flip-flop on its output
(Figure 10.11). Each array element contains a context memory array. For our prototype, this is the
$4 \times 32$ bit memory described above. 16 bits provide the LUT programming, 12 configure the four
8-input multiplexors which select each input to the 4-LUT, and one selects the optional ﬂip-ﬂop.
The remaining three memory bits are presently unused.
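The allocation of the 32 configuration bits can be illustrated with a small decoding sketch. Only the field sizes (16 LUT bits, four 3-bit input selects, one flip-flop select, three unused bits) come from the description above; the bit positions, the class and function names, and the LUT indexing convention are assumptions made purely for illustration and are not the prototype's actual layout.

```python
from typing import NamedTuple, Sequence, Tuple


class ElementConfig(NamedTuple):
    lut_table: int                  # 16-bit truth table for the 4-LUT
    input_selects: Tuple[int, ...]  # four 3-bit settings, one per 8:1 multiplexor
    use_flip_flop: bool             # route the output through the optional flip-flop


def decode_context_word(word: int) -> ElementConfig:
    """Split one 32-bit context word into fields (assumed bit positions)."""
    lut_table = word & 0xFFFF                                       # bits 0-15
    selects = tuple((word >> (16 + 3 * k)) & 0x7 for k in range(4))  # bits 16-27
    use_ff = bool((word >> 28) & 1)                                 # bit 28
    return ElementConfig(lut_table, selects, use_ff)                # bits 29-31 unused


def evaluate_lut(lut_table: int, inputs: Sequence[int]) -> int:
    """Look up the output bit addressed by four 0/1 input values."""
    index = sum(bit << k for k, bit in enumerate(inputs))
    return (lut_table >> index) & 1


# A 4-context memory for one array element: four 32-bit words.
context_memory = [0] * 4
context_memory[2] = 0x8000 | (5 << 16) | (1 << 28)   # 4-input AND, registered

active_context = 2                                   # value of the 2-bit context ID
config = decode_context_word(context_memory[active_context])
```

Selecting a different context identifier simply presents a different word to the same decode path, which is why a context switch costs no more than a memory read.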
Subarrays The subarray organizes the lowest level of the interconnect hierarchy. Each array
element output is run vertically and horizontally across the entire span of the subarray (Figure 10.12).
Each array element can, in turn, select as an input the output of any array element in its subarray
which shares the same row or column. This topology allows a reasonably high degree of local
connectivity.
This leaf topology is limited to moderately small subarrays since it ultimately does not scale.
The row and column widths remains ﬁxed regardless of array size so the horizontal and vertical
interconnect would eventually saturate the row and column channel capacity if the topology were
scaled up. Additionally,the delay on the local interconnect increases with each additional element
in a row or column. For small subarrays, there is adequate channel capacity to route all outputs
across a row and column without increasing array element size, so the topology is feasible and
desirable. Further, theadditionaldelayforthefewelementsintheroworcolumnofasmallsubarray
is moderately small compared to the ﬁxed delays in the array element and routing network. In
general, the subarray size should be carefully chosen with these properties in mind.
Figure 10.10: DRAM Memory Primitive
Non-Local Interconnect In addition to the local outputs which run across each row and column,
a number of non-local lines are also allocated to each row and column. The non-local lines are
driven by the global interconnect (Figure 10.12). Each LUT can then pick inputs from among the
lines which cross its array element. In the prototype, each row and column supports four non-local
lines. Each array element could thus pick its inputs from eight global lines, six row and column
neighbor outputs, and its own output. Each input is conﬁgured with an 8:1 selector as noted above
(Figure 10.11). Of course, not all combinations of 15 inputs taken 4 at a time are available with
this scheme. The inputs are arranged so any combination of local signals can be selected along
with many subsets of global signals. Freedom available at the crossbar in assigning global lines to
tracks reduces the impact of this restriction, but complicates placement.
Local Decode Row select lines for the context memories are decoded and buffered locally from
the 2-bit context identifier. A single decoder services each row of array elements in a subarray. One
decoder also services the crossbar memories for four of the adjacent crossbars. In our prototype,
this placed ﬁve decoders in each subarray, each servicing four array element or crossbar memory
blocks for a total of 128 memory columns. Each local decoder also contains circuitry to refresh the
DRAM memory on contexts which are not being actively read or written.
Figure 10.11: Array Element
Figure 10.12: Subarray Local Interconnect
Figure 10.13: Inter Subarray Interconnect
Global Interconnect Between each subarray a pair of crossbars route the subarray outputs from
one subarray into the non-local inputs of the adjacent subarray. Note that all array element outputs
are available on all four sides of the subarray. In our prototype, this means that each crossbar is a
16×8 crossbar which routes 8 of the 16 outputs to the neighboring subarray’s 8 inputs on that side
(Figure 10.13). Each 16×8 crossbar is backed by a 4×32 DRAM array to provide the 4 context
configurations. Each crossbar output is configured by decoding 4 configuration bits to select among
the 16 crossbar input signals.
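As a rough illustration of this encoded crossbar control, the sketch below splits one 32-bit context word into eight 4-bit selectors and applies them to 16 input signals. The assignment of bits 4k..4k+3 to output k and the function names are assumptions; the text states only that each output decodes 4 bits to pick one of 16 inputs.

```python
CROSSBAR_INPUTS = 16
CROSSBAR_OUTPUTS = 8
SELECT_BITS = 4  # 4 bits are enough to name one of 16 inputs


def decode_crossbar_word(word: int) -> list:
    """Split a 32-bit context word into eight 4-bit input selectors.

    ASSUMED layout: output k is controlled by bits 4k..4k+3.
    """
    return [(word >> (SELECT_BITS * k)) & 0xF for k in range(CROSSBAR_OUTPUTS)]


def route(word: int, inputs: list) -> list:
    """Return the eight crossbar outputs for the given 16 input values."""
    if len(inputs) != CROSSBAR_INPUTS:
        raise ValueError("crossbar expects 16 input signals")
    return [inputs[select] for select in decode_crossbar_word(word)]
```

Eight outputs at 4 bits each fill exactly one 32-bit row of the 4×32 memory, which is why the same memory block can serve both the array element and the crossbar.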
While the nearest neighbor interconnect is sufficient for the 3×3 array in the prototype, a
larger array should include a richer interconnection scheme among subarrays. At present, we
anticipate that a mesh with bypass structure with hierarchically distributed interconnect lines will
be appropriate for larger arrays.
Programming The programming port makes the entire array look like one large, 32-bit wide,
synchronous memory. The programming interface was designed to support high-bandwidth data
transfer from an attached processor and is suitable for applications where the array is integrated on
the processor die. Any non-active context may be written during operation. Read back is provided
in the prototype primarily for veriﬁcation.
Technology                 1μ CMOS, 3 metal
Subarray Area              1750μ × 1460μ = 2.6Mμ² (10.2Mλ²)
LUTs/subarray              16
LUT inputs                 4
Array Element Area         640Kλ²
Contexts                   4
Configuration Bits/LUT     40
Context Memory Area/LUT    24Kλ²
Subarrays                  9
Typical Cycle              9.5 ns
Table 10.1: DPGA Prototype Implementation Characteristics
Unit                      Size               Composition
Die                       6.8mm × 6.8mm      Core with pads
Core                      5.6mm × 4.7mm      All internal logic except pads
Array Core                5.25mm × 4.4mm     3×3 subarrays including crossbars (no pads)
Subarray+crossbar tile    1460μ × 1750μ      Subarray + 4 adjacent crossbars and memory
Crossbar (Xbar)           495μ × 270μ        16×8 Crossbar including memory
Local Decode (LD)         253μ × 167μ
Array Element (AE)        275μ × 240μ        Includes local routing channels
Table 10.2: Basic Component Sizes for Prototype
10.4.2 Implementation
The DPGA prototype is targeted for a 1μ drawn, 0.85μ effective gate length CMOS process with
3 metal layers and silicided polysilicon and diffusion. Table 10.1 highlights the prototype’s major
characteristics. Figure 10.14 shows the fabricated die, and Figure 10.15 shows a closeup of the
basic subarray tile containing a 4×4 array of LUTs and four inter-subarray crossbars. Table 10.2
summarizes the areas for the constituent parts.
Table 10.3 breaks down the chip area by consumers. In Table 10.3, configuration memory is
divided between that supporting the LUT programming and that supporting interconnect. All
together, the configuration memory accounts for 33% of the total die area or 40% of the area
used on the die. The network area, including local interconnect, wiring, switching, and network
configuration area accounts for 66% of the die area or 80% of the area actually used on the die.
Leaving out the configuration memory, the fixed portion of the interconnect area is 45% of the total
area or over half of the active die area.
Layout Inefficiencies The prototype could be packed more tightly since it has large blank areas
and large areas dedicated to wire routing. A more careful co-design of the interconnect and subarray
resources would eliminate much or all of the unused space between functional elements. Most of
the dedicated wiring channels are associated with the local interconnect within a subarray. With
careful planning, it should be possible to route all of these wires over the subarray cells in metal 2
and 3. As a result, a careful design might be 40-50% smaller than our first generation prototype.
Figure 10.14: Annotated Die Photo of DPGA Prototype (annotations: die 7.1mm × 6.8mm;
Subarray+Xbar tile 1750um × 1460um; Xbar 495um × 270um; AE 275um × 240um; LD 253um × 167um)
Memory
Area From the start, we suspected that memory density would be a large determinant of array
size. Table 10.3 demonstrates this to be true. In order to reduce the size of the memory, we
employed a 3 transistor DRAM cell design as shown in Figure 10.10. To keep the aspect ratio on
the 4×32 memory small, we targeted a very narrow DRAM column (See Figure 10.16).
Figure 10.15: Photo of DPGA Subarray and Crossbar Tile
Function    Elements          Percent
Logic       Total             16
            Memory array      10
            Memory decode     3
            Fixed Logic       3
Network     Total             66
            Memory array      15
            Memory decode     5
            Switching         19
            Wiring            27
Blank                         18
Total                         100
Table 10.3: Array Core Area Breakdown by Programmable Function
Figure 10.16: Plot of Array Element with Configuration Memory
Unfortunately, this emphasis on aspect ratio did not allow us to realize the most area efficient
DRAM implementation (See Table 10.4). In particular, our DRAM cell was 7.6μ × 19.2μ, or almost
600λ². A tight DRAM should have been 75-80μ², or about 300λ². Our tall and thin DRAM was
via and wire limited and hence could not be packed as area efficiently as a more square DRAM cell.
One key reason for targeting a low aspect ratio was to balance the number of interconnect
channels available in each array element row and column. However, with 8 interconnect signals
currently crossing each side of the array element, we are far from being limited by saturated
interconnect area. Instead, array element cell size is largely limited by memory area. Further, we
route programming lines vertically into each array element memory. This creates an asymmetric
need for interconnect channel capacity since the vertical dimension needs to support 32 signals
while the horizontal dimension need only support a dozen memory select and control lines.
For future array elements we should optimize memory cell area with less concern about aspect
ratio. In fact, the array element memory can easily be split in half with 16 bits above the ﬁxed
logic in the array element and 16 below. This rearrangement will also allow us to distribute only
16 programming lines to each array element if we load the top and bottom 16 bits separately. This
revision does not sacrifice total programming bandwidth if we load the top or bottom half of a pair
of adjacent array elements simultaneously.
Element          #    Size
DRAM Cell        4    7.6μ × 19.2μ
Output Buffer    1    7.6μ × 28.0μ
Pass Gates       1    7.6μ × 26.4μ
Column                7.6μ × 131.2μ
Table 10.4: DRAM Column Breakdown
Function           Percent of Total    Percent of Memory
Memory decode      8                   25
Memory cells       15                  44
Buffer and gate    10                  31
Total              33                  100
Table 10.5: Memory Area Breakdown
Table 10.5 further decomposes memory area percentages by function. We have already noted
that a tight DRAM cell would be half the area of the prototype DRAM cell and an SRAM cell would
be twice as large. Using these breakdowns and assuming commensurate savings in proportion to
memory cell area, the tight DRAM implementation would save about 7% total area over the
current design. An SRAM implementation would be, at most, 15% larger. In practice, the SRAM
implementation would probably be only 5-10% larger for a 4-context design since the refresh
control circuitry would no longer be needed. Of course, as one goes to greater numbers of contexts,
the relative area differences for the memory cells will provide a larger contribution to overall die
size.
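The 7% and 15% figures follow directly from the memory cell share in Table 10.5. A minimal sketch of that arithmetic, assuming that only the memory cell area changes and everything else stays fixed (the function name is illustrative):

```python
MEMORY_CELL_SHARE = 0.15  # memory cells as a fraction of total area (Table 10.5)


def total_area_change(cell_area_scale: float,
                      cell_share: float = MEMORY_CELL_SHARE) -> float:
    """Fractional change in total area when only the memory cells are scaled.

    cell_area_scale is the new cell area relative to the prototype cell:
    0.5 for the tight DRAM cell, 2.0 for an SRAM cell.
    """
    return cell_share * (cell_area_scale - 1.0)


tight_dram_change = total_area_change(0.5)  # negative: area saved
sram_change = total_area_change(2.0)        # positive: area added

print(f"tight DRAM: {tight_dram_change:+.1%}, SRAM: {sram_change:+.1%}")
```

This simple model ignores the refresh circuitry that an SRAM design could drop, which is why the text expects the practical SRAM penalty to be below the 15% bound.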
Memory Timing The memory in the fabricated prototype suffered from a timing problem due to
the skew between the read precharge enable and the internal write enable. As shown in Figure 10.10,
the read bus is precharged directly on the high edge of the clock signal clk. The internal write
enable, iwe, controls write-back during refresh. iwe and the write enable signals, we<4:0>, are
generated by the local decoder and driven across an entire row of four array elements in a subarray,
which makes for a 128-bit wide memory. Both iwe and we<4:0> are pipelined signals which
transition on the rising edge of clk. On the rising edge of clk, we have a race between the turn
on of the precharge transistor and the turn off of iwe and we<4:0>. Since clk directly controls
the precharge transistor, precharge begins immediately. However, since iwe and we<4:0> are
registered, it takes a clock-to-q delay before they can begin to change. Further, since there are
128 consumers spread across 1100μ, the signal propagation time across the subarray is non-trivial.
Consequently, it is possible for the precharge to race through write enables left on at the end of
a previous cycle and overwrite memory. This problem is most acute for the memories which are
farthest from the local decoders.
Figure 10.17: Plot of Crossbar with Configuration Memory
Empirically, we noticed that the memories farthest from the local decoder lost their values after
short time periods. In the extreme cases of the input and output pads, which were often very far from
their configuration memories, the programmed values were overwritten almost immediately. The
memories closer to the local decoder were more stable. The array elements adjacent to the decoders
were generally quite reliable. After identifying this potential failure mode, we simulated explicit
skew between clk and the write enables in SPICE. In simulation, the circuit could tolerate about
1.5 ns of skew between clk and the write enables before the memory values began to degrade.
We were able to verify that refresh was basically operational. By continually writing to a single
context, we can starve other contexts from ever refreshing. When we forced the chip into this
mode, data disappeared from the non-refreshed memories very quickly. The time constant on this
decay was significantly different from the time constants observed due to the timing decay, giving
us confidence that the basic refresh scheme worked aside from the timing skew problem.
Obviously, the circuit should have been designed to tolerate this kind of skew. A simple and
robust solution would have been to disable the refresh inverter or the writeback path directly on
clk to avoid simultaneously enabling both the precharge and writeback transistors. Alternately,
the precharge could have been gated and distributed consistently with the write enables.
Path                                   Delay (slow-speed)    Delay (nominal)
CLK → configuration memory stable      4 ns                  2.5 ns
CLK → XBAR out                         8.5 ns                5 ns
XBAR in → XBAR out                     4.5 ns                2.5 ns
LUT in → LUT output (1 level)          9 ns                  3.5 ns
CLK → CLK (maximum, DRAM leakage)      200 ns
Table 10.6: Estimated Timings
Crossbar Implementation
To keep configuration memory small, the crossbar enables were stored encoded in configuration
memory then decoded for crossbar control. The same 4×32 memory used for the array element
was used to control each 16×8 crossbar. Note that the entire memory is 128Kλ². The crossbar
itself is 535Kλ², making the pair 660Kλ². Had we not encoded the crossbar controls, the crossbar
memory alone would have occupied 512Kλ² before we consider the crossbar itself. This suggests
that the encoding was marginally beneficial for our four context case, and would be of even greater
benefit for greater numbers of contexts. For fewer contexts, the encoding would not necessarily be
beneficial.
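The crossbar memory figures can be checked with a short calculation. The sketch below assumes that memory area scales linearly with the number of configuration bits per context, at the 4Kλ² per four-context column implied by 128Kλ² for 32 columns; the function names are illustrative, and the extra decode logic that encoding requires is not modeled.

```python
import math

# 128 Klambda^2 for a 32-column, 4-context memory gives 4 Klambda^2 per column.
COLUMN_AREA_KLAMBDA2 = 128 / 32


def config_bits(inputs: int, outputs: int, encoded: bool) -> int:
    """Configuration bits per context for an inputs x outputs crossbar."""
    if encoded:
        # each output stores the binary index of the input it selects
        return outputs * math.ceil(math.log2(inputs))
    # unencoded: one enable bit per crosspoint
    return outputs * inputs


def memory_area_klambda2(bits_per_context: int) -> float:
    """Four-context memory area, assuming area is linear in column count."""
    return bits_per_context * COLUMN_AREA_KLAMBDA2


encoded_area = memory_area_klambda2(config_bits(16, 8, encoded=True))
unencoded_area = memory_area_klambda2(config_bits(16, 8, encoded=False))
print(encoded_area, unencoded_area)
```

Under these assumptions the encoded memory comes to 128Kλ² and the unencoded memory to 512Kλ², matching the figures quoted in the text.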
Timing
Table 10.6 summarizes the key timing estimates for the DPGA prototype at the slow-speed
and nominal process points. As shown, context switches can occur on a cycle-by-cycle basis and
contribute only a few nanoseconds to the operational cycle time. Equation 10.4 relates minimum
achievable cycle time to the number of LUT delays and crossbar crossings in the critical
path of a design.
(10.4)
These estimates suggest a heavily pipelined design which placed only one level of lookup table
logic and one crossbar traversal in each pipeline stage could achieve 60-100MHz
operation allowing for a context switch on every cycle. Our prototype, however, does not have a
suitably aggressive clocking, packaging, or i/o design to actually sustain such a high clock rate.
DRAM refresh requirements force a minimum operating frequency of 5MHz.
Pipelining Two areas for pipelining are worth considering. Currently, the context memory read
time happens at the beginning of each cycle. In many applications, the next context is predictable
and the next context read can be pipelined in parallel with operation in the current context. This
pipelining can hide the additional context read latency. Also, notice that the inter-subarray crossbar
delay is comparable to the LUT plus local interconnect delay. For aggressive implementations,
allowing the non-local interconnect to be pipelined will facilitate small microcycles and very high
throughput operation. Pipelining both the crossbar routing and the context reads could potentially
allow a 3-4 ns operational cycle.
10.4.3 Component Operation
Inter-context Communication The only method of inter-context communication for the prototype
is through the array element output register. That is, when a succeeding context wishes to
use a value produced by the immediately preceding context, we enable the register output on the
associated array element in the succeeding context (See Figure 10.11). When the clock edge occurs
signaling the end of the preceding cycle, the signal value is latched into the output register and the
new context programming is read. In the new context, the designated array element output now
provides the value stored from the previous context rather than the value produced combinationally
by the associated LUT. This, of course, makes the associated LUT a logical choice to use to produce
values for the new context’s succeeding context since it cannot be used combinationally in the new
context, itself.
In the prototype, the array element output register is also the only means of state storage.
Consequently, it is not possible to perform orthogonal operations in each context and preserve
context-dependent state.
Note that a single context which acts as a shift register can be used to snapshot, ofﬂoad, and
reload the entire state of a context. In an input/output minimal case, all the array elements in the
array can be linked into a single shift register. Changing to the shift register context will allow
the shift register to read all the values produced by the preceding context. Clocking the device in
this context will shift data to the output pin and shift data in from the input pin. Changing from
this context to an operating context which registers the needed inputs will insert loaded values for
operation. Such a scheme may be useful to take snapshots during debugging or to support context
switches where it is important to save state. If only a subset of the array elements in the array
produce meaningful state values, the shift register can be built out of only those elements. If more
input/output signals can be assigned to data onload and offload, a parallel shift register can be built,
limiting the depth and hence onload/ofﬂoad time.
Context Switching Context switches are signaled by a context strobe. If context strobe is asserted
at a clock edge, a context read occurs. If context strobe is not asserted, the component remains in
the same context.
DRAM Refresh DRAM memory is refreshed under one of two conditions:
1. Context Read – Whenever a context is read, that context will be refreshed.
2. Clocked in Same Context – Whenever a clock cycle occurs but the context strobe is not
asserted and there is no read or write to any of the memories serviced by a particular local
decoder, the “next” context is refreshed. Each local decoder maintains a modulo four counter
which it increments each time it is able to perform a context refresh in this manner. If the
array stays in the same context for more than four cycles, every fourth cycle, the active context
value is refreshed through we<4> (See Figure 10.10).
This refresh scheme does place some restrictions on the context sequencing, but it allows most
common patterns. In particular, proper refresh occurs if we:
• continually cycle through all contexts, switching on each clock cycle
• stay in each context for several clock cycles
If one continually changes contexts on every clock cycle and only walks through a small subset
of the entire set of contexts, the non-visited contexts will be starved from refresh. For example,
switching continually between context zero and context one would prevent contexts two and three
from ever getting a refresh.
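The refresh rules above can be exercised with a small behavioral model. This is a simplified sketch rather than the prototype's decoder logic: it assumes one local decoder, ignores background writes, and treats every non-strobe cycle as free for a counter-driven refresh. The function names are illustrative.

```python
NUM_CONTEXTS = 4


def simulate_refresh(schedule, num_contexts=NUM_CONTEXTS):
    """Return the cycle of each context's most recent refresh (None if never).

    schedule holds one (strobe, context) pair per clock cycle. With strobe
    True, the named context is read and therefore refreshed. With strobe
    False, the decoder refreshes the context named by its modulo counter
    and then increments the counter.
    """
    last_refresh = [None] * num_contexts
    counter = 0
    for cycle, (strobe, context) in enumerate(schedule):
        if strobe:
            last_refresh[context] = cycle
        else:
            last_refresh[counter] = cycle
            counter = (counter + 1) % num_contexts
    return last_refresh


def starved_contexts(schedule, num_contexts=NUM_CONTEXTS):
    """Contexts that never receive a refresh under the given schedule."""
    last = simulate_refresh(schedule, num_contexts)
    return [ctx for ctx, cycle in enumerate(last) if cycle is None]


cycle_all = [(True, c % 4) for c in range(8)]       # walk all four contexts
ping_pong = [(True, c % 2) for c in range(8)]       # alternate contexts 0 and 1
stay_put = [(True, 0)] + [(False, 0)] * 8           # read once, then hold

print(starved_contexts(cycle_all))   # no context is starved
print(starved_contexts(ping_pong))   # contexts 2 and 3 are starved
print(starved_contexts(stay_put))    # no context is starved
```

The ping-pong schedule reproduces the starvation case described in the text, while the other two schedules correspond to the two usage patterns that refresh properly.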
The context memory typically gets very stylized usage. For any single memory, writes are
infrequent. Common usage patterns are to read through all the contexts or to remain in one context
for a number of cycles. As such the usage pattern is complementary to DRAM refresh requirements.
Background Load Notice from Figure 10.10 that the write path is completely separate from the
read path. This allows background writes to occur orthogonally to normal operation. Data can be
read through the refresh inverter and we<4> with iwe disabled to prevent refresh or writeback.
At the same time, new data arriving on ewv may be loaded through ewe and written into memory
using we<3:0>.
10.4.4 Prototype Context Area Model
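Using the prototype areas, we can formulate a simplified model for the area of an $n$-context
DPGA array element:
$A_{element}(n) = A_{base} + n \cdot A_{mem}$
From the prototype:
$A_{base} = 544K\lambda^2$
$A_{mem} = 12K\lambda^2$ (tight DRAM)
$A_{mem} = 24K\lambda^2$ (prototype DRAM)
$A_{mem} = 48K\lambda^2$ (SRAM)
Based on this area model, our robust context point, $A_{base}/A_{mem}$, is 45, 23, and 11, respectively for each of
the various memory implementations.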
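The area model and the quoted robust context points can be reproduced numerically. The sketch below assumes the model is a fixed base area plus a per-context memory area, and that the robust context point is the base area divided by the per-context memory area, which is what the quoted values 45, 23, and 11 correspond to. The constant and function names are illustrative.

```python
A_BASE_KLAMBDA2 = 544  # fixed array element area from the prototype

# Per-context memory area for each memory implementation (Klambda^2).
A_MEM_KLAMBDA2 = {
    "tight DRAM": 12,
    "prototype DRAM": 24,
    "SRAM": 48,
}


def element_area(contexts: int, a_mem: float,
                 a_base: float = A_BASE_KLAMBDA2) -> float:
    """Array element area in Klambda^2 for a given number of contexts."""
    return a_base + contexts * a_mem


def robust_context_point(a_mem: float, a_base: float = A_BASE_KLAMBDA2) -> int:
    """Contexts at which context memory area equals the fixed area (rounded)."""
    return round(a_base / a_mem)


for name, a_mem in A_MEM_KLAMBDA2.items():
    print(name, robust_context_point(a_mem), element_area(4, a_mem))
```

With the prototype DRAM and four contexts, the model gives 640Kλ², which agrees with the array element area listed in Table 10.1.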
10.4.5 Prototype Conclusions
The prototype demonstrates that efficient, dynamically programmable gate arrays can be
implemented which support a single cycle, array-wide context switch. As noted in Chapter 9 and
the introduction to this chapter, when the instruction description area is small compared to the
active compute and network area, multiple context implementations are more efﬁcient than single
context implementations for a large range of application characteristics. The prototype bears out
this relationship with the context memory for each array element occupying at least an order of
magnitude less area than the ﬁxed logic and interconnect area. The prototype further shows that
the context memory read overhead can be small, only a couple of nanoseconds.
The prototype has room for improvement in many areas:
• Tighter layout – Both the memory cells and the fixed portions of the array elements are
larger than necessary for the function provided and can be improved with more careful layout
and a better understanding of the relative areas of constituent components.
• Pipeline interconnect – Over half of the cycle time on a minimum, typical cycle is in the
non-local interconnect, suggesting it may be worthwhile to optionally pipeline the non-local
interconnect to increase the achievable computational density.
• Overly limited routing – The routing in the prototype is limited and probably inadequate
for automated mapping.
• Amortize refresh logic – Separate refresh control is provided for every four memory blocks.
This function can likely be moved to higher levels and the associated area amortized over a
larger number of memory blocks.
Additional, architectural, areas for improvement over the prototype are identiﬁed in the following
sections and in the next chapter.
10.5 Circuit Evaluation
One large class of workloads for traditional FPGAs is conventional circuit evaluation. In this
section, we look at circuit levelization where traditional circuits are automatically mapped into
multicontext implementations. In latency limited designs (Section 10.5.2), the DPGA can achieve
comparable delays in less area. In applications requiring limited task throughput (Section 10.5.3),
DPGAs can often achieve the required throughput in less area.
10.5.1 Levelization
Levelized logic is a CAD technique for automatic temporal pipelining of existing circuit netlists.
Bhat refers to this technique as temporal partitioning in the context of the Dharma architecture
[Bha93]. The basic idea is to assign an evaluation context to each gate so that:
• with a total ordering on contexts, all the inputs to context $c_i$ are computed in that context or
one of its predecessors (i.e. in a context $c_j$ such that $j \le i$)
• we minimize total capacity required for the calculation by minimizing the maximum resource
usage per context
With latency constraints, we may further require that the levelized network not take any more
steps than necessary. With this assignment, the series of contexts $c_0, c_1, \ldots, c_{n-1}$
evaluates the logic netlist in sequence over $n$ microcycles. With a full levelization scheme, the
number of contexts used to evaluate a netlist is equal to the length of the critical path in the netlist.
For sake of illustration, Figure 10.18 shows a fraction of the ASCII Hex→binary circuit
extracted from Figure 10.4. The critical paths (e.g. C[1]→i8→i15→O[1]) are three elements
long. Spatially implemented, this netlist evaluates a 6 gate function in 3 cycles using 6 physical
gates. In three cycles, these six gates could have provided 6×3=18 gate evaluations, so we
underutilize all the gates in the circuit. The circuit can be fully levelized as shown in Figure 10.18
($c_0$ = {i1,i4,i7,i8}, $c_1$ = {i15}, $c_2$ = {O[1]}) or partially levelized combining the final two stages
($c_0$ = {i1,i4,i7,i8}, $c_1$ = {i15,O[1]}). If the inputs are held constant during evaluation, we need
only four LUTs to evaluate either case. In this case, if the inputs are not held constant, we will need
two additional LUTs in $c_0$ for retiming so there is no benefit to multicontext evaluation. However,
as we saw in Section 10.1, for the whole circuit, even with retiming, the total number of LUTs
needed for levelized evaluation is smaller than the number needed in a fully spatial implementation.
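The full levelization of this subcircuit can be reproduced with a short sketch. The netlist is transcribed from Figure 10.18. The retiming model is an assumption chosen to match the register-only inter-context communication of the prototype: a signal produced in one context and consumed more than one context later occupies one pass-through LUT in each intervening context. The function names are illustrative, and this ASAP assignment omits the slack-based balancing discussed in the text.

```python
# Gate -> fan-in signals, transcribed from the subcircuit in Figure 10.18.
NETLIST = {
    "i1": ["C[4]", "C[5]", "C[6]", "C[7]"],
    "i4": ["C[3]", "C[4]", "C[6]", "C[7]"],
    "i7": ["C[0]", "C[1]"],
    "i8": ["C[0]", "C[1]"],
    "i15": ["i8", "i4", "i7"],
    "O[1]": ["i1", "C[3]", "C[1]", "i15"],
}


def levelize(netlist):
    """ASAP levelization: each gate goes one context after its latest fan-in."""
    level = {}

    def visit(signal):
        if signal not in netlist:
            return -1  # primary input, available before context 0
        if signal not in level:
            level[signal] = 1 + max(visit(src) for src in netlist[signal])
        return level[signal]

    for gate in netlist:
        visit(gate)
    return level


def luts_per_context(netlist, level, hold_inputs):
    """Compute LUTs plus assumed pass-through (retiming) LUTs per context."""
    counts = [0] * (max(level.values()) + 1)
    for lvl in level.values():
        counts[lvl] += 1
    # Latest context in which each signal is consumed.
    last_use = {}
    for gate, fanin in netlist.items():
        for src in fanin:
            last_use[src] = max(last_use.get(src, -1), level[gate])
    for src, last in last_use.items():
        if src in level:
            start = level[src] + 1   # gate output: registered after its context
        elif hold_inputs:
            continue                 # held primary inputs need no retiming
        else:
            start = 0                # unheld inputs must be carried from context 0
        for ctx in range(start, last):
            counts[ctx] += 1
    return counts


levels = levelize(NETLIST)
print(levels)
print(luts_per_context(NETLIST, levels, hold_inputs=True))
print(luts_per_context(NETLIST, levels, hold_inputs=False))
```

Under these assumptions the largest context needs four LUTs when inputs are held constant and six when they are not, matching the counts given in the text for this subcircuit.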
Recall also from Section 10.1, that grouping is one of the limits to even levelization. When
there is slack in the circuit network, the slack gives us some freedom in the context placement
for components outside of the critical path. In general, this slack should be used to equalize
context size, minimizing the number of active LUTs required to achieve the desired task latency or
throughput. As we see in both this subcircuit and the full circuit, signal retiming requirements also
serve to increase the number of active elements we need in each evaluation level.
10.5.2 Latency Limited Designs
As noted in Section 10.3.2, many tasks are latency limited. This could be due to data
dependencies, such that the output from the previous evaluation must be available before subsequent
evaluation may begin. Alternately, this subtask may be the latency limiting portion of some larger
computational task. Further, the task may be one where the repetition rate is not high, but the
response time is critical. In these cases, multicontext evaluation will allow implementations with
fewer active LUTs and, consequently, less implementation area.
INORDER = C[7] C[6] C[5] C[4] C[3] C[1] C[0] ;
OUTORDER = O[1] ;
# stage 1 – 8 LUTs [C[3], C[1] pass through]
i1 = C[4] * C[5] * !C[6] * !C[7] ;
i4 = !C[3] * !C[4] * C[6] * !C[7] ;
i7 = !C[0] * C[1] ;
i8 = C[0] * !C[1] ;
# stage 2 – 9 LUTs [i1,C[3],C[1] pass through]
i15 = i8 * i4 + i7 * i4 ;
# stage 3 – 4 LUTs
O[1] = i1 * !C[3] * C[1] + i15 ;
Figure 10.18: ASCII Hex→Binary Subcircuit
We use the MCNC circuit benchmark suite to characterize the benefits of multicontext evaluation.
Each benchmark circuit is mapped to a netlist of 4-LUTs using sis [SSL+92] for technology
independent optimization and Chortle [Fra92] [BFRV92] for LUT mapping. Since we are
assuming latency is critical in this case, both sis and Chortle were run in delay mode. No
modifications to the mapping and netlist generation were made for levelized computation.
LUTs are initially assigned to evaluation contexts randomly without violating the circuit
dataflow requirements. A simple, annealing-based swapping schedule is then used to minimize total
evaluation costs. Evaluation cost is taken as the number of 4-LUTs in the final mapping including
the LUTs added to perform retiming. Table 10.7 shows the circuits mapped to a two-context DPGA,
and Table 10.8, a four-context DPGA. Table 10.9 shows the full levelization case – that is circuits
are mapped to an $n$-context DPGA, where $n$ is equal to the number of LUT delays in the circuit’s
critical path. The tables break out the number of active LUTs in the multicontext implementations
to show the effects of signal retiming requirements.
Circuit | Single Context: 4-LUTs, levels, Model Area (Mλ²) | 2-Context: 4-LUTs, 4-LUTs w/retime, Model Area (Mλ²) | Area Ratio (2-ctx/1-ctx)
5xp1 55 6 48.3 35 42 40.2 0.831
9sym 155 5 136.1 140 144 137.7 1.012
9symml 130 5 114.1 115 119 113.8 0.997
C499 406 7 356.5 219 219 209.4 0.587
C880 289 9 253.7 153 161 153.9 0.607
alu2 323 10 283.6 214 224 214.1 0.755
apex6 454 5 398.6 277 284 271.5 0.681
apex7 158 5 138.7 91 96 91.8 0.662
b9 55 3 48.3 33 43 41.1 0.851
clip 162 6 142.2 127 136 130.0 0.914
cordic 529 8 464.5 426 435 415.9 0.895
count 128 4 112.4 83 101 96.6 0.859
des 2749 8 2413.6 1397 1653 1580.3 0.655
e64 385 4 338.0 229 271 259.1 0.766
f51m 152 7 133.5 104 112 107.1 0.802
misex1 24 3 21.1 14 15 14.3 0.681
misex2 58 4 50.9 31 40 38.2 0.751
rd73 157 5 137.8 117 124 118.5 0.860
rd84 381 5 334.5 317 322 307.8 0.920
rot 398 8 349.4 217 263 251.4 0.720
sao2 98 5 86.0 61 71 67.9 0.789
vg2 92 5 80.8 51 65 62.1 0.769
z4ml 13 4 11.4 8 10 9.6 0.838
Averages: Active LUT Ratio 0.65; LUT+Retime/FPGA-LUT Ratio 0.73; Area Ratio 0.79
Model parameters: 800Kλ² fixed area per LUT; 78Kλ² instruction memory area per LUT per context
Table 10.7: MCNC Circuit Benchmarks – Latency Limited – Two-Context DPGA Implementation
Circuit | Single Context: 4-LUTs, levels, Model Area (Mλ²) | 4-Context: 4-LUTs, 4-LUTs w/retime, Model Area (Mλ²) | Area Ratio (4-ctx/1-ctx)
5xp1 55 6 48.3 19 24 26.7 0.553
9sym 155 5 136.1 66 75 83.4 0.613
9symml 130 5 114.1 53 63 70.1 0.614
C499 406 7 356.5 119 150 166.8 0.468
C880 289 9 253.7 91 121 134.6 0.530
alu2 323 10 283.6 107 125 139.0 0.490
apex6 454 5 398.6 127 246 273.6 0.686
apex7 158 5 138.7 49 87 96.7 0.697
b9 55 3 48.3 24 40 44.5 0.921
clip 162 6 142.2 56 65 72.3 0.508
cordic 529 8 464.5 226 243 270.2 0.582
count 128 4 112.4 44 70 77.8 0.693
des 2749 8 2413.6 854 1110 1234.3 0.511
e64 385 4 338.0 128 186 206.8 0.612
f51m 152 7 133.5 58 66 73.4 0.550
misex1 24 3 21.1 9 14 15.6 0.739
misex2 58 4 50.9 17 32 35.6 0.699
rd73 157 5 137.8 57 64 71.2 0.516
rd84 381 5 334.5 152 161 179.0 0.535
rot 398 8 349.4 119 214 238.0 0.681
sao2 98 5 86.0 33 43 47.8 0.556
vg2 92 5 80.8 34 50 55.6 0.688
z4ml 13 4 11.4 6 9 10.0 0.877
Averages: Active LUT Ratio 0.36; LUT+Retime/FPGA-LUT Ratio 0.49; Area Ratio 0.62
Model parameters: 800Kλ² fixed area per LUT; 78Kλ² instruction memory area per LUT per context
Table 10.8: MCNC Circuit Benchmarks – Latency Limited – Four-Context DPGA Implementation
Circuit | Single Context: 4-LUTs, levels, Model Area (Mλ²) | Context/Level: 4-LUTs, 4-LUTs w/retime, Model Area (Mλ²) | Area Ratio (level-ctx/1-ctx)
5xp1 55 6 48.3 13 23 29.2 0.604
9sym 155 5 136.1 102 111 132.1 0.971
9symml 130 5 114.1 85 93 110.7 0.970
C499 406 7 356.5 93 144 193.8 0.544
C880 289 9 253.7 55 106 159.2 0.627
alu2 323 10 283.6 55 92 145.4 0.513
apex6 454 5 398.6 128 256 304.6 0.764
apex7 158 5 138.7 40 92 109.5 0.789
b9 55 3 48.3 22 45 46.5 0.964
clip 162 6 142.2 54 63 79.9 0.562
cordic 529 8 464.5 136 184 262.0 0.564
count 128 4 112.4 48 69 76.7 0.683
des 2749 8 2413.6 456 915 1303.0 0.540
e64 385 4 338.0 132 186 206.8 0.612
f51m 152 7 133.5 45 55 74.0 0.555
misex1 24 3 21.1 12 15 15.5 0.736
misex2 58 4 50.9 19 33 36.7 0.721
rd73 157 5 137.8 60 67 79.7 0.578
rd84 381 5 334.5 187 195 232.1 0.694
rot 398 8 349.4 66 204 290.5 0.831
sao2 98 5 86.0 33 43 51.2 0.595
vg2 92 5 80.8 34 51 60.7 0.751
z4ml 13 4 11.4 6 9 10.0 0.877
Averages: Active LUT Ratio 0.34; LUT+Retime/FPGA-LUT Ratio 0.50; Area Ratio 0.70
Model parameters: 800Kλ² fixed area per LUT; 78Kλ² instruction memory area per LUT per context
Table 10.9: MCNC Circuit Benchmarks – Latency Limited – Context per Level DPGA Implementation
From the mapped results, we see a 30-40% overall area reduction using multicontext FPGAs,
with some designs achieving almost 50%. For this collection of benchmark circuits, which has
an average critical path length of 5-6 4-LUT delays, a four context DPGA gives the best, overall,
results. Note that retiming requirements for these circuits dictate that each context contain, on
average, 50% of the LUTs in the original design. Without the retiming requirements, 60-70% area
savings look possible without increasing evaluation path length.
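The Model Area and Area Ratio columns in Tables 10.7 through 10.9 are consistent with a per-LUT area of 800Kλ² plus 78Kλ² for each context. The sketch below assumes that reading of the two model parameters listed under the tables and checks it against the des rows; the function names are illustrative.

```python
A_FIXED_KLAMBDA2 = 800    # fixed area per LUT (model parameter under the tables)
A_CONTEXT_KLAMBDA2 = 78   # instruction memory per LUT per context


def model_area_mlambda2(luts: int, contexts: int) -> float:
    """Model area in Mlambda^2 for a mapping that uses the given LUT count."""
    per_lut = A_FIXED_KLAMBDA2 + contexts * A_CONTEXT_KLAMBDA2
    return luts * per_lut / 1000.0


def area_ratio(single_luts: int, multi_luts: int, contexts: int) -> float:
    """Multicontext area relative to the single-context implementation."""
    return model_area_mlambda2(multi_luts, contexts) / model_area_mlambda2(single_luts, 1)


# des benchmark: 2749 LUTs single-context; 1653 (2-context) and
# 1110 (4-context) LUTs including retiming.
print(round(model_area_mlambda2(2749, 1), 1))
print(round(area_ratio(2749, 1653, 2), 3))
print(round(area_ratio(2749, 1110, 4), 3))
```

Because every added context costs 78Kλ² on each remaining LUT, the ratio only improves while the LUT count keeps dropping faster than the per-LUT area grows, which is the saturation effect discussed next.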
In addition to task delay requirements, three effects are working together here to limit the
number of contexts which are actually beneficial for these circuits:
1. packing limitations
2. retiming requirements
3. non-trivial, ﬁnite instruction area
The annealing step explicitly minimized total LUT throughput including retiming. Nonetheless,
looking at the total number of LUTs actually used, we see the number of active LUTs actually used
for computation continues to decrease as the number of contexts increases, while the total number
of LUTs tends to level out due to retiming saturation. Figure 10.19 shows the area breakdown of
these effects in terms of the number of LUTs and total area required as a function of the number of
contexts used for the des benchmark. Figures 10.20 and 10.21 show similar data for C880 and
alu2.
Timing There are two potential sources of additional latency for the multicontext cases versus
the single context cases.
1. stage balancing time
2. context-switch time
When the number of LUT delays in the path is not an even multiple of the number of contexts,
it is not possible to allocate an even number of LUT delays to each context. For example, since
the des circuit takes eight LUT delays to evaluate, a three context implementation will place three
LUT delays in two of the three contexts and two LUT delays in the third. In a simple clocking
scheme, each context would get the same amount of time. In the des case, that would be three
LUT delays, making the total evaluation time nine LUT delays.
Changing contexts will add some latency overhead at least for registering values during context
switches. In the circuit evaluation case, the next context is always deterministic and could effectively
be pipelined in parallel with evaluation of the previous context. From the DPGA prototype, we
saw that LUT-to-LUT delay was roughly 6 ns and the context read was 2.5 ns. Register clocking
overhead is likely to be on the order of 1 ns. This gives:
LUT-to-LUT delay: 6 ns
Context switch overhead (Pipelined Read): 1 ns
Context switch overhead (Non-Pipelined Read): 2.5 ns
Figure 10.19: Area Breakdown versus Number of Contexts for des Benchmark (three plots against
number of contexts, 1 to 8: LUT count, showing LUTs+retime, LUTs, and Ideal LUTs; LUT area in Mλ²;
and total area in Mλ², showing Mapped Area and Ideal Area. Ideal Case: perfect packing and no
retiming overhead)
In the pipelined read case, we add 1 ns per context switch, or at most 1/6 ≈ 17% delay to the critical
path. In the non-pipelined read case, we add 2.5 ns per context switch, or at most 2.5/6 ≈ 42% delay
to the critical path.
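As a quick check of these percentages, the following sketch computes the fractional overhead under the worst-case assumption that a context switch follows every single LUT delay. The function name and the `lut_delays_per_context` parameter are illustrative assumptions; the 6 ns, 1 ns, and 2.5 ns figures are the ones quoted in the text.

```python
T_LUT_NS = 6.0  # LUT-to-LUT delay quoted from the DPGA prototype

def switch_overhead(switch_ns: float, lut_delays_per_context: int = 1) -> float:
    """Fraction of extra critical-path delay added by context switching."""
    # One switch is paid per context; each context does lut_delays_per_context LUT delays of work.
    return switch_ns / (lut_delays_per_context * T_LUT_NS)

print(f"pipelined read:     {switch_overhead(1.0):.0%}")
print(f"non-pipelined read: {switch_overhead(2.5):.0%}")
```

Placing more than one LUT delay in each context (a larger `lut_delays_per_context`) spreads the same switch cost over more useful work, which is why the quoted figures are upper bounds.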
The area for the multicontext implementation is smaller and the number of LUTs involved is
smaller. As a result, the interconnect traversed in each context may be more physically and logically
local, thus contributing less to the LUT-to-LUT delay.
Area In the model used, we assume that the basic interconnect area per LUT is the same in the
single and multiple context case. Since the total number of LUTs needed for the multicontext
implementation is smaller, the multicontext implementation can use an array with fewer LUTs than
the single context implementation. We saw in Section 7.6 that interconnect area grows with array
size, so the area going into interconnect will be less for the multicontext array assuming the Rent
parameter remains the same.
Area for Improvement The results presented in this section are based on:
1. LUT area model numbers
2. DPGA architecture resembling the DPGA prototype
3. conventional circuit netlist mapping
It may be possible to achieve better results by improving each of these areas.
1. component area – The model assumes an instruction storage cost based on 64 instruction
bits/LUT and conventional SRAM memory cells. Smaller context area can be achieved by
tighter instruction encoding (e.g. Section 7.8) or smaller memory cells (e.g. DRAM used in
the prototype DPGA described in Section 10.4).
2. architecture – The largest gap between the ideal case and practice is in retiming. The
architecture can be modiﬁed to better handle retiming (See Chapter 11).
3. mapping – LUT mapping which is sensitive to the retiming costs may be capable of generating
netlists with lower retiming requirements.
10.5.3 Limited Task Throughput
In Section 10.3.1, we saw that system and application requirements often limit the throughput
required out of each individual subtask or circuit. When throughput requirements are limited, we
can often meet the throughput requirement with fewer active LUTs than design LUTs, realizing a
smaller and more economical implementation.
To characterize this opportunity we again use the MCNC circuit benchmarks. sis and
Chortle are used for mapping, as before. Since we are assuming here that the target criterion
is throughput, both sis and Chortle are run in delay mode. As before, no modifications to
the mapping and netlist generation are made.
For baseline comparison in the single-context FPGA case, we insert retiming registers in the
mapped design to achieve the required throughput. That is, if we wish to produce a new result
every n LUT delays, we add pipelining registers every n LUTs in the critical path. For example,
if the critical path on a circuit is 8 LUT delays long and the desired throughput is one result every
2 LUT delays, we break the circuit into four pipeline stages, adding registers every 2 LUT delays in
the original circuit. We use a simple annealing algorithm to assign non-critical path LUTs in order
to minimize the number of retiming registers which must be added to the design.
Similarly, we divide the multicontext case into separate spatial pipeline stages such that the
path length between pipeline registers is equal to the acceptable period between results. The LUTs
within a phase are then evaluated in multicontext fashion using the available contexts. Again, if the
critical path on a circuit is 8 LUT delays long and the desired throughput is one result every 2 LUT
delays, we break the circuit into four spatial pipeline stages, adding registers every 2 LUT delays in
the original circuit. The spatial pipeline stage is further subdivided into two temporal pipeline stages
which are evaluated using two contexts. This multicontext implementation switches contexts on a
one LUT delay period. Similarly, if the desired throughput was only one result every 4 LUT delays,
the design would be divided into 2 spatial pipeline stages and up to 4 temporal pipeline stages,
depending on the number of contexts available on the target device. The same annealing algorithm
is used to assign spatial and temporal pipeline stages to non-critical path LUTs in a manner which
minimizes the number of total design and retiming LUTs required in the levelized circuit.
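The division into spatial and temporal pipeline stages for the critical path can be sketched as below. This is a simplified illustration only: it covers the stage counts in the worked examples above and ignores the annealing step that assigns non-critical path LUTs. The name `pipeline_plan` and its parameters are hypothetical.

```python
import math

def pipeline_plan(critical_path: int, period: int, contexts_available: int):
    """Return (spatial_stages, temporal_stages); path and period are in LUT delays."""
    # Registers every `period` LUT delays give the spatial pipeline stages.
    spatial = math.ceil(critical_path / period)
    # Each spatial stage can be time-multiplexed over at most `period` contexts,
    # limited by how many contexts the target device actually provides.
    temporal = min(period, contexts_available)
    return spatial, temporal

print(pipeline_plan(8, 2, 8))  # (4, 2): four spatial stages, two contexts each
print(pipeline_plan(8, 4, 8))  # (2, 4): two spatial stages, four contexts each
print(pipeline_plan(8, 4, 2))  # (2, 2): limited by a two-context device
```

With `contexts_available=1` the plan degenerates to the single-context baseline, where only spatial pipelining is used.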
As the throughput requirements diminish, we can generally achieve smaller implementations.
Unfortunately, as noted in the previous section, retiming requirements prevent us from effectively
using a large number of contexts to decrease implementation area. For the alu2 benchmark,
Table 10.10 shows how LUT requirements vary with throughput and Table 10.11 translates the
LUT requirements into areas based on the model parameters used in the previous section.
Figure 10.22 plots the areas from Table 10.11. Table 10.12 recasts the areas from Table 10.11 as
ratios to the best implementation area at a given throughput. For this circuit, the four or five
context implementation is roughly 45% smaller than the single context implementation for low
throughput requirements.
Tables 10.14 through 10.16 highlight area ratios at three throughput points for the entire
benchmark set. For reference, Table 10.13 summarizes the number of mapped design LUTs and
path lengths for the netlists used for these experiments. We see that the 2-4 context implementations
are 20-30% smaller than the single context implementations for low throughput requirements.
Rows: clocks per result. Columns: number of Contexts. Entries: LUTs including Retiming.
clocks/result 1 2 3 4 5 6 7 8
1 585 585 585 585 585 585 585 585
2 353 347 347 347 347 347 347 347
3 286 252 252 252 252 252 252 252
4 240 207 161 161 161 161 161 161
5 216 188 161 139 139 139 139 139
6 212 185 156 139 126 126 126 126
7 189 145 143 139 124 118 118 118
8 189 145 143 137 124 118 110 110
9 189 145 134 125 124 118 110 110
10 178 138 129 122 120 106 96 86
11 178 138 128 120 99 99 96 86
12 178 138 128 120 99 99 96 86
13 178 128 128 120 99 99 96 86
14 178 128 127 116 99 99 96 86
15 178 127 124 116 99 99 86 86
16 178 126 116 116 99 99 86 86
17 178 125 116 116 99 99 86 86
18 178 125 116 116 99 99 86 86
19 169 91 86 73 70 69 68 68
20 169 91 86 73 68 67 67 66
Design LUTs: 169
Critical Path: 19 LUT delays
Table 10.10: Multicontext Implementations of alu2 versus Throughput (LUTs)
[Figure 10.20 plots, each versus Contexts (1-9): LUTs (series: LUTs+retime, LUTs, Ideal LUTs); Area in Mλ² (series: Mapped Area, Ideal Area). Ideal Case: perfect packing and no retiming overhead.]
Figure 10.20: Area Breakdown versus Number of Contexts for C880 Benchmark
[Figure 10.21 plots, each versus Contexts (1-10): LUTs (series: LUTs+retime, LUTs, Ideal LUTs); Area in Mλ² (series: Mapped Area, Ideal Area). Ideal Case: perfect packing and no retiming overhead.]
Figure 10.21: Area Breakdown versus Number of Contexts for alu2 Benchmark
Rows: clocks per result. Columns: number of Contexts. Entries: Model Area in Mλ².
clocks/result 1 2 3 4 5 6 7 8
1 513.6 589.7 635.3 680.9 726.6 772.2 817.8 863.5
2 309.9 349.8 376.8 403.9 431.0 458.0 485.1 512.2
3 251.1 254.0 273.7 293.3 313.0 332.6 352.3 372.0
4 210.7 208.7 174.8 187.4 200.0 212.5 225.1 237.6
5 189.6 189.5 174.8 161.8 172.6 183.5 194.3 205.2
6 186.1 186.5 169.4 161.8 156.5 166.3 176.1 186.0
7 165.9 146.2 155.3 161.8 154.0 155.8 165.0 174.2
8 165.9 146.2 155.3 159.5 154.0 155.8 153.8 162.4
9 165.9 146.2 145.5 145.5 154.0 155.8 153.8 162.4
10 156.3 139.1 140.1 142.0 149.0 139.9 134.2 126.9
11 156.3 139.1 139.0 139.7 123.0 130.7 134.2 126.9
12 156.3 139.1 139.0 139.7 123.0 130.7 134.2 126.9
13 156.3 129.0 139.0 139.7 123.0 130.7 134.2 126.9
14 156.3 129.0 137.9 135.0 123.0 130.7 134.2 126.9
15 156.3 128.0 134.7 135.0 123.0 130.7 120.2 126.9
16 156.3 127.0 126.0 135.0 123.0 130.7 120.2 126.9
17 156.3 126.0 126.0 135.0 123.0 130.7 120.2 126.9
18 156.3 126.0 126.0 135.0 123.0 130.7 120.2 126.9
19 148.4 91.7 93.4 85.0 86.9 91.1 95.1 100.4
20 148.4 91.7 93.4 85.0 84.5 88.4 93.7 97.4
$A_{base} = 800K\lambda^2$
$A_{ctx} = 78K\lambda^2$
Table 10.11: Multicontext Implementations of alu2 versus Throughput (Area)
Rows: clocks per result. Columns: number of Contexts. Entries: Area/Best Area.
clocks/result 1 2 3 4 5 6 7 8
1 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
2 1.00 1.13 1.22 1.30 1.39 1.48 1.57 1.65
3 1.00 1.01 1.09 1.17 1.25 1.32 1.40 1.48
4 1.21 1.19 1.00 1.07 1.14 1.22 1.29 1.36
5 1.17 1.17 1.08 1.00 1.07 1.13 1.20 1.27
6 1.19 1.19 1.08 1.03 1.00 1.06 1.13 1.19
7 1.14 1.00 1.06 1.11 1.05 1.07 1.13 1.19
8 1.14 1.00 1.06 1.09 1.05 1.07 1.05 1.11
9 1.14 1.00 1.00 1.00 1.06 1.07 1.06 1.12
10 1.23 1.10 1.10 1.12 1.17 1.10 1.06 1.00
11 1.27 1.13 1.13 1.14 1.00 1.06 1.09 1.03
12 1.27 1.13 1.13 1.14 1.00 1.06 1.09 1.03
13 1.27 1.05 1.13 1.14 1.00 1.06 1.09 1.03
14 1.27 1.05 1.12 1.10 1.00 1.06 1.09 1.03
15 1.30 1.06 1.12 1.12 1.02 1.09 1.00 1.06
16 1.30 1.06 1.05 1.12 1.02 1.09 1.00 1.06
17 1.30 1.05 1.05 1.12 1.02 1.09 1.00 1.06
18 1.30 1.05 1.05 1.12 1.02 1.09 1.00 1.06
19 1.75 1.08 1.10 1.00 1.02 1.07 1.12 1.18
20 1.76 1.09 1.11 1.01 1.00 1.05 1.11 1.15
Table 10.12: Multicontext Implementations of alu2 versus Throughput (Area Ratios)
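The Area/Best Area entries follow directly from a row of Table 10.11: each area is divided by the smallest area in that row. The sketch below reproduces the 20 clocks per result row; the function name `area_ratios` is an illustrative choice.

```python
def area_ratios(areas):
    """Normalize each area to the smallest area in the row, rounded as in Table 10.12."""
    best = min(areas)
    return [round(a / best, 2) for a in areas]

# Table 10.11, 20 clocks per result, contexts 1 through 8 (areas in M lambda^2)
row_20 = [148.4, 91.7, 93.4, 85.0, 84.5, 88.4, 93.7, 97.4]
print(area_ratios(row_20))
# [1.76, 1.09, 1.11, 1.01, 1.0, 1.05, 1.11, 1.15]
```

The entry equal to 1.00 marks the context count with the smallest area at that throughput, here the five-context implementation.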
[Figure 10.22 plot: Area in Mλ² versus Clocks per Result; series: 1 context, 2 context, 4 context.]
Figure 10.22: Area versus Throughput for Multicontext Implementations of alu2 Benchmark
Circuit Mapped Design LUTs Path Length
5xp1 46 10
9sym 123 7
9symml 108 8
C499 85 10
C880 176 21
alu2 169 19
apex6 248 9
apex7 77 7
b9 46 7
clip 121 9
cordic 367 13
count 46 16
des 1267 13
e64 230 9
f51m 45 17
misex1 20 6
misex2 38 8
rd73 105 10
rd84 150 9
rot 293 16
sao2 73 9
vg2 60 9
z4ml 8 7
Table 10.13: Benchmark Set Area – Mapped Characteristics
Area/Best Area by number of Contexts
Circuit 1 2 3 4 5 6 7 8
5xp1 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
9sym 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
9symml 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
C499 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
C880 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
alu2 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
apex6 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
apex7 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
b9 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
clip 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
cordic 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
count 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
des 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
e64 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
f51m 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
misex1 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
misex2 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
rd73 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
rd84 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
rot 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
sao2 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
vg2 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
z4ml 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
Clocks per Result = 1
Table 10.14: Selected Area/Throughput Points for Benchmark Set (1 Clock/Result)
Area/Best Area by number of Contexts
Circuit 1 2 3 4 5 6 7 8
5xp1 1.30 1.17 1.05 1.01 1.00 1.06 1.13 1.14
9sym 1.69 1.04 1.00 1.05 1.07 1.13 1.20 1.27
9symml 1.59 1.00 1.00 1.06 1.07 1.13 1.20 1.19
C499 1.13 1.06 1.06 1.04 1.00 1.06 1.13 1.19
C880 1.24 1.15 1.16 1.15 1.19 1.07 1.00 1.06
alu2 1.23 1.10 1.10 1.12 1.17 1.10 1.06 1.00
apex6 1.10 1.00 1.04 1.12 1.14 1.20 1.27 1.34
apex7 1.10 1.00 1.02 1.08 1.13 1.20 1.27 1.32
b9 1.03 1.00 1.05 1.13 1.20 1.28 1.35 1.43
clip 1.64 1.20 1.04 1.05 1.00 1.00 1.06 1.12
cordic 1.58 1.21 1.27 1.00 1.01 1.02 1.00 1.06
count 1.00 1.04 1.12 1.10 1.18 1.25 1.17 1.24
des 1.19 1.00 1.05 1.06 1.12 1.14 1.16 1.22
e64 1.42 1.00 1.07 1.15 1.22 1.30 1.38 1.45
f51m 1.17 1.00 1.05 1.04 1.11 1.15 1.15 1.07
misex1 1.24 1.00 1.08 1.15 1.23 1.31 1.39 1.36
misex2 1.23 1.00 1.08 1.15 1.19 1.21 1.28 1.36
rd73 1.63 1.00 1.00 1.07 1.10 1.12 1.19 1.20
rd84 1.78 1.02 1.00 1.07 1.14 1.22 1.29 1.34
rot 1.00 1.01 1.03 1.03 1.10 1.13 1.15 1.20
sao2 1.55 1.00 1.02 1.10 1.11 1.18 1.25 1.32
vg2 1.34 1.00 1.02 1.04 1.11 1.14 1.21 1.28
z4ml 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
average 1.31 1.05 1.07 1.09 1.13 1.17 1.21 1.25
Clocks per Result = 10
Table 10.15: Selected Area/Throughput Points for Benchmark Set (10 Clock/Result)
Area/Best Area by number of Contexts
Circuit 1 2 3 4 5 6 7 8
5xp1 1.48 1.03 1.03 1.02 1.00 1.06 1.13 1.19
9sym 1.69 1.04 1.00 1.04 1.07 1.13 1.20 1.27
9symml 1.77 1.09 1.10 1.00 1.07 1.13 1.20 1.27
C499 1.21 1.04 1.00 1.02 1.06 1.11 1.15 1.22
C880 1.16 1.02 1.00 1.06 1.05 1.02 1.04 1.10
alu2 1.76 1.09 1.11 1.01 1.00 1.05 1.11 1.15
apex6 1.16 1.00 1.08 1.08 1.15 1.23 1.30 1.37
apex7 1.12 1.00 1.04 1.10 1.15 1.22 1.29 1.34
b9 1.05 1.00 1.05 1.12 1.20 1.28 1.35 1.43
clip 1.79 1.10 1.06 1.00 1.07 1.09 1.15 1.17
cordic 1.96 1.15 1.05 1.00 1.07 1.10 1.11 1.17
count 1.00 1.05 1.05 1.10 1.14 1.21 1.25 1.32
des 1.47 1.00 1.01 1.02 1.09 1.13 1.17 1.23
e64 1.43 1.00 1.08 1.15 1.23 1.31 1.39 1.46
f51m 1.52 1.08 1.00 1.03 1.05 1.06 1.13 1.13
misex1 1.34 1.00 1.08 1.15 1.23 1.31 1.39 1.46
misex2 1.23 1.00 1.08 1.11 1.19 1.21 1.28 1.36
rd73 1.72 1.05 1.03 1.00 1.07 1.13 1.20 1.27
rd84 1.78 1.02 1.00 1.06 1.09 1.16 1.23 1.30
rot 1.27 1.01 1.00 1.05 1.09 1.15 1.22 1.27
sao2 1.55 1.00 1.00 1.04 1.08 1.15 1.15 1.22
vg2 1.39 1.03 1.00 1.04 1.11 1.15 1.21 1.28
z4ml 1.00 1.15 1.24 1.33 1.41 1.50 1.59 1.68
average 1.43 1.04 1.05 1.07 1.12 1.17 1.23 1.29
Clocks per Result = 20
Table 10.16: Selected Area/Throughput Points for Benchmark Set (20 Clock/Result)
Area for Improvement The results shown here are moderately disappointing. Retiming requirements
prevent us from collapsing the number of active LUTs substantially as we go to deeper
multicontext implementations. As with the previous section, the results presented in this section
are based on our area model, the prototype DPGA architecture, and conventional circuit netlist
mapping. More than in the previous section, the results here also depend upon the experimental
temporal partitioning CAD software. Groupings into temporal and spatial pipelining stages are
more rigid than necessary, so better packing may be possible with more flexible stage assignment.
The retiming limitations identified here also motivate architectural modifications which we will see
in the next chapter.
[Figure 10.23 plot: Area Ratio (0.00 to 1.00) versus Number of Contexts (C) (0 to 24).]
Figure 10.23: Area Ratio versus Number of Contexts (C) for Coarse-grain Interleaved Contexts
Time-Sliced Interleaving The retiming limitation we are encountering here arises largely from
packing the circuit into a limited number of LUTs and serializing the communication of intermediate
results. An alternate strategy would be to share a larger group of LUTs more coarsely between
multiple subcircuits in a time-sliced fashion. That is, rather than trying to sequentialize the
evaluation, we retain the full circuit, or a partially sequentialized version, and only invoke it
periodically.
Considering again our alu2 example, for moderately low throughput tasks, one context may
hold the 169 mapped design LUTs, while other contexts hold other, independent tasks. A two
context DPGA could alternate between evaluating the alu2 example and some other circuit
or task. With the model parameters $A_{base} = 800K\lambda^2$ and $A_{ctx} = 78K\lambda^2$, the amortized
area in this two context case would be:
$\frac{1}{2} \cdot 169 \cdot (A_{base} + 2 A_{ctx}) \approx 81M\lambda^2$
Note that 81Mλ² is smaller than the 91Mλ² area which the two context, non-interleaved
implementation achieved and smaller than the 84-85Mλ² for the four and five context implementations
(Table 10.11). Further interleaving can yield even lower amortized costs, e.g.
$\frac{1}{4} \cdot 169 \cdot (A_{base} + 4 A_{ctx}) \approx 47M\lambda^2$
This coarse-grain interleaving achieves a more ideal reduction in area:
$\frac{A_{interleaved}(C)}{A_{single}} = \frac{1}{C} \cdot \frac{A_{base} + C \cdot A_{ctx}}{A_{base} + A_{ctx}}$ (10.5)
Figure 10.23 plots the area ratio versus $C$. Note that the ratio is ultimately bounded
by the $A_{ctx}/(A_{base} + A_{ctx})$ ratio, which is roughly 10% for the model parameters assumed throughout this
section. On the negative side:
• Coarse-grained interleaving is only suitable for very low throughput or when the tasks
themselves have a moderately short evaluation path to begin with.
• Each task cannot be given its own, independent set of LUTs, but must share a larger number
of LUTs with separate tasks.
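The amortized areas quoted above can be checked with a short sketch. It assumes the per-LUT area model of a fixed base area plus one context memory per context, using the 800Kλ² and 78Kλ² parameters listed with Table 10.11; this reading reproduces the 81Mλ² and 47Mλ² figures. The function names and the closed form of the area ratio are assumptions made for illustration.

```python
A_BASE_K = 800  # K lambda^2 per LUT, excluding context memory (model parameter)
A_CTX_K = 78    # K lambda^2 per LUT per context (model parameter)

def amortized_area_m(luts: int, contexts: int) -> float:
    """Area in M lambda^2 charged to one task when `contexts` tasks share the array equally."""
    total_k = luts * (A_BASE_K + contexts * A_CTX_K)
    return total_k / contexts / 1000.0

def interleaved_area_ratio(contexts: int) -> float:
    """Amortized area relative to a dedicated single-context implementation (assumed form)."""
    return (A_BASE_K + contexts * A_CTX_K) / (contexts * (A_BASE_K + A_CTX_K))

print(round(amortized_area_m(169, 2)))  # 81
print(round(amortized_area_m(169, 4)))  # 47
for c in (1, 2, 4, 8, 16):
    print(c, round(interleaved_area_ratio(c), 2))
```

The ratio starts at 1 for a single context and flattens out as the context memories, which are not shared, come to dominate the per-task area.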
10.6 Temporally Varying Logic – Finite State Machines
As we noted in Sections 10.3.2 and 10.3.3, the performance of a finite state machine is dictated
by its latency rather than its throughput. Since the next state calculation must complete and be fed
back to the input of the FSM before the next state behavior can begin, there is no benefit to be gained
from spatial pipelining within the FSM logic. Temporal pipelining can be used within the FSM
logic to increase gate and wire utilization as seen in Section 10.5.2. Finite state machines, however,
happen to have additional structure over random logic which can be exploited. In particular, one
never needs the full FSM logic at any point in time. During any cycle, the logic from only one
state is active. In a traditional FPGA, we have to implement all of this logic at once in order to
get the full FSM functionality. With multiple contexts, each context need only contain a portion
of the state graph. When the machine transitions to a state whose logic resides in another context,
we can switch contexts, making a different portion of the FSM active. National Semiconductor,
for example, exploits this feature in their multicontext programmable logic array (PLA), MAPL
[Haw91].
10.6.1 Example
Figure 10.24 shows a simple, four-state FSM for illustrative purposes. The conventional,
single-context implementation requires four 4-LUTs, one to implement each of Dout and NS1
and two to calculate NS0. Figure 10.25 shows a two-context DPGA implementation of this same
FSM. The design is partitioned into two separate circuits based on the original state variable S1.
The two circuits are placed in separate contexts and NS1 is used to select the circuit to execute as
appropriate. Each circuit only requires three 4-LUTs, making the overall design smaller than the
ﬂat, single context implementation.
FSM Description
  Idle (00):
    if (Acyc & myAddr & Read)
      goto Wait1
    else
      goto Idle
  Wait1 (01):
    goto Data
  Data (10):
    Assert Dout
    goto Wait2
  Wait2 (11):
    goto Idle
FSM Logic
  Dout = S1*/S0
  NS0 = /S1*/S0*Acyc*myAddr*Read + S1*/S0
  NS1 = /S1*S0 + S1*/S0
[Block diagram: FSM Logic block with inputs Acyc, myAddr, Read and the fed-back state; output Dout.]
Figure 10.24: Simple FSM Example
Context 0 (S1=0)
  Dout = 0
  NS0 = /S0*Acyc*myAddr*Read
  NS1 = S0
Context 1 (S1=1)
  Dout = /S0
  NS0 = /S0
  NS1 = /S0
[Block diagram: FSM Logic block with inputs Acyc, myAddr, Read and the fed-back state; output Dout; context select.]
Figure 10.25: Two Context Implementation of Simple FSM Example
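A small exhaustive check can confirm that partitioning on S1 preserves the behavior of the flat FSM logic of Figure 10.24. The sketch below is illustrative only. Context 1's Dout is taken as /S0, which is what Dout = S1*/S0 reduces to when S1 = 1, and all function names are hypothetical.

```python
from itertools import product

def flat_fsm(s1, s0, acyc, my_addr, read):
    """Single-context logic from Figure 10.24; returns (Dout, NS1, NS0)."""
    dout = s1 and not s0
    ns0 = (not s1 and not s0 and acyc and my_addr and read) or (s1 and not s0)
    ns1 = (not s1 and s0) or (s1 and not s0)
    return bool(dout), bool(ns1), bool(ns0)

def context0(s0, acyc, my_addr, read):
    """Logic active while S1 = 0 (states Idle and Wait1)."""
    return False, bool(s0), bool(not s0 and acyc and my_addr and read)

def context1(s0, acyc, my_addr, read):
    """Logic active while S1 = 1 (states Data and Wait2)."""
    return bool(not s0), bool(not s0), bool(not s0)

def two_context_fsm(s1, s0, acyc, my_addr, read):
    """S1 acts as the context select; only S0 and the inputs reach the LUTs."""
    ctx = context1 if s1 else context0
    return ctx(s0, acyc, my_addr, read)

mismatches = [v for v in product([False, True], repeat=5)
              if flat_fsm(*v) != two_context_fsm(*v)]
print(len(mismatches))  # 0
```

Because S1 no longer appears as a LUT input, every function in each context depends on at most four signals, which is consistent with each context needing only three 4-LUTs.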
10.6.2 Full Temporal Partitioning
In the most extreme case, each FSM state is assigned its own context. The next state computation
simply selects the appropriate next context in which to operate. Tables 10.17 and 10.18 show
the reduction in area and path delay which results from state-per-context multiple context
implementation of the MCNC FSM benchmarks. FSMs were mapped using mustang [DMNSV88].
Logic minimization and LUT mapping were done with espresso, sis, and Chortle. For
single context FSM implementations, both one-hot and dense encodings were synthesized and the
best mapping was selected. The multicontext FSM implementations use dense encodings so the
state specification can directly serve as the context select. For multicontext implementations, delay
and capacity are dictated by the logic required for the largest and slowest state. On average, the fully
partitioned, multicontext implementation is 35-45% smaller than the single context implementation.
Many FSMs are 3-5× smaller.
Timing From Tables 10.17 and 10.18, the multicontext FSM implementations generally have
one or two fewer logic levels in their critical path than the single context implementations when
mapped for minimum latency. The multicontext implementations have an even greater reduction in
path length when mapped for minimum area. The multicontext FSMs, however, require additional
time to distribute the context select and perform the multi-context read, i.e.
$t_{single} = t_{LUT} \cdot Levels_{single}$ (10.6)
$t_{multi} = t_{distribute} + t_{read} + t_{LUT} \cdot Levels_{multi}$ (10.7)
Recall $t_{read} \approx 2.5$ ns from the prototype and $t_{LUT} \approx 6$ ns with a typical amount of switching.
Properly engineered, context distribution should take a few nanoseconds, which means the
multicontext and single-context implementations run at comparable speeds when the multicontext
implementation has one fewer LUT delay in its critical path than the single-context implementation.
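The break-even argument can be made concrete with a short sketch. The 6 ns LUT delay and 2.5 ns context read are the prototype figures quoted in the text; the 3 ns context distribution time is purely an assumed stand-in for "a few nanoseconds", and the function names are hypothetical.

```python
T_LUT_NS = 6.0    # LUT-to-LUT delay with typical switching (from the text)
T_READ_NS = 2.5   # multicontext read time from the prototype (from the text)

def single_context_delay(levels: int) -> float:
    """Cycle time of a single-context FSM with the given number of LUT levels."""
    return levels * T_LUT_NS

def multicontext_delay(levels: int, t_distribute_ns: float = 3.0) -> float:
    """Cycle time of a multicontext FSM; t_distribute_ns is an assumed value."""
    return t_distribute_ns + T_READ_NS + levels * T_LUT_NS

# One fewer logic level roughly pays for the context select and read.
print(single_context_delay(4))   # 24.0
print(multicontext_delay(3))     # 23.5
# Two fewer levels leave the multicontext implementation clearly faster.
print(multicontext_delay(2))     # 17.5
```

Under these assumptions the fixed overhead of 5.5 ns is slightly less than one LUT delay, which is why a one-level reduction gives comparable speed.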
10.6.3 Partial Temporal Partitioning
The capacity utilization and delay are often dictated by a few of the more complex states. It is
often possible to reduce the number of contexts required without increasing the capacity required
or increasing the delay. Tables 10.19 and 10.20 show the cse benchmark partitioned into various
numbers of contexts and optimized for area or path delay, respectively. These partitions were
obtained by partitioning along mustang assigned state bits starting with a four bit state encoding.
Figures 10.26 and 10.27 plot the LUT count, area, and delay data from the tables versus the number
of contexts employed.
One thing we note from both the introductory example (Figures 10.24 and 10.25) and the cse
example is that the full state-per-context case is not always the most area efficient mapping. In the
introductory example, once we had partitioned to two contexts, no further LUT reduction could be
realized by going to four contexts. Consequently, the four context implementation would be larger
than the two context implementation owing to the deeper context memories. In the cse example,
the reduction in LUTs associated with going from 8 to 11 or 11 to 16 contexts saved less
area than the cost of the additional context memories.
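The tradeoff between fewer LUTs and deeper context memories can be illustrated with the cse LUT counts from Table 10.19. The sketch assumes a per-LUT area of 800Kλ² plus 78Kλ² per context, which reproduces the tabulated areas to the precision shown; the function name `model_area_m` is an illustrative choice.

```python
A_BASE_K = 800  # K lambda^2 per LUT, excluding context memory (model parameter)
A_CTX_K = 78    # K lambda^2 per LUT per context (model parameter)

def model_area_m(luts: int, contexts: int) -> float:
    """Array area in M lambda^2 for `luts` LUTs, each carrying `contexts` context memories."""
    return luts * (A_BASE_K + contexts * A_CTX_K) / 1000.0

# cse, area target (Table 10.19): contexts -> 4-LUTs required
cse_luts = {2: 56, 4: 35, 8: 19, 11: 18, 16: 15}
areas = {c: round(model_area_m(n, c), 1) for c, n in cse_luts.items()}
print(areas)  # {2: 53.5, 4: 38.9, 8: 27.1, 11: 29.8, 16: 30.7}
best = min(areas, key=areas.get)
print(best)   # 8
```

Going from 8 to 16 contexts removes only four LUTs while doubling the context memory carried by every remaining LUT, so the eight-context partition gives the smallest area in this example.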
Columns: FSM; States; Single Context (Levels, 4-LUTs, Area [Mλ²]); Context per State (Levels, 4-LUTs, Area [Mλ²]); Area Ratio; Delta Levels
bbara 10 6 25 22.0 1 6 9.5 0.43 5
bbsse 16 4 50 43.9 3 12 24.6 0.56 1
bbtas 6 3 7 6.1 1 5 6.34 1.0 2
beecount 7 4 14 12.3 1 7 9.4 0.77 3
cse 16 6 83 72.9 2 15 30.7 0.42 4
dk14 7 4 58 50.9 1 8 10.8 0.21 3
dk15 4 12 25 22.0 1 7 7.8 0.35 11
dk16 27 5 80 70.2 1 8 23.2 0.33 4
dk17 8 6 19 16.7 1 6 8.5 0.51 5
dk512 15 2 20 17.6 1 7 13.8 0.79 1
donﬁle 24 2 46 40.4 1 6 16.0 0.40 1
ex1 20 7 120 105.4 2 26 61.4 0.58 5
ex4 14 7 21 18.4 1 13 24.6 1.33 6
ex6 8 5 57 50.0 1 11 15.7 0.31 4
keyb 19 7 112 98.3 4 14 32.0 0.32 3
mc 4 2 8 7.0 1 7 7.8 1.10 1
modulo12 12 6 12 10.5 1 5 8.7 0.82 5
planet 48 6 150 131.7 1 25 113.6 0.86 5
pma 24 6 82 72.0 2 15 40.1 0.56 4
s1 20 5 137 120.3 5 25 59.0 0.49 0
s1488 48 6 152 133.5 3 27 122.7 0.92 3
s1a 20 5 72 63.2 7 21 49.6 0.78 -2
s208 18 4 38 33.4 1 7 15.4 0.46 3
s27 6 2 5 4.4 1 4 5.1 1.20 1
s386 13 5 42 36.9 2 12 21.8 0.59 3
s420 18 3 40 35.1 1 7 15.4 0.44 2
s510 47 5 54 47.4 1 13 58.1 1.22 4
s8 5 4 12 10.5 1 4 4.7 0.45 3
s820 25 6 92 80.8 3 30 82.5 1.02 3
sand 32 7 178 156.3 5 30 98.9 0.63 2
sse 16 4 50 43.9 3 12 24.6 0.56 1
styr 30 7 186 163.3 4 21 65.9 0.40 3
tbk 32 8 340 298.5 6 33 108.8 0.36 2
Average 0.64 3
Table 10.17: Full Partitioning of MCNC FSM Benchmarks (Area Target)
Columns: FSM; States; Single Context (Levels, 4-LUTs, Area [Mλ²]); Context per State (Levels, 4-LUTs, Area [Mλ²]); Area Ratio; Delta Levels
bbara 10 3 40 35.1 1 6 9.5 0.27 2
bbsse 16 3 60 52.7 2 14 28.7 0.54 1
bbtas 6 2 9 7.9 1 5 6.3 0.80 1
beecount 7 2 19 16.7 1 7 9.4 0.57 1
cse 16 4 97 85.2 2 15 30.7 0.36 2
dk14 7 3 67 58.8 1 8 10.8 0.18 2
dk15 4 3 37 32.5 1 7 7.8 0.24 2
dk16 27 3 83 72.9 1 8 23.2 0.32 2
dk17 8 2 26 22.8 1 6 8.5 0.37 1
dk512 15 2 20 17.6 1 7 13.8 0.79 1
donﬁle 24 2 46 40.4 1 6 16.0 0.40 1
ex1 20 4 151 132.6 2 26 61.4 0.46 2
ex4 14 2 25 22.0 1 13 24.6 1.12 1
ex6 8 3 62 54.4 1 11 15.7 0.29 2
keyb 19 4 150 131.7 3 26 59.3 0.45 1
mc 4 2 8 7.0 1 7 7.8 1.10 1
modulo12 12 1 13 11.4 1 5 8.7 0.76 0
planet 48 4 172 151.0 1 25 113.6 0.75 3
pma 24 4 139 122.0 2 15 40.1 0.33 2
s1 20 4 195 171.2 3 30 70.8 0.41 1
s1488 48 4 183 160.7 2 28 127.2 0.79 2
s1a 20 3 107 93.9 4 30 70.8 0.75 -1
s208 18 3 40 35.1 1 7 15.4 0.44 2
s27 6 2 5 4.4 1 4 5.0 1.16 1
s386 13 4 54 47.4 2 12 21.8 0.46 2
s420 18 3 40 35.1 1 7 15.4 0.44 2
s510 47 3 76 66.7 1 13 58.1 0.87 2
s8 5 2 13 11.4 1 4 4.8 0.42 1
s820 25 3 137 120.3 3 30 82.5 0.69 0
sand 32 4 224 196.7 3 43 141.7 0.72 1
sse 16 3 60 52.7 2 14 28.7 0.54 1
styr 30 5 285 250.2 3 23 72.2 0.29 2
tbk 32 5 510 447.8 4 42 138.4 0.31 1
Average 0.56 1.36
Table 10.18: Full Partitioning of MCNC FSM Benchmarks (Delay Target)
[Figure 10.26 plots, each versus Number of Contexts (C) (0 to 16): number of 4-LUTs; Area in Mλ²; Logic Levels (L).]
Figure 10.26: Area and Delay versus Number of Contexts for cse FSM Benchmark (Area Target)
[Figure 10.27 plots, each versus Number of Contexts (C) (0 to 16): number of 4-LUTs; Area in Mλ²; Logic Levels (L).]
Figure 10.27: Area and Delay versus Number of Contexts for cse FSM Benchmark (Delay Target)
Multicontext Implementations for CSE FSM
Contexts Levels 4-LUTs Area [Mλ²] Area Ratio Delta Levels
(one-hot)
1 6 83 72.9 1.00 0
(dense)
1 8 102 89.6 1.23 -2
2 6 56 53.5 0.73 0
4 5 35 38.9 0.53 1
8 5 19 27.1 0.37 1
11 2 18 29.8 0.41 4
16 2 15 30.7 0.42 4
Table 10.19: Area and Delay versus Number of Contexts for cse FSM Benchmark (Area Target)
Multicontext Implementations for CSE FSM
Contexts Levels 4-LUTs Area [Mλ²] Area Ratio Delta Levels
(one-hot)
1 4 97 85.2 1.00 0
(dense)
1 5 156 137.0 1.60 -1
2 4 83 79.3 0.93 0
4 4 36 40.0 0.47 0
8 4 22 31.3 0.37 0
11 2 18 29.8 0.35 2
16 2 15 30.7 0.36 2
Table 10.20: Area and Delay versus Number of Contexts for cse FSM Benchmark (Delay Target)
Tables 10.21 through 10.25 show the benchmark set mapped to various multicontext
implementations for minimum area. All partitioning is performed along mustang state bits. For these
results, we examined all possible state bits along which to split and chose the best set. On average
across the benchmark set, the 8-context mapping saves over 40% in area versus the best
single-context case. The best multicontext mapping is often 3-5× smaller than the best single context
mapping.
Best LUTs by Number of Contexts
Columns: FSM; States; Single Context; Dense Encodings with 1 2 4 8 16 32 64 Contexts
bbara 10 25 25 19 12 8 6 6 6
bbsse 16 50 68 39 20 15 12 12 12
bbtas 6 7 7 5 5 5 5 5 5
beecount 7 14 14 11 7 7 7 7 7
cse 16 83 102 56 35 19 15 15 15
dk14 7 58 58 22 8 8 8 8 8
dk15 4 25 25 7 7 7 7 7 7
dk16 27 80 162 57 27 8 8 8 8
dk17 8 19 19 6 6 6 6 6 6
dk512 15 20 21 7 7 7 7 7 7
donﬁle 24 46 162 57 31 6 6 6 6
ex1 20 120 193 85 59 39 31 26 26
ex4 14 21 21 20 14 13 13 13 13
ex6 8 57 83 31 17 11 11 11 11
keyb 19 112 173 65 37 22 17 14 14
mc 4 8 8 7 7 7 7 7 7
modulo12 12 12 12 5 5 5 5 5 5
planet 48 150 346 122 80 38 29 27 25
pma 24 82 82 78 45 24 18 15 15
s1 20 137 196 144 57 44 28 25 25
s1488 48 152 305 153 129 52 34 28 27
s1a 20 72 136 73 51 38 24 21 21
s208 18 38 55 35 14 9 8 7 7
s27 6 5 5 5 4 4 4 4 4
s386 13 42 64 35 19 13 12 12 12
s420 18 40 54 33 15 10 8 7 7
s510 47 54 133 95 35 18 13 13 13
s8 5 12 12 13 9 4 4 4 4
s820 25 92 245 92 62 45 32 30 30
sand 32 178 358 139 95 44 33 30 30
sse 16 50 68 39 20 15 12 12 12
styr 30 186 387 133 67 40 28 21 21
tbk 32 340 513 137 75 48 35 33 33
Table 10.21: MCNC FSM Benchmarks LUTs v/s Number of Contexts (Area Target)
Best Area [Mλ²] by Number of Contexts
Columns: FSM; States; Single Context; Dense Encodings with 1 2 4 8 16 32 64 Contexts
bbara 10 21.9 21.9 18.2 13.3 11.4 12.3 19.8 34.8
bbsse 16 43.9 59.7 37.3 22.2 21.4 24.6 39.6 69.5
bbtas 6 6.1 6.1 4.8 5.6 7.1 10.2 16.5 29.0
beecount 7 12.3 12.3 10.5 7.8 10.0 14.3 23.1 40.5
cse 16 72.9 89.6 53.5 38.9 27.1 30.7 49.4 86.9
dk14 7 50.9 50.9 21.0 8.9 11.4 16.4 26.4 46.3
dk15 4 21.9 21.9 6.7 7.8 10.0 14.3 23.1 40.5
dk16 27 70.2 142.2 54.5 30.0 11.4 16.4 26.4 46.3
dk17 8 16.7 16.7 5.7 6.7 8.5 12.3 19.8 34.8
dk512 15 17.6 18.4 6.7 7.8 10.0 14.3 23.1 40.5
donﬁle 24 40.4 142.2 54.5 34.5 8.5 12.3 19.8 34.8
ex1 20 105.4 169.5 81.3 65.6 55.5 63.5 85.7 150.6
ex4 14 18.4 18.4 19.1 15.6 18.5 26.6 42.8 75.3
ex6 8 50.0 72.9 29.6 18.9 15.7 22.5 36.3 63.7
keyb 19 98.3 151.9 62.1 41.1 31.3 34.8 46.1 81.1
mc 4 7.0 7.0 6.7 7.8 10.0 14.3 23.1 40.5
modulo12 12 10.5 10.5 4.8 5.6 7.1 10.2 16.5 29.0
planet 48 131.7 303.8 116.6 89.0 54.1 59.4 89.0 144.8
pma 24 72.0 72.0 74.6 50.0 34.2 36.9 49.4 86.9
s1 20 120.3 172.1 137.7 63.4 62.7 57.3 82.4 144.8
s1488 48 133.5 267.8 146.3 143.4 74.0 69.6 92.3 156.4
s1a 20 63.2 119.4 69.8 56.7 54.1 49.2 69.2 121.6
s208 18 33.4 48.3 33.5 15.6 12.8 16.4 23.1 40.5
s27 6 4.4 4.4 4.8 4.4 5.7 8.2 13.2 23.2
s386 13 36.9 56.2 33.5 21.1 18.5 24.6 39.6 69.5
s420 18 35.1 47.4 31.5 16.7 14.2 16.4 23.1 40.5
s510 47 47.4 116.8 90.8 38.9 25.6 26.6 42.8 75.3
s8 5 10.5 10.5 12.4 10.0 5.7 8.2 13.2 23.2
s820 25 80.8 215.1 88.0 68.9 64.1 65.5 98.9 173.8
sand 32 156.3 314.3 132.9 105.6 62.7 67.6 98.9 173.8
sse 16 43.9 59.7 37.3 22.2 21.4 24.6 39.6 69.5
styr 30 163.3 339.8 127.1 74.5 57.0 57.3 69.2 121.6
tbk 32 298.5 450.4 131.0 83.4 68.4 71.7 108.8 191.1
Table 10.22: MCNC FSM Benchmarks Area v/s Number of Contexts (Area Target)
Best Delay by Number of Contexts
Columns: FSM; States; Single Context; Dense Encodings with 1 2 4 8 16 32 64 Contexts
bbara 10 6 6 4 3 3 1 1 1
bbsse 16 4 6 6 4 3 3 3 3
bbtas 6 3 3 1 1 1 1 1 1
beecount 7 4 4 4 1 1 1 1 1
cse 16 6 8 6 5 5 2 2 2
dk14 7 4 4 4 1 1 1 1 1
dk15 4 12 12 1 1 1 1 1 1
dk16 27 5 4 5 8 1 1 1 1
dk17 8 6 6 1 1 1 1 1 1
dk512 15 2 10 1 1 1 1 1 1
donﬁle 24 2 4 6 5 1 1 1 1
ex1 20 7 4 7 7 5 4 2 2
ex4 14 7 7 3 2 1 1 1 1
ex6 8 5 8 6 4 1 1 1 1
keyb 19 7 4 7 5 4 5 4 4
mc 4 2 2 1 1 1 1 1 1
modulo12 12 6 6 1 1 1 1 1 1
planet 48 6 4 6 5 5 4 2 1
pma 24 6 6 8 5 5 3 2 2
s1 20 5 5 5 6 5 4 5 5
s1488 48 6 9 7 5 5 5 3 3
s1a 20 5 5 6 5 5 5 7 7
s208 18 4 4 6 3 3 2 1 1
s27 6 2 2 2 1 1 1 1 1
s386 13 5 5 5 4 3 2 2 2
s420 18 3 5 5 4 3 2 1 1
s510 47 5 6 5 3 3 1 1 1
s8 5 4 4 3 3 1 1 1 1
s820 25 6 5 7 7 6 4 3 3
sand 32 7 5 7 6 6 5 5 5
sse 16 4 6 6 4 3 3 3 3
styr 30 7 5 7 7 6 6 4 4
tbk 32 8 6 11 10 7 7 6 6
Table 10.23: MCNC FSM Benchmarks Delay v/s Number of Contexts (Area Target)
Best Area Ratio by Number of Contexts
Columns: FSM; States; Single Context; Dense Encodings with 1 2 4 8 16 32 64 Contexts
bbara 10 1.00 1.00 0.83 0.61 0.52 0.56 0.90 1.58
bbsse 16 1.00 1.36 0.85 0.51 0.49 0.56 0.90 1.58
bbtas 6 1.00 1.00 0.78 0.90 1.16 1.67 2.68 4.71
beecount 7 1.00 1.00 0.86 0.63 0.81 1.17 1.88 3.30
cse 16 1.00 1.23 0.73 0.53 0.37 0.42 0.68 1.19
dk14 7 1.00 1.00 0.41 0.17 0.22 0.32 0.52 0.91
dk15 4 1.00 1.00 0.30 0.35 0.45 0.65 1.05 1.85
dk16 27 1.00 2.02 0.78 0.43 0.16 0.23 0.38 0.66
dk17 8 1.00 1.00 0.34 0.40 0.51 0.74 1.19 2.08
dk512 15 1.00 1.05 0.38 0.44 0.57 0.82 1.31 2.31
donﬁle 24 1.00 3.52 1.35 0.85 0.21 0.30 0.49 0.86
ex1 20 1.00 1.61 0.77 0.62 0.53 0.60 0.81 1.43
ex4 14 1.00 1.00 1.04 0.84 1.00 1.44 2.32 4.08
ex6 8 1.00 1.46 0.59 0.38 0.31 0.45 0.72 1.27
keyb 19 1.00 1.54 0.63 0.42 0.32 0.35 0.47 0.82
mc 4 1.00 1.00 0.95 1.11 1.42 2.04 3.28 5.77
modulo12 12 1.00 1.00 0.45 0.53 0.68 0.97 1.56 2.75
planet 48 1.00 2.31 0.89 0.68 0.41 0.45 0.68 1.10
pma 24 1.00 1.00 1.04 0.70 0.47 0.51 0.69 1.21
s1 20 1.00 1.43 1.14 0.53 0.52 0.48 0.69 1.20
s1488 48 1.00 2.01 1.10 1.07 0.55 0.52 0.69 1.17
s1a 20 1.00 1.89 1.10 0.90 0.86 0.78 1.09 1.92
s208 18 1.00 1.45 1.00 0.47 0.38 0.49 0.69 1.22
s27 6 1.00 1.00 1.09 1.01 1.30 1.87 3.00 5.28
s386 13 1.00 1.52 0.91 0.57 0.50 0.67 1.07 1.88
s420 18 1.00 1.35 0.90 0.47 0.41 0.47 0.66 1.15
s510 47 1.00 2.46 1.92 0.82 0.54 0.56 0.90 1.59
s8 5 1.00 1.00 1.18 0.95 0.54 0.78 1.25 2.20
s820 25 1.00 2.66 1.09 0.85 0.79 0.81 1.22 2.15
sand 32 1.00 2.01 0.85 0.68 0.40 0.43 0.63 1.11
sse 16 1.00 1.36 0.85 0.51 0.49 0.56 0.90 1.58
styr 30 1.00 2.08 0.78 0.46 0.35 0.35 0.42 0.74
tbk 32 1.00 1.51 0.44 0.28 0.23 0.24 0.36 0.64
average 1.00 1.51 0.86 0.63 0.56 0.70 1.09 1.92
Table 10.24: MCNC FSM Benchmarks Area Ratio v/s Number of Contexts (Area Target)
Best Delay Reduction by Number of Contexts
FSM  States  Single-Context  Dense Encodings (contexts: 1 2 4 8 16 32 64)
bbara 10 0 0 2 3 3 5 5 5
bbsse 16 0 -2 -2 0 1 1 1 1
bbtas 6 0 0 2 2 2 2 2 2
beecount 7 0 0 0 3 3 3 3 3
cse 16 0 -2 0 1 1 4 4 4
dk14 7 0 0 0 3 3 3 3 3
dk15 4 0 0 11 11 11 11 11 11
dk16 27 0 1 0 -3 4 4 4 4
dk17 8 0 0 5 5 5 5 5 5
dk512 15 0 -8 1 1 1 1 1 1
donﬁle 24 0 -2 -4 -3 1 1 1 1
ex1 20 0 3 0 0 2 3 5 5
ex4 14 0 0 4 5 6 6 6 6
ex6 8 0 -3 -1 1 4 4 4 4
keyb 19 0 3 0 2 3 2 3 3
mc 4 0 0 1 1 1 1 1 1
modulo12 12 0 0 5 5 5 5 5 5
planet 48 0 2 0 1 1 2 4 5
pma 24 0 0 -2 1 1 3 4 4
s1 20 0 0 0 -1 0 1 0 0
s1488 48 0 -3 -1 1 1 1 3 3
s1a 20 0 0 -1 0 0 0 -2 -2
s208 18 0 0 -2 1 1 2 3 3
s27 6 0 0 0 1 1 1 1 1
s386 13 0 0 0 1 2 3 3 3
s420 18 0 -2 -2 -1 0 1 2 2
s510 47 0 -1 0 2 2 4 4 4
s8 5 0 0 1 1 3 3 3 3
s820 25 0 1 -1 -1 0 2 3 3
sand 32 0 2 0 1 1 2 2 2
sse 16 0 -2 -2 0 1 1 1 1
styr 30 0 2 0 0 1 1 3 3
tbk 32 0 2 -3 -2 1 1 2 2
average 0.00 -0.27 0.33 1.27 2.18 2.70 3.03 3.06
Table 10.25: MCNC FSM Benchmarks Delta Delay v/s Number of Contexts (Area Target)
Tables 10.26 through 10.30 show the benchmark set mapped to implementations with various numbers of contexts, with delay minimization as the target. All partitioning is performed along mustang state bits. For these results, we examined all possible state bits along which to split and chose the best set. On average across the benchmark set, the 8-context mappings are half the area of the best single-context case while achieving comparable delay.
Best Delay by Number of Contexts
FSM  States  Single-Context  Dense Encodings (contexts: 1 2 4 8 16 32 64)
bbara 10 3 4 3 3 3 1 1 1
bbsse 16 3 4 4 3 3 2 2 2
bbtas 6 2 3 1 1 1 1 1 1
beecount 7 2 3 2 1 1 1 1 1
cse 16 4 5 4 4 4 2 2 2
dk14 7 3 4 3 1 1 1 1 1
dk15 4 3 5 1 1 1 1 1 1
dk16 27 3 4 4 5 1 1 1 1
dk17 8 2 4 1 1 1 1 1 1
dk512 15 2 5 1 1 1 1 1 1
donﬁle 24 2 4 5 5 1 1 1 1
ex1 20 4 4 4 4 3 3 2 2
ex4 14 2 3 3 2 1 1 1 1
ex6 8 3 4 4 3 1 1 1 1
keyb 19 4 4 6 5 4 4 3 3
mc 4 2 2 1 1 1 1 1 1
modulo12 12 1 3 1 1 1 1 1 1
planet 48 4 4 5 3 3 3 2 1
pma 24 4 4 5 4 3 3 2 2
s1 20 4 5 5 4 4 3 3 3
s1488 48 4 5 4 7 4 3 3 2
s1a 20 3 5 6 5 5 4 4 4
s208 18 3 4 5 3 2 2 1 1
s27 6 2 2 2 1 1 1 1 1
s386 13 4 4 4 3 2 2 2 2
s420 18 3 5 5 3 2 2 1 1
s510 47 3 4 4 3 2 1 1 1
s8 5 2 2 3 3 1 1 1 1
s820 25 3 5 4 4 4 3 3 3
sand 32 4 5 5 4 4 4 3 3
sse 16 3 4 4 3 3 2 2 2
styr 30 5 5 4 4 4 4 3 3
tbk 32 5 6 7 6 6 5 4 4
Table 10.26: MCNC FSM Benchmarks Delay v/s Number of Contexts (Delay Target)
Best LUTs by Number of Contexts
FSM  States  Single-Context  Dense Encodings (contexts: 1 2 4 8 16 32 64)
bbara 10 40 33 24 12 8 6 6 6
bbsse 16 60 85 53 24 15 14 14 14
bbtas 6 9 7 5 5 5 5 5 5
beecount 7 19 18 12 7 7 7 7 7
cse 16 97 156 83 36 22 15 15 15
dk14 7 67 58 26 8 8 8 8 8
dk15 4 37 38 7 7 7 7 7 7
dk16 27 83 162 82 35 8 8 8 8
dk17 8 26 31 6 6 6 6 6 6
dk512 15 20 52 7 7 7 7 7 7
donﬁle 24 46 162 199 31 6 6 6 6
ex1 20 151 193 136 80 47 32 26 26
ex4 14 25 28 20 14 13 13 13 13
ex6 8 62 97 42 18 11 11 11 11
keyb 19 150 173 202 37 22 22 26 26
mc 4 8 8 7 7 7 7 7 7
modulo12 12 13 21 5 5 5 5 5 5
planet 48 172 346 202 93 38 30 27 25
pma 24 139 97 148 67 33 18 15 15
s1 20 195 196 144 80 56 40 30 30
s1488 48 183 455 264 99 57 35 28 28
s1a 20 107 136 73 51 38 33 30 30
s208 18 40 55 64 14 11 8 7 7
s27 6 5 5 5 4 4 4 4 4
s386 13 54 84 46 20 15 12 12 12
s420 18 40 54 33 20 11 8 7 7
s510 47 76 185 97 35 20 13 13 13
s8 5 13 13 13 9 4 4 4 4
s820 25 137 245 154 90 60 39 30 30
sand 32 224 358 219 130 67 61 43 43
sse 16 60 85 53 24 15 14 14 14
styr 30 285 387 211 129 60 34 23 23
tbk 32 510 513 676 353 71 52 42 42
Table 10.27: MCNC FSM Benchmarks LUTs v/s Number of Contexts (Delay Target)
Best Area [Mλ²] by Number of Contexts
FSM  States  Single-Context  Dense Encodings (contexts: 1 2 4 8 16 32 64)
bbara 10 35.1 29.0 22.9 13.3 11.4 12.3 19.8 34.8
bbsse 16 52.7 74.6 50.7 26.7 21.4 28.7 46.1 81.1
bbtas 6 7.9 6.1 4.8 5.6 7.1 10.2 16.5 29.0
beecount 7 16.7 15.8 11.5 7.8 10.0 14.3 23.1 40.5
cse 16 85.2 137.0 79.3 40.0 31.3 30.7 49.4 86.9
dk14 7 58.8 50.9 24.9 8.9 11.4 16.4 26.4 46.3
dk15 4 32.5 33.4 6.7 7.8 10.0 14.3 23.1 40.5
dk16 27 72.9 142.2 78.4 38.9 11.4 16.4 26.4 46.3
dk17 8 22.8 27.2 5.7 6.7 8.5 12.3 19.8 34.8
dk512 15 17.6 45.7 6.7 7.8 10.0 14.3 23.1 40.5
donﬁle 24 40.4 142.2 190.2 34.5 8.5 12.3 19.8 34.8
ex1 20 132.6 169.5 130.0 89.0 66.9 65.5 85.7 150.6
ex4 14 21.9 24.6 19.1 15.6 18.5 26.6 42.8 75.3
ex6 8 54.4 85.2 40.2 20.0 15.7 22.5 36.3 63.7
keyb 19 131.7 151.9 193.1 41.1 31.3 45.1 85.7 150.6
mc 4 7.0 7.0 6.7 7.8 10.0 14.3 23.1 40.5
modulo12 12 11.4 18.4 4.8 5.6 7.1 10.2 16.5 29.0
planet 48 151.0 303.8 193.1 103.4 54.1 61.4 89.0 144.8
pma 24 122.0 85.2 141.5 74.5 47.0 36.9 49.4 86.9
s1 20 171.2 172.1 137.7 89.0 79.7 81.9 98.9 173.8
s1488 48 160.7 399.5 252.4 110.1 81.2 71.7 92.3 162.2
s1a 20 93.9 119.4 69.8 56.7 54.1 67.6 98.9 173.8
s208 18 35.1 48.3 61.2 15.6 15.7 16.4 23.1 40.5
s27 6 4.4 4.4 4.8 4.4 5.7 8.2 13.2 23.2
s386 13 47.4 73.8 44.0 22.2 21.4 24.6 39.6 69.5
s420 18 35.1 47.4 31.5 22.2 15.7 16.4 23.1 40.5
s510 47 66.7 162.4 92.7 38.9 28.5 26.6 42.8 75.3
s8 5 11.4 11.4 12.4 10.0 5.7 8.2 13.2 23.2
s820 25 120.3 215.1 147.2 100.1 85.4 79.9 98.9 173.8
sand 32 196.7 314.3 209.4 144.6 95.4 124.9 141.7 249.1
sse 16 52.7 74.6 50.7 26.7 21.4 28.7 46.1 81.1
styr 30 250.2 339.8 201.7 143.4 85.4 69.6 75.8 133.2
tbk 32 447.8 450.4 646.3 392.5 101.1 106.5 138.4 243.3
Table 10.28: MCNC FSM Benchmarks Area v/s Number of Contexts (Time Target)
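The areas in Table 10.28 follow the LUT counts in Table 10.27 closely. As a rough check, the sketch below assumes that each LUT costs a fixed active area plus a memory increment per context. The two constants are fit to the table entries rather than quoted from the text, and the name `dpga_area` is ours.

```python
# Assumed constants, fit to Tables 10.27 and 10.28 (not stated in the text here).
ACTIVE_AREA = 0.800    # M lambda^2 of active LUT and interconnect per LUT
CONTEXT_AREA = 0.078   # M lambda^2 of context memory per LUT per context


def dpga_area(luts, contexts):
    """Estimated array area in M lambda^2 for `luts` LUTs with `contexts` contexts."""
    return luts * (ACTIVE_AREA + CONTEXT_AREA * contexts)


# (benchmark, contexts, LUTs from Table 10.27, area from Table 10.28)
samples = [
    ("bbara", 64, 6, 34.8),
    ("dk14", 4, 8, 8.9),
    ("tbk", 8, 71, 101.1),
]

for name, contexts, luts, table_area in samples:
    model = dpga_area(luts, contexts)
    print(f"{name:6s} c={contexts:2d} model={model:6.1f} table={table_area:6.1f}")
```

Under this assumed model the context memory overtakes the active area at roughly ten contexts, which is consistent with the area minimum falling near 8 contexts for many of the benchmarks.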
Best Delay Reduction by Number of Contexts
FSM  States  Single-Context  Dense Encodings (contexts: 1 2 4 8 16 32 64)
bbara 10 0 0 0 0 0 0 0 0
bbsse 16 0 -1 -1 0 0 1 1 1
bbtas 6 0 -1 1 1 1 1 1 1
beecount 7 0 -1 0 1 1 1 1 1
cse 16 0 -1 0 0 0 2 2 2
dk14 7 0 -1 0 2 2 2 2 2
dk15 4 0 -2 2 2 2 2 2 2
dk16 27 0 0 0 0 0 0 0 0
dk17 8 0 -2 1 1 1 1 1 1
dk512 15 0 -3 1 1 1 1 1 1
donﬁle 24 0 -2 -3 -3 1 1 1 1
ex1 20 0 0 0 0 1 1 2 2
ex4 14 0 -1 -1 0 1 1 1 1
ex6 8 0 -1 -1 0 2 2 2 2
keyb 19 0 0 -2 -1 0 0 1 1
mc 4 0 0 1 1 1 1 1 1
modulo12 12 0 -2 0 0 0 0 0 0
planet 48 0 0 -1 1 1 1 2 3
pma 24 0 0 -1 0 1 1 2 2
s1 20 0 -1 -1 0 0 1 1 1
s1488 48 0 -1 0 -3 0 1 1 2
s1a 20 0 -2 -3 -2 -2 -1 -1 -1
s208 18 0 -1 -2 0 1 1 2 2
s27 6 0 0 0 1 1 1 1 1
s386 13 0 0 0 1 2 2 2 2
s420 18 0 -2 -2 0 1 1 2 2
s510 47 0 -1 -1 0 1 2 2 2
s8 5 0 0 -1 -1 1 1 1 1
s820 25 0 -2 -1 -1 -1 0 0 0
sand 32 0 -1 -1 0 0 0 1 1
sse 16 0 -1 -1 0 0 1 1 1
styr 30 0 0 1 1 1 1 2 2
average 0.00 -0.91 -0.48 0.06 0.64 0.91 1.15 1.21
Table 10.29: MCNC FSM Benchmarks Delta Delay v/s Number of Contexts (Delay Target)
Best Area Ratio by Number of Contexts
FSM  States  Single-Context  Dense Encodings (contexts: 1 2 4 8 16 32 64)
bbara 10 1.00 0.83 0.65 0.38 0.32 0.35 0.56 0.99
bbsse 16 1.00 1.42 0.96 0.51 0.41 0.54 0.88 1.54
bbtas 6 1.00 0.78 0.60 0.70 0.90 1.30 2.09 3.66
beecount 7 1.00 0.95 0.69 0.47 0.60 0.86 1.38 2.43
cse 16 1.00 1.61 0.93 0.47 0.37 0.36 0.58 1.02
dk14 7 1.00 0.87 0.42 0.15 0.19 0.28 0.45 0.79
dk15 4 1.00 1.03 0.21 0.24 0.31 0.44 0.71 1.25
dk16 27 1.00 1.95 1.08 0.53 0.16 0.22 0.36 0.64
dk17 8 1.00 1.19 0.25 0.29 0.37 0.54 0.87 1.52
dk512 15 1.00 2.60 0.38 0.44 0.57 0.82 1.31 2.31
donﬁle 24 1.00 3.52 4.71 0.85 0.21 0.30 0.49 0.86
ex1 20 1.00 1.28 0.98 0.67 0.50 0.49 0.65 1.14
ex4 14 1.00 1.12 0.87 0.71 0.84 1.21 1.95 3.43
ex6 8 1.00 1.56 0.74 0.37 0.29 0.41 0.67 1.17
keyb 19 1.00 1.15 1.47 0.31 0.24 0.34 0.65 1.14
mc 4 1.00 1.00 0.95 1.11 1.42 2.04 3.28 5.77
modulo12 12 1.00 1.62 0.42 0.49 0.62 0.90 1.44 2.54
planet 48 1.00 2.01 1.28 0.68 0.36 0.41 0.59 0.96
pma 24 1.00 0.70 1.16 0.61 0.39 0.30 0.41 0.71
s1 20 1.00 1.01 0.80 0.52 0.47 0.48 0.58 1.01
s1488 48 1.00 2.49 1.57 0.69 0.51 0.45 0.57 1.01
s1a 20 1.00 1.27 0.74 0.60 0.58 0.72 1.05 1.85
s208 18 1.00 1.38 1.74 0.44 0.45 0.47 0.66 1.15
s27 6 1.00 1.00 1.09 1.01 1.30 1.87 3.00 5.28
s386 13 1.00 1.56 0.93 0.47 0.45 0.52 0.83 1.47
s420 18 1.00 1.35 0.90 0.63 0.45 0.47 0.66 1.15
s510 47 1.00 2.43 1.39 0.58 0.43 0.40 0.64 1.13
s8 5 1.00 1.00 1.09 0.88 0.50 0.72 1.16 2.03
s820 25 1.00 1.79 1.22 0.83 0.71 0.66 0.82 1.44
sand 32 1.00 1.60 1.06 0.74 0.49 0.64 0.72 1.27
sse 16 1.00 1.42 0.96 0.51 0.41 0.54 0.88 1.54
styr 30 1.00 1.36 0.81 0.57 0.34 0.28 0.30 0.53
tbk 32 1.00 1.01 1.44 0.88 0.23 0.24 0.31 0.54
average 1.00 1.45 1.05 0.59 0.50 0.62 0.95 1.67
Table 10.30: MCNC FSM Benchmarks Area Ratio v/s Number of Contexts (Delay Target)
Figure 10.28: Memory-based Implementation for Simple FSM Example [figure labels: Dout, Acyc, myAddr, Read state, 32x3 Memory]
10.6.4 Comparison with Memory-based FSM Implementations
A memory and a state register can also be used to implement ﬁnite-state machines. The
data inputs and current state are packed together and used as addresses into the memory, and the
memory outputs serve as machine outputs and next state outputs. Figure 10.28 shows a memory
implementation of our simple FSM example from Figures 10.24 and 10.25.
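The memory-based scheme can be sketched in a few lines: the current state and inputs are packed into an address, and the word read out supplies the next state and the outputs. The example machine (a modulo-3 counter with an enable input) and all function names are hypothetical illustrations, not the FSM of Figures 10.24 and 10.25.

```python
def state_bits(num_states):
    """Bits needed for a dense state encoding."""
    return max(1, (num_states - 1).bit_length())


def build_memory(transition, num_states, input_bits):
    """Tabulate transition(state, inputs) -> (next_state, outputs) for every address."""
    memory = [None] * (1 << (state_bits(num_states) + input_bits))
    for state in range(num_states):
        for inputs in range(1 << input_bits):
            address = (state << input_bits) | inputs   # pack state and inputs
            memory[address] = transition(state, inputs)
    return memory


def run(memory, input_bits, input_sequence, state=0):
    """Clock the state register once per input word and collect the outputs."""
    outputs = []
    for inputs in input_sequence:
        state, out = memory[(state << input_bits) | inputs]
        outputs.append(out)
    return state, outputs


# Hypothetical example machine: a modulo-3 counter with one enable input
# that raises its output on the cycle where it wraps around.
def mod3_counter(state, enable):
    nxt = (state + enable) % 3
    return nxt, int(enable == 1 and nxt == 0)


memory = build_memory(mod3_counter, num_states=3, input_bits=1)
final_state, outs = run(memory, 1, [1, 1, 0, 1, 1])
print(final_state, outs)
```

Note that the memory has 8 words even though only 6 state and input combinations are used; this rounding up to a power of two is the gap between the exact and integral memory sizes compared later in this section.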
Used for finite-state machines, the DPGA is a hybrid between a purely gate (FPGA) implementation and a purely memory implementation. The DPGA takes advantage of the memory to realize smaller, state-specific logic than an FPGA, which must implement all logic simultaneously. The DPGA uses the restructurable interconnect in the array to implement next-state and output computations out of gates. As we noticed in Section 4.5, the gate implementation allows the DPGA to exploit regularities in the computational task. In this case, we avoid the necessarily exponential area increase associated with additional inputs in a memory implementation, the linear-log increase associated with additional states, and the linear increase associated with additional outputs.
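Assuming we could build just the right sized memory for a given FSM, the area would be:
$$A_{mem} = 2^{N_{in}} \cdot N_{states} \cdot \left( \lceil \log_2 (N_{states}) \rceil + N_{out} \right) \cdot A_{cell}$$
where $N_{in}$ is the number of inputs, $N_{states}$ the number of states, $N_{out}$ the number of outputs, and $A_{cell}$ the area of one memory bit.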
Table 10.31 summarizes the areas of the best memory-based FSM implementations along with the areas for FPGA and 8-context DPGA implementations. The “Min area” column indicates the area assuming a memory of exactly the right size is used, while the “Memory Area” field shows the area for the smallest memory with an integral number of address bits as shown in the “organization” column. When the total number of state bits and input bits is less than 11, the optimal memory implementations can be much smaller than the FPGA or DPGA implementation. Above 11 input and state bits, the DPGA implementation is smaller. Since the DPGA implementation size increases with task complexity rather than number of inputs, while the memory implementation is exponential in the number of inputs and state bits, the disparity grows substantially as the number of inputs and state bits increases.
Columns: FSM, states, ins, outs, Min area [Mλ²], integral memory Organization (addresses × data bits), integral Memory area [Mλ²], FPGA area [Mλ²], 8-ctx DPGA area [Mλ²]
bbtas 6 2 2 0.1 2^5×5 0.2 6.1 7.1
dk15 4 3 5 0.3 2^5×7 0.3 21.9 10.0
dk17 8 2 3 0.2 2^5×6 0.2 16.7 8.5
dk512 15 1 3 0.3 2^5×7 0.3 17.6 10.0
mc 4 3 5 0.3 2^5×7 0.3 7.0 10.0
modulo12 12 1 1 0.1 2^5×5 0.2 10.5 7.1
beecount 7 3 4 0.5 2^6×7 0.5 12.3 10.0
dk14 7 3 5 0.5 2^6×8 0.6 50.9 11.4
dk16 27 2 3 1.0 2^7×8 1.3 70.2 11.4
donfile 24 2 1 0.7 2^7×6 0.9 40.4 8.5
s27 6 4 1 0.5 2^7×4 0.6 4.4 5.7
s8 5 4 1 0.4 2^7×4 0.6 10.5 5.7
bbara 10 4 2 1.2 2^8×6 1.8 21.9 11.4
ex6 8 5 8 3.4 2^8×11 3.4 50.0 15.7
ex4 14 6 9 14.0 2^10×13 16.0 18.4 18.5
bbsse 16 7 7 27.0 2^11×11 27.0 43.9 21.4
cse 16 7 7 27.0 2^11×11 27.0 72.9 27.1
tbk 32 6 3 19.7 2^11×8 19.7 298.5 68.4
sse 16 7 7 27.0 2^11×11 27.0 43.9 21.4
s386 13 7 7 22.9 2^11×11 27.0 36.9 18.5
keyb 19 7 2 20.4 2^12×7 34.4 98.3 31.3
planet 48 7 19 184.3 2^13×25 245.8 131.7 54.1
pma 24 8 8 95.8 2^13×13 127.8 72.0 34.2
s1 20 8 6 67.6 2^13×11 108.1 120.3 62.7
s1a 20 8 6 67.6 2^13×11 108.1 63.2 54.1
ex1 20 9 19 294.9 2^14×24 471.9 105.4 55.5
s1488 48 8 19 368.6 2^14×25 491.5 133.5 74.0
styr 30 9 10 276.5 2^14×15 294.9 163.3 57.0
s208 18 11 2 309.7 2^16×7 550.5 33.4 12.8
sand 32 11 9 1101.0 2^16×14 1101.0 156.3 62.7
s820 25 18 19 188743.7 2^23×24 241591.9 80.8 64.1
s420 18 19 2 79272.3 2^24×7 140928.6 35.1 14.2
s510 47 19 7 384408.9 2^25×13 523449.1 47.4 25.6
N.b.: benchmarks are ordered by the sum of the number of inputs and densely encoded state bits
Table 10.31: Memory Implementations for MCNC FSM Benchmarks
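The memory sizes in Table 10.31 can be reproduced with a short calculation. The sketch below assumes dense state encoding, one memory word per state and input combination for the exact size, and a power-of-two word count for the integral size. The per-bit area `CELL_AREA` is an assumed value chosen because it matches the table entries; it is not quoted from the text, and the function names are ours.

```python
CELL_AREA = 1.2e-3  # M lambda^2 per memory bit; assumed, chosen to match Table 10.31


def state_bits(num_states):
    """Bits needed for a dense state encoding."""
    return max(1, (num_states - 1).bit_length())


def min_memory_area(states, ins, outs):
    """Area of a memory with exactly one word per (state, input) combination."""
    words = (1 << ins) * states
    return words * (state_bits(states) + outs) * CELL_AREA


def integral_memory(states, ins, outs):
    """(address bits, data bits, area) for the smallest power-of-two memory."""
    addr = ins + state_bits(states)
    data = state_bits(states) + outs
    return addr, data, (1 << addr) * data * CELL_AREA


for name, states, ins, outs in [("ex4", 14, 6, 9), ("planet", 48, 7, 19)]:
    addr, data, area = integral_memory(states, ins, outs)
    exact = min_memory_area(states, ins, outs)
    print(f"{name}: min {exact:.1f}, 2^{addr} x {data} -> {area:.1f}")
```

Each additional input or state bit doubles the word count, which is why the memory areas for s820, s420, and s510 exceed the DPGA areas by more than three orders of magnitude.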
10.6.5 Areas for Improvement
Timing In this section, we assumed that the context read occurred in series with execution within the target context and state. It is possible to overlap context reads with execution by using a more sophisticated FSM mapping model. On a state transition, instead of reading a context with the target state logic, we read a context with all the logic for any state which may follow the target state. This can be viewed as speculatively fetching just the set of logic which may be needed by the time it has been read a cycle later. Using this scheme, we can reduce the time to the actual delay in the context rather than the context delay plus the read time. For heavily branching FSMs, the target logic will often have to include more state logic per context with this style of mapping than with the simple division described here. As we see here, including more state logic increases the delay, so it is not immediately obvious which case will generally have superior performance.
Partitioning For the partial temporal partitioning above, we partitioned strictly along mustang state bits. This is likely to give less than optimal partitions since mustang’s cost model is aimed at multi-level logic implementations. It assumes all logic must be available at once and is not trying to maximize the independence among state groups. A more sophisticated mapping would go back to the original state-transition graph and partition states explicitly to minimize the logic required in each partition. Informally, the goal would be to group states with similar logic together and separate states performing disparate logical functions.
10.6.6 General Technique
While demonstrated in the context of FSMs, the basic technique used here is fairly general. When we can predict which portions of a netlist, circuit, or computational task are needed at a given point in time, we can generate a more specialized design which only includes the required logic. The specialized design is often smaller and faster than the fully general design. With a multicontext component, we can use the contexts to hold many specialized variants of a design, selecting them as needed.
In the synthesis of computations, it is common to synthesize a controller along with datapaths or computational elements. The computations required are generally different from state to state. Traditional, spatial implementations must have hardware to handle all the computations or generalize a common datapath to the point where it can be used for all the computations. Multicontext devices can segregate different computational elements into different contexts and select them only as needed.
For example, in both dbC [GM93] and PRISM-II [AWG94] a general datapath capable of
handling all of the computational subtasks in computation is synthesized alongside the controller.
At any point in time, however, only a small portion of the functionality contained in the datapath is
actually needed and enabled by the controller. The multicontext implementation would be smaller
by folding the disparate logic into memory and reusing the same active logic and interconnect to
implement them as needed.
10.7 Additional Application Styles
10.7.1 Multifunction Components
With multiple, on-chip contexts, a device may be loaded with several different functions, any of
which is immediately accessible with minimal overhead. A DPGA can thus act as a “multifunction
peripheral,” performing distinct tasks without idling for long reconﬁguration intervals. In a system
such as the one shown in Figure 10.7, a single device may perform several tasks. When used
as a reconﬁgurable accelerator for a processor (e.g. [AS93] [DeH94] [RS94]) or to implement a
dynamic processor (e.g. [WH95]), the DPGA can support multiple loaded acceleration functions
simultaneously. The DPGA is more efﬁcient in these allocations than single-context FPGAs
because it allows rapid reuse of resources without paying the large idle-time overheads associated
with reconﬁguration from off-chip memory.
In a data storage or transmission application, for instance, one may be limited by the network or disk bandwidth. A single device may be loaded with functions to perform:
• (De)compression
• Cryptographic (e.g. DES) encoding and decoding
• ECC calculation, error detection, and correction
The device would then be called upon to perform the required tasks as needed.
Within a CAD application, such as espresso [RSV87], one needs to perform several distinct operations at different times, each of which could be accelerated with reconfigurable logic. We could load the DPGA with assist functions, such as:
• ASCII decoding (e.g. [Raz94])
• Bitvector manipulation
• Find first one (e.g. [AS93])
• Hamming distance calculation (e.g. [AS93])
Since these tasks are needed at distinct times, they can easily be stacked in separate contexts. Contexts are selected as the program needs these functions. To the extent that function usage is interleaved, the on-chip context configurations reduce the reload idle time which would be required to share a conventional device among as diverse a set of functions.
10.7.2 Utility Functions
Some classes of functionality are needed occasionally but not continuously. In conventional systems, to get the functionality at all, we have to dedicate wire or gate capacity to such functions, even though they may be used very infrequently. A variety of data loading and unloading tasks fit into this “infrequent use” category, including:
• Data offload
    – Debugging snapshot
    – Testing observability
    – Fault recovery snapshot
    – Context data offload
• Data onload
    – Configuration setting
    – Value initialization
    – Debugging value injection
    – Testing accessibility
    – Fault recovery
    – Context data reload (after coarse-grain context switch)
• Operation idle/enable
    – Conditional operation
    – Exception handling
    – Stall
In a multicontext DPGA, the resources to handle these infrequent cases can be relegated to a separate context, or contexts, from the “normal” case code. The wires and control required to shift data in (or out) and load it are allocated for use only when the respective utility context is selected. The operative circuitry, then, does not contend with the utility circuitry for wiring channels or switches, and the utility functions do not complicate the operative logic. In this manner, the utility functions can exist without increasing critical path delay during operation.
A relaxation algorithm, for instance, might operate as follows:
1. Load in starting point and boundary conditions
2. Calculate relaxation updates
3. Check for convergence, return to 2 if not converged
4. Ofﬂoad result
Each of these operations may be separate contexts. The relaxation computation may even be
spread over several contexts. This general operation style, where inputs and outputs are distinct
and infrequent phases of operation, is common for many kinds of operations (e.g. multi-round
encryption, hashing, searching, and many optimization problems).
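The four relaxation steps above can be modeled as a controller that selects one context at a time. The sketch below is a software analogy only: each "context" is a Python function that returns the name of the next context to select, and the 1-D averaging problem, the dictionary layout, and all names are hypothetical.

```python
def load(s):
    """Context 1: load the starting point and boundary conditions."""
    s["grid"] = [0.0] * s["n"]
    s["grid"][-1] = 1.0            # boundaries: 0 on the left, 1 on the right
    return "update"


def update(s):
    """Context 2: one relaxation sweep (each interior point becomes the mean of its neighbors)."""
    g = s["grid"]
    new = [g[0]] + [0.5 * (g[i - 1] + g[i + 1]) for i in range(1, len(g) - 1)] + [g[-1]]
    s["delta"] = max(abs(a - b) for a, b in zip(g, new))
    s["grid"] = new
    return "check"


def check(s):
    """Context 3: convergence test; go back to the update context if not converged."""
    return "offload" if s["delta"] < s["tol"] else "update"


def offload(s):
    """Context 4: offload the result."""
    s["result"] = list(s["grid"])
    return None


CONTEXTS = {"load": load, "update": update, "check": check, "offload": offload}


def run(s):
    """Select contexts one at a time until the offload context finishes."""
    trace, ctx = [], "load"
    while ctx is not None:
        trace.append(ctx)
        ctx = CONTEXTS[ctx](s)
    return trace


state = {"n": 8, "tol": 1e-9}
trace = run(state)
print(trace[:4], "...", trace[-1], len(trace))
```

The load and offload contexts each run once, while the update and check contexts alternate many times, which matches the observation that input and output are distinct and infrequent phases.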
10.7.3 Temporally Systolic Computations
Figure 10.29 shows a typical video coding pipeline (e.g. [JOSV95]). In a conventional FPGA implementation, we would lay this pipeline out spatially, streaming data through the pipeline. If we needed the throughput capacity offered by the most heavily pipelined spatial implementation, that would be the design of choice. However, if we needed less throughput, the spatially pipelined version would require the same space while underutilizing the silicon. In this case, a DPGA implementation could stack the pipeline stages in time. The DPGA can execute a number of cycles on one pipeline function, then switch to another context and execute a few cycles on the next pipeline function (See Figure 10.30). In this manner, the lower throughput requirement could be translated directly into lower device requirements.
This is the same basic organizational scheme used for levelized logic evaluation (Section 10.5.1). The primary difference is that evaluation levels are divided according to application subtasks.
This is a general schema with broad application. The pipeline design style is quite familiar and can be readily adapted for multicontext implementation. The amount of temporal pipelining can be varied as throughput requirements change or technology advances. As silicon feature sizes shrink, primitive device bandwidth increases. Operations with fixed bandwidth requirements can increasingly be compressed into more temporal and less spatial evaluation.
Figure 10.29: Canonical Video Coding Pipeline [figure labels: Motion Estimation, Transformation, Quantization, Coding]
Figure 10.30: Temporally Systolic Video Coding Pipeline [figure labels: Motion Estimation, Transformation, Quantization, Coding, Time, context switch]
When temporally pipelining functions, the data flowing between functional blocks can be transmitted in time rather than space. This saves routing resources by bringing the functionality to the data rather than routing the data to the required functional block. By packing more functions onto a chip, temporal pipelining can also help avoid component I/O bandwidth bottlenecks between function blocks.
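As a software analogy for stacking pipeline stages in time, the sketch below runs every stage on one shared resource, executing a short burst of items on one stage before switching to the next. The stage functions are arithmetic stand-ins rather than real video coding steps, and the names `run_temporally` and `burst` are hypothetical.

```python
def run_temporally(stages, items, burst):
    """Run all pipeline stages on one shared resource, a burst of items at a time.

    buffers[i] holds data waiting for stage i; the last buffer collects results.
    Data stays in place between stages while the active function changes.
    """
    buffers = [list(items)] + [[] for _ in stages]
    switches = 0
    while any(buffers[:-1]):
        for i, stage in enumerate(stages):
            if not buffers[i]:
                continue
            switches += 1                       # select the context for stage i
            for _ in range(min(burst, len(buffers[i]))):
                buffers[i + 1].append(stage(buffers[i].pop(0)))
    return buffers[-1], switches


# Stand-in stage functions (placeholders for the four pipeline functions).
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
items = list(range(6))

results, switches = run_temporally(stages, items, burst=2)

# Reference: the same stages applied as a purely spatial pipeline.
spatial = []
for x in items:
    for stage in stages:
        x = stage(x)
    spatial.append(x)

print(results == spatial, switches)
```

A larger `burst` lowers the number of context switches at the cost of larger buffers between stages; the results are identical to the spatial pipeline in either case.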
Figure 10.31: Control Distribution on DPGA Prototype [figure labels: Sub Array, Context Select]
10.8 Control
In the prototype DPGA (Section 10.4), we had a single, array-wide control thread, the context
select line, which was driven from off-chip. In general, as we noted in Chapter 8, the array may
be segregated into regions controlled by a number of distinct control threads. Further, in many
applications it will be beneﬁcial to control execution on-chip – perhaps even from portions of the
array, itself.
10.8.1 Segregation
In the prototype, subarrays were used to organize local interconnect. The subarray level of array decomposition can also be used to segregate independent control domains. As shown in Figure 10.31, the context select lines were simply buffered and routed to each subarray. The actual decoding to control memories in the prototype occurred in the local decode block. We can control the subarrays independently by providing distinct sets of control lines for each subarray, or for groups of subarrays.
Figure 10.32: Multiple Controllers - Hardwired Control [figure labels: Sub Array, Control One, Control Two, Control Three]
10.8.2 Distribution
Hardwired Control In the simplest case, separate control streams can be physically assigned to each subarray or subarray group. For example, Figure 10.32 shows a 3×3 subarray design with a separate control stream for each column.
Configurable Control Alternatively, multiple control streams can be physically routed to each subarray, with local configuration used to select the appropriate one for use. For example, Figure 10.33 shows a 3×3 subarray design with three controllers where each subarray can be configured to select any of the three control streams. The configurable control may even be integrated with the configurable interconnection network.
Metaconﬁguration In scenarios where array control can be conﬁgured it will often be necessary
to have a separate level of conﬁguration from the array itself. This meta-level conﬁguration is used
to deﬁne the sources for control data and perhaps control distribution paths. It does not change
from cycle-to-cycle as does regular array conﬁguration data. The MATRIX design described in
Chapter 13 deals explicitly with this kind of a multi-level conﬁguration scheme.
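The difference between hardwired and configurable control distribution can be sketched as a selection table. In the sketch below, `select` plays the role of the meta-level configuration: it is fixed while the control streams change from cycle to cycle. The stream contents, the table layout, and the function name are hypothetical.

```python
def control_for_cycle(streams, select, cycle):
    """Return the context each subarray sees on `cycle`.

    streams: one context sequence per controller (indexed by cycle, repeating).
    select:  meta-configuration; select[r][c] names the stream used by subarray (r, c).
    """
    return [[streams[s][cycle % len(streams[s])] for s in row] for row in select]


streams = [
    [0, 1, 2, 3],   # controller one
    [0, 0, 1, 1],   # controller two
    [3, 2, 1, 0],   # controller three
]

# Hardwired-style assignment: one stream per column of a 3x3 array.
by_column = [[0, 1, 2]] * 3
# Configurable assignment: any subarray may select any of the three streams.
custom = [[0, 0, 1], [0, 2, 1], [2, 2, 1]]

print(control_for_cycle(streams, by_column, cycle=1))
print(control_for_cycle(streams, custom, cycle=1))
```

The hardwired case is simply one particular selection table that cannot be changed after fabrication.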
Figure 10.33: Multiple Controllers - Configurable Control [figure labels: Sub Array, Control One, Control Two, Control Three]
10.8.3 Source
Off-Chip The control stream can be sourced from off-chip, as in the prototype, providing considerable flexibility to the application or system. Off-chip control, however, implies additional latency in the control path and the additional cost of a separate controller component. It also requires that precious i/o pins be dedicated to control rather than data. Tasks which benefit from rapid feedback between data in the computation stream and the control stream are hindered by the data control path, which must cross first off-chip then back on-chip.
Local Dedicated Controller A dedicated, programmable controller can be integrated on-chip to manage the control stream. The controller could come in the form of a simple counter, a programmable PLA, a basic microcontroller, or a core microprocessor. Integrated on-chip, it has low latency and high bandwidth to the array and avoids consuming i/o pins. In order to integrate such controllers on chip, we must decide how much space to dedicate to them, how many separate controllers to provide, and what form the controller will take. Recall from Section 8.5 that we would like to match the number of control streams with the needs of the application, but we cannot do that if the controllers must be allocated prior to fabrication.
Figure 10.34: Array Self Control Example [figure labels: Sub Array, Controller]
Feedback Self Control For mapped FSMs (Section 10.6), we saw that it was beneficial to route some of the design outputs back into the control port (e.g. Figure 10.24). As noted above, this entails some integration of the reconfigurable network and the control distribution path.
Self Control We can also build the controller out of FPGA/DPGA logic. The controller generally implements an FSM. It would be plausible, then, to allocate one or more subarrays to build a controller which is, in turn, used to control the other subarrays on the component. With this scheme, we can partition the array and build just as many controllers as are required for the task at hand. Figure 10.34 shows a case where two subarrays are used to build the controller, which is then responsible for controlling the rest of the array.
10.9 Conclusions
Conventional FPGAs underutilize silicon in most situations. When throughput requirements
are limited, task latency is important, or when the computation required varies with time or the
data being processed, conventional designs leave much of the active LUTs and interconnect idle
for most of the time. Since the area to hold a conﬁguration is small compared to the active LUT
and interconnect area, we can generally meet task throughput or latency requirements with less
implementation area using a multicontext component.
In this section we introduced the DPGA, a bit-level computational array which stores several
instructions or conﬁgurations along with each active computing element. We described a com-
plete prototype of the architecture and some elementary design automation schemes for mapping
traditional circuits and FSMs onto these multicontext architectures.
We showed how to automatically map conventional tasks onto these multicontext devices. For
latency limited and low throughput tasks, a 4-context DPGA implementation is, on average, 20-40%
smaller than an FPGA implementation. For FSMs, the 4-context DPGA implementation is generally
30-40% smaller than the FPGA implementation, while the 8-context DPGA implementation is
40-50% smaller. Signal retiming requirements are the primary limitation which prevents the DPGA
from realizing greater savings on circuit benchmarks, so it is worthwhile to consider architectural
changes to support retiming in a less expensive manner. We will look at one such modiﬁcation in
the next chapter. All of these results are based on a context-memory area to active compute and
interconnect area ratio of 1:10. The smaller the context memories can be implemented relative to
the ﬁxed logic, the greater the reduction in implementation area we can achieve with multicontext
devices.
For hand-mapped designs or coarse-grain interleaving, the potential area savings are much greater. With the 1:10 context memory to active ratio, the 4-context DPGA can be one-third the size of an FPGA supporting the same number of LUTs. An 8-context DPGA can be 20% of the size of the FPGA. Several of the automatically mapped FSM examples come close to achieving these area reductions.
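The one-third and 20% figures follow from the 1:10 ratio by simple arithmetic. The sketch below assumes ideal folding, where every physical LUT serves one logical LUT in each of its contexts, and normalizes the active area per LUT to 1; the function name is ours.

```python
def relative_size(contexts, memory_to_active=0.1):
    """DPGA area divided by FPGA area for the same number of logical LUTs.

    Assumes ideal folding: each physical LUT serves `contexts` logical LUTs.
    Areas are in units of the active (LUT plus interconnect) area of one LUT.
    """
    fpga_per_lut = 1.0 + memory_to_active                        # active area + one configuration
    dpga_per_lut = (1.0 + contexts * memory_to_active) / contexts
    return dpga_per_lut / fpga_per_lut


for c in (1, 2, 4, 8):
    print(c, f"{relative_size(c):.3f}")
```

With 4 contexts the ratio is 1.4/4 divided by 1.1, or about 0.32, and with 8 contexts it is 1.8/8 divided by 1.1, or about 0.20, in line with the figures quoted above.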
11. Dynamically Programmable Gate Arrays with Input Registers
In Chapter 10 we noticed that retiming requirements often prevented us from realizing as significant a reduction in active LUTs as should be possible. As a result of retiming, we often had to dedicate active LUTs simply to pass data through intermediate contexts. Retiming requirements also created a saturation level below which no further reduction in active LUTs was possible even if we were willing to take more time or add more context memories.
In this chapter we introduce input registers to the simple DPGA model used in the previous chapter. These input registers allow us to store values which need to traverse LUT evaluation levels in memories rather than having them consume active resources during the period of time in which they are being retimed. This addition reduces the retiming limit we encountered in the previous chapter.
We introduce input registers to the base DPGA architecture (Section 11.1) and expand our computing device model accordingly (Section 11.2). Section 11.3 provides a basic example of the benefits of adding input registers. We expand our experimental, multicontext mapping software from the previous chapter to handle input registers (Section 11.4) and examine the aggregate results of mapping circuit benchmarks to these devices. In Section 11.5, we briefly relate the input register model used in this chapter to potential alternatives. At the end of this chapter (Section 11.7) we review the key points about multicontext devices as developed over the last several chapters.
11.1 Input Registers
We established in Chapter 7 that most of the active area in conventional FPGAs goes into
interconnect. When a signal must cross multiple succeeding contexts between the producer and
the ﬁnal consumer, in the existing model, we must dedicate precious, active routing resources to
the signal for all intervening contexts. Note that this property is essentially true of single context
FPGAs, as well. If a value is produced early in some critical path, but not consumed until several
LUT delays later, the wires and switches between the producer and consumer are tied up holding
the value for the entire time. Tying up switches and wires to transport a value in time is a poor use
of a scarce resource.
The conventional model results from storing values in registers on the output of each computa-
tional element (See Figures 11.1 and 11.2). With this arrangement, we must hold the value on the
output and tie up switches and wires between the producer and the consumer until such time as the
ﬁnal consumer has used the value. Since values are produced at different times, and several values
from different sources must converge at a consuming LUT in order for it to produce its output
value, this gives rise to the situation where switches and wires are forced to sit idle holding values
for much longer than the time it takes for them to transport the values from their sources to their
destinations.
The alternative is to move the value registers to the inputs of the computational elements (See Figure 11.3). In the simplest case, this means having four flip-flops on the input of each 4-LUT rather than one flip-flop on the output. This modification allows us to move the data from the producer to the consumer in the minimum transit time, a time independent of when the consumer will actually use the data. We now tie up space in a register to perform the retiming function rather than tying up all the wires and interconnect required to route the value from producers to consumers. Since the register can be much smaller than the intervening interconnect, this results in a tighter implementation.
Figure 11.1: FPGA Array Element
Figure 11.2: DPGA Array Element [figure label: context select]
Figure 11.3: DPGA Array Element with Input Registers
Conceptually, the key idea here is that signal transport and retiming are two different functions:
1. Spatial Transport – moves data in space – route data from source to destination
2. Temporal Transport (Retiming) – moves data in time – make data available at some later
time when it is actually required
Bysegregatingthemechanismsweuseforthesetwofunctions,wecanoptimizethemindependently
and achieve a tighter implementation.
We can view this multicontext progression as successively relaxing the strict interconnect
requirements for this class of devices:
In a traditional, single-context FPGA we must have enough wires and switches to simulta-
neously route all the connections in the entire task description graph.
In a prototype-style DPGA as described in the previous section, we must have enough LUT
outputs, switches, and wires to carry one temporal slice through the computation.
In a DPGA with input registers, in the extreme, we need only a single wire. More wires
facilitate more parallelism in transport and hence higher throughput and lower latency im-
plementations, but are not required for functionality.
11.2 iDPGA Model
A DPGA with input registers (iDPGA) associates an i-bit long shift register with each LUT
input in addition to the c instructions per active LUT. The LUT instruction tells the LUT which of
the values on the shift register to actually select on each cycle. Each LUT input can thus retime
a value by up to i cycles. That is, values may arrive at the destination LUT up to i clock cycles
before they are consumed. Figure 11.4 shows a possible iDPGA array element with 4 contexts and
an input register with depth 3.
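As an illustration of the input structure just described, the following is a minimal behavioral sketch, not a description of the actual circuit. The class and function names, the tap numbering (tap 0 is the newest value), and the example truth table are all assumptions made for this example.

```python
from collections import deque


class InputShiftRegister:
    """Hypothetical model of one iDPGA LUT input with an i-deep shift register."""

    def __init__(self, depth):
        # taps[0] is the newest captured value, taps[depth - 1] the oldest.
        self.taps = deque([0] * depth, maxlen=depth)

    def clock(self, wire_value):
        """Capture the value on the input wire; the oldest value falls off."""
        self.taps.appendleft(wire_value)

    def select(self, tap):
        """Return the value captured `tap` cycles before the newest one."""
        return self.taps[tap]


def lut4(truth_table, bits):
    """Evaluate a 4-LUT: the four input bits index a 16-entry truth table."""
    index = sum(bit << position for position, bit in enumerate(bits))
    return truth_table[index]


# Four inputs of depth 3; the operands of a 4-input AND arrive on different cycles.
inputs = [InputShiftRegister(3) for _ in range(4)]
for wires in [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 1)]:
    for register, value in zip(inputs, wires):
        register.clock(value)

taps = (2, 1, 0, 0)  # per-input tap choice, as carried by the LUT instruction
bits = [register.select(tap) for register, tap in zip(inputs, taps)]
result = lut4([0] * 15 + [1], bits)
```

The wires are only driven for the one cycle in which each value is in transit; the registers, not the interconnect, hold the early arrivals until the LUT evaluates.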
The input registers do place a restriction on the grouping of logical LUTs into physical LUTs
which was not present in the original DPGA. Multiple LUTs cannot have inputs arriving at the
same input position on the same cycle. Fortunately, LUT input permutability often allows us to
rearrange the inputs to avoid such potential conﬂicts. Nonetheless, the restriction does complicate
LUT placement.
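The additional resources required for this model are i additional register cells for each input
and one i:1 multiplexor for each input. For a k-LUT, the area then is:
$$A_{iDPGA} = 800\mathrm{K}\lambda^2 + c \cdot 78\mathrm{K}\lambda^2 + k \cdot i \cdot \left(4\mathrm{K}\lambda^2 + 2.5\mathrm{K}\lambda^2\right) \qquad (11.1)$$
Composing areas for a 4-LUT, we have:
$$A_{iDPGA}(k = 4) = 800\mathrm{K}\lambda^2 + c \cdot 78\mathrm{K}\lambda^2 + i \cdot 26\mathrm{K}\lambda^2 \qquad (11.2)$$
Figure 11.4: iDPGA Array Element (c = 4, i = 3)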
Note here that we assume the total number of context description bits does not change. Rather,
the bits that indicate which of the inputs to select are bits which have been shufﬂed from spatial
routing to temporal routing. That is, this scheme reduces the spatial interconnect requirements by
performing temporal retiming in these registers. We are assuming that the bits are shufﬂed from
one task to another without any signiﬁcant change in the overall number of bits required.
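The area model of this section can be sketched as a small calculation. This is an illustrative reading of the constants quoted above, assuming the area composes as a fixed active part, plus 78Kλ² per context, plus a register cell (4Kλ²) and a multiplexor share (2.5Kλ²) per LUT input per unit of input depth. The constant and function names are hypothetical.

```python
A_BASE = 800_000  # active LUT and interconnect area, in lambda^2
A_CTX = 78_000    # instruction memory per context, in lambda^2
A_REG = 4_000     # one input register cell, in lambda^2 (assumed reading)
A_MUX = 2_500     # multiplexor area per selectable register, in lambda^2 (assumed reading)


def idpga_lut_area(c, i, k=4):
    """Area of one k-LUT array element with c contexts and input depth i."""
    return A_BASE + c * A_CTX + k * i * (A_REG + A_MUX)


def implementation_area(physical_luts, c, i, k=4):
    """Total area of an implementation that needs `physical_luts` array elements."""
    return physical_luts * idpga_lut_area(c, i, k)


# For a 4-LUT, each unit of input depth costs 4 * (4K + 2.5K) = 26K lambda^2.
per_depth_4lut = idpga_lut_area(1, 1) - idpga_lut_area(1, 0)

# 81 physical LUTs at c = 3, i = 3 (the alu2 entry in Table 11.1).
alu2_area_m = implementation_area(81, 3, 3) / 1e6
```

Under these assumptions `alu2_area_m` rounds to 90.1, which agrees with the corresponding entry in Table 11.2.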
11.3 Example
Recall from Section 10.1 that our ASCII Hex→binary circuit could be mapped to three contexts,
but could not, viably, be mapped to fewer contexts. By adding the i-input register as suggested
above, the active LUT requirements continue to decline with throughput reductions. Figure 11.5
shows this same circuit mapped with varying input register depth. As the number of input registers
increases from 1 to 4, the saturation point reduces from 7 active LUTs to 4. Using our area model
from the previous section, the i = 4, c = 6 iDPGA is 5.5Mλ², or over 3× smaller than the single
context FPGA implementation at 18.4Mλ² and over 2× smaller than the smallest DPGA without
input registers at 12.5Mλ².
Figure 11.5: ASCII Hex→Binary Implementation versus Contexts and Input Register Depth
(Two panels: LUTs versus Contexts, and Area in [Mλ²] versus Contexts, for 1 to 10 contexts.
Curves: FPGA, DPGA (i=1), iDPGA (i=2), iDPGA (i=3), iDPGA (i=4).)
11.4 Circuit Benchmarks: Input Depth
To examine the merits of input registers, we return to our throughput optimized circuit bench-
marks as we originally visited in Section 10.5.3 for DPGAs. We use the same MCNC circuit
benchmark set and the same input netlists as synthesized and mapped by sis and Chortle.
Again, since we are assuming here that the target criteria is throughput, both sis and Chortle
netlists were synthesized in area mode. As before, no modiﬁcations to the mapping and netlist
generation are made.
11.4.1 Mapping
As before, we divide the multi-context case into separate spatial pipeline stages such that the
path length between pipeline registers is equal to the acceptable period between results. The LUTs
within a phase are then evaluated in multicontext fashion using the available contexts. The main
difference from Section 10.5.3 is the cost metric for retiming. Since each LUT can retime up to
i cycles, we only charge for retiming registers every i temporal stages between the original source
and the final destination.
When we do need to place retiming registers, they are placed in a stylized fashion. Starting from
the final consumer, we walk back through the circuit toward the primary inputs, placing a retiming
repeater LUT every i-th stage. In practice, we often have much more freedom in the placement of
retiming registers, but this freedom was not exploited in our experimental mapping tools. During
the annealing step, whenever the ﬁnal consumer for a particular value is moved, the retiming chain
is stripped out and replaced based on the consumer’s new location.
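The stylized repeater placement can be sketched as follows. This is a minimal illustration, not the experimental mapping tool: the function name and the integer stage numbering are assumptions, and we assume a gap of exactly i stages can be absorbed by the consumer's own input register without a repeater.

```python
def place_repeaters(producer_stage, consumer_stage, depth):
    """Return the temporal stages that receive a retiming repeater LUT.

    Walks back from the final consumer toward the producer, placing one
    repeater every `depth` stages until the remaining gap fits within a
    single input register.
    """
    if depth < 1:
        raise ValueError("input register depth must be at least 1")
    repeaters = []
    stage = consumer_stage - depth
    while stage > producer_stage:
        repeaters.append(stage)
        stage -= depth
    return repeaters


# A value produced in stage 0 and consumed in stage 10, with input depth 3.
chain = place_repeaters(0, 10, 3)
```

When annealing moves the final consumer, a tool built this way would simply discard `chain` and call the function again with the new consumer stage.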
After all levelization has been done, a grouping pass is performed. The grouping pass attempts
to group together logical LUTs within a spatial partition to reside on one physical LUT. For a
group of LUTs to be compatible, it must be possible to permute the LUTs’ inputs such that no two
LUTs require a different value to arrive on the same input on the same clock cycle. Rather than
trying all $(4!)^{c-1}$ permutations, we use a randomized, greedy placement scheme. We randomly
pick which input in a LUT to place ﬁrst, then greedily place it in a non-conﬂicting location. Other
inputs within a LUT are placed sequentially after the initial random selection. The compatibility
routine will make several attempts to ﬁnd a satisfying assignment before declaring the grouping
incompatible.
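The compatibility test can be sketched as below. This is an illustrative simplification under stated assumptions: each logical LUT is a list of (signal, arrival cycle) pairs, a conflict means two different signals needing the same input position on the same cycle, and the greedy step takes the first non-conflicting position. All names are hypothetical.

```python
import random


def try_assign(luts, positions=4, rng=random):
    """One randomized greedy attempt; returns per-LUT position maps or None."""
    occupied = {}  # (position, cycle) -> signal already claimed there
    assignment = []
    for lut in luts:
        start = rng.randrange(len(lut)) if lut else 0
        order = lut[start:] + lut[:start]  # random first input, rest in sequence
        used = set()   # positions already taken by this LUT's own inputs
        placed = {}
        for signal, cycle in order:
            for position in range(positions):
                if position in used:
                    continue
                if occupied.get((position, cycle), signal) != signal:
                    continue  # a different value already arrives here this cycle
                placed[(signal, cycle)] = position
                used.add(position)
                break
            else:
                return None  # no non-conflicting position for this input
        for (signal, cycle), position in placed.items():
            occupied[(position, cycle)] = signal
        assignment.append(placed)
    return assignment


def compatible(luts, attempts=20, seed=0):
    """Declare a group compatible if any of several random attempts succeeds."""
    rng = random.Random(seed)
    return any(try_assign(luts, rng=rng) is not None for _ in range(attempts))


lut_a = [("a", 0), ("b", 0), ("c", 0), ("d", 0)]
lut_b_same_cycle = [("e", 0), ("f", 0), ("g", 0), ("h", 0)]
lut_b_next_cycle = [("e", 1), ("f", 1), ("g", 1), ("h", 1)]
```

Because the search is randomized and bounded, a `False` result only means no assignment was found within the attempt budget, which is why such a test can reject groups that are in fact compatible.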
Grouping is performed independently on each spatial partition. The grouping routine starts by
packing all the logical LUTs in a spatial partition into the minimum number of physical LUTs – i.e.
the number of physical LUTs required to implement the largest temporal stage. The attempt is made
by first randomly assigning logical LUTs to physical LUTs, then randomly selecting logical LUTs to
swap in order to reduce incompatibility conflicts. Swaps which do not increase the incompatibilities
in the grouping are greedily accepted. Swapping continues until a satisfying set of groupings is
found or the swapping runs longer than a predetermined time limit which is proportional to the
number of logical LUTs which can be described in the spatial partition. When packing fails, we
increment the number of target physical LUTs and retry packing.
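The packing loop described above might look like the following sketch. It is not the experimental tool: the function names, the swap budget constant, and the toy compatibility predicate (at most one logical LUT per temporal stage in a group) are assumptions chosen to keep the example self-contained.

```python
import random


def pack_partition(luts, min_physical, compatible, swaps_per_lut=50, seed=0):
    """Group logical LUTs onto physical LUTs; returns a list of groups."""
    rng = random.Random(seed)
    target = max(1, min_physical)
    while target <= max(1, len(luts)):
        shuffled = list(luts)
        rng.shuffle(shuffled)
        groups = [shuffled[g::target] for g in range(target)]  # random assignment
        bad = sum(not compatible(group) for group in groups)
        budget = swaps_per_lut * len(luts)  # limit proportional to the LUT count
        while bad and budget and target > 1:
            budget -= 1
            g1, g2 = rng.sample(range(target), 2)
            if not groups[g1] or not groups[g2]:
                continue
            a = rng.randrange(len(groups[g1]))
            b = rng.randrange(len(groups[g2]))
            groups[g1][a], groups[g2][b] = groups[g2][b], groups[g1][a]
            new_bad = sum(not compatible(group) for group in groups)
            if new_bad <= bad:
                bad = new_bad  # keep swaps that do not increase incompatibilities
            else:
                groups[g1][a], groups[g2][b] = groups[g2][b], groups[g1][a]
        if bad == 0:
            return groups
        target += 1  # packing failed: allow one more physical LUT and retry
    raise ValueError("no compatible grouping found")


def one_per_stage(group):
    """Toy predicate: a physical LUT evaluates one logical LUT per temporal stage."""
    stages = [stage for _, stage in group]
    return len(stages) == len(set(stages))


logical = [("a", 0), ("b", 0), ("c", 1), ("d", 1)]
packed = pack_partition(logical, 2, one_per_stage)
```

In a fuller version, the predicate passed in would be the input-permutation compatibility test rather than this toy stage check.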
In review, circuit mapping proceeds through the following steps:
1. Technology Independent Optimization (sis)
2. LUT Mapping (Chortle)
3. Spatial and Temporal Levelization (simulated annealing)
4. Physical LUT Grouping (greedy swapping with heuristic compatibility veriﬁcation)
alu2 at 4 clocks/result throughput
LUTs by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
(1)      240     207     161     161     161     161     161     161
 2               149     104     104     104     104     104     104
 3                        81      81      81      81      81      81
 4                                81      81      81      81      81
 5                                        79      79      79      79
 6                                                79      79      79
 7                                                        78      78
 8                                                                78
Table 11.1: Total Physical LUTs Required to Implement alu2 Benchmark

alu2 at 4 clocks/result throughput
Area in Mλ² by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
(1)    210.7   208.7   174.8   187.4   200.0   212.5   225.1   237.6
 2             150.2   112.9   121.1   129.2   137.3   145.4   153.5
 3                      90.1    96.4   102.7   109.0   115.3   121.7
 4                              98.5   104.8   111.1   117.4   123.8
 5                                     104.3   110.4   116.6   122.8
 6                                             112.5   118.7   124.8
 7                                                     119.2   125.3
 8                                                             127.3
Table 11.2: Total Area Required to Implement alu2 Benchmark
11.4.2 Detailed Example: alu2
Table 11.1 shows the total LUTs required after retiming and packing for the alu2 benchmark
mapped to provide a throughput of one result every four LUT delays. The table shows mappings
for various values of c and i. We constrain i ≤ c in the current mapping software, so there are
no configurations with i > c. Up to i = 3, we see that each additional input register allows us to
further reduce the total number of physical LUTs required in the implementation. Table 11.2 uses
the area model from Section 11.2 to translate the LUT counts into areas, and Table 11.3 shows the
area savings versus a traditional FPGA implementation (c = 1). The c = 3, i = 3 iDPGA
implementation is smallest at 43% of the area of the FPGA implementation.
Figure 11.6 shows the area of the family of alu2 implementations as a function of context (c)
and input (i) depth. Figure 11.7 plots the areas as ratios versus the FPGA implementation. The first
couple of input registers (i goes from 1→2 and 2→3) show significant gains for this benchmark.
Gains diminish for greater input register depth. The best implementations are one-third the size of
the FPGA implementation.

alu2 at 4 clocks/result throughput
Ratio by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
 1     1.000   0.991   0.830   0.889   0.949   1.009   1.068   1.128
 2             0.713   0.536   0.575   0.613   0.652   0.690   0.729
 3                     0.428   0.458   0.487   0.517   0.547   0.578
 4                             0.467   0.497   0.527   0.557   0.588
 5                                     0.495   0.524   0.553   0.583
 6                                             0.534   0.563   0.592
 7                                                     0.566   0.595
 8                                                             0.604
Table 11.3: Area Ratios for alu2 Benchmark Implementation
Figure 11.6: alu2 Implementation Area versus Throughput
(Eight panels, C=1 through C=8. Each plots Area in [Mλ²], 0 to 900, against Throughput Target
(cycles/result), with curves for i=1 through i=8.)
Figure 11.7: alu2 Area Ratios versus Throughput
(Eight panels, C=1 through C=8. Each plots Ratio A_iDPGA/A_FPGA, 0.0 to 2.0, against Throughput
Target (cycles/result), with curves for i=1 through i=8.)
Average Ratio at 1 clock/result throughput
Ratio by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
 1     1.000   1.148   1.237   1.326   1.415   1.503   1.592   1.681
 2             1.148   1.237   1.326   1.415   1.503   1.592   1.681
 3                     1.267   1.355   1.444   1.533   1.622   1.711
 4                             1.385   1.474   1.563   1.651   1.740
 5                                     1.503   1.592   1.681   1.770
 6                                             1.622   1.711   1.800
 7                                                     1.740   1.829
 8                                                             1.859
Table 11.4: Average Ratios for Benchmark Set
11.4.3 Average Characteristics
Figure 11.8 shows the average area ratios across the entire benchmark set (See Table 10.13)
analogously to Figure 11.7. We see here that an input register depth of four provides almost all of
the benefits of input registers, with most of the benefit realized by a depth of three, as we saw with
the alu2 case in the previous section.
Figure 11.9 plots area versus throughput for various context depths (c), at a single value
for input depth (i). Here, i was chosen to give the best results for low throughputs. For lower
throughput values, the 5-8 context cases differ by only 10%. At the extreme of 20 clocks per result,
the c = 8, i = 6 case is 33.7% the size of the single context case, versus the c = 5, i = 4 case which
is 37.6%.
Tables 11.4 through 11.11 record implementation area ratio for all values of c and i. Each
table reports implementation areas for a different fixed throughput target in analogy with Table 11.3.
For the maximum throughput of one result per LUT delay, the traditional, single-context FPGA
provides the best implementation. For all other cases, the multicontext implementations are always
smaller than the single-context implementation. With a LUT-cycle delay in the 7-9.5ns range, even
today's "high" throughput implementations in the 30-50MHz range are producing new results only
once every 3-5 LUT delays. At these speeds 3-4 context devices are 40-50% smaller than the single
context implementation. At lower throughputs, the multiple context implementations are almost
one-third the size of the single-context implementation on average.
Figure 11.8: Average Area Ratios versus Throughput
(Eight panels, C=1 through C=8. Each plots Ratio A_iDPGA/A_FPGA, 0.0 to 2.0, against Throughput
Target (cycles/result), with curves for i=1 through i=8.)
Figure 11.9: Average Area Ratios versus Contexts and Throughput
(Single panel plotting Ratio A_iDPGA/A_FPGA, 0.0 to 1.6, against Throughput Target (cycles/result).
Curves: c=1,i=1; c=2,i=2; c=3,i=3; c=4,i=4; c=5,i=4; c=6,i=5; c=7,i=5; c=8,i=6.)
Average Ratio at 2 clocks/result throughput
Ratio by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
 1     1.000   1.108   1.194   1.279   1.365   1.451   1.536   1.622
 2             0.680   0.733   0.785   0.838   0.890   0.943   0.996
 3                     0.749   0.801   0.854   0.907   0.959   1.012
 4                             0.827   0.880   0.933   0.986   1.039
 5                                     0.897   0.951   1.004   1.057
 6                                             0.960   1.013   1.066
 7                                                     1.036   1.089
 8                                                             1.117
Table 11.5: Average Ratios for Benchmark Set
Average Ratio at 3 clocks/result throughput
Ratio by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
 1     1.000   1.054   1.129   1.211   1.292   1.373   1.454   1.535
 2             0.695   0.690   0.739   0.789   0.838   0.888   0.937
 3                     0.538   0.576   0.613   0.651   0.689   0.727
 4                             0.597   0.635   0.674   0.712   0.750
 5                                     0.648   0.686   0.725   0.763
 6                                             0.686   0.723   0.761
 7                                                     0.751   0.789
 8                                                             0.790
Table 11.6: Average Ratios for Benchmark Set
Average Ratio at 4 clocks/result throughput
Ratio by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
 1     1.000   1.001   0.886   0.949   1.013   1.076   1.140   1.204
 2             0.680   0.530   0.560   0.598   0.635   0.673   0.710
 3                     0.459   0.481   0.513   0.544   0.576   0.607
 4                             0.461   0.491   0.520   0.550   0.579
 5                                     0.504   0.534   0.564   0.594
 6                                             0.529   0.558   0.587
 7                                                     0.586   0.616
 8                                                             0.616
Table 11.7: Average Ratios for Benchmark Set
Average Ratio at 5 clocks/result throughput
Ratio by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
 1     1.000   0.972   0.960   0.938   1.002   1.065   1.129   1.191
 2             0.680   0.580   0.570   0.635   0.675   0.714   0.755
 3                     0.488   0.460   0.484   0.514   0.543   0.573
 4                             0.430   0.434   0.460   0.486   0.513
 5                                     0.432   0.457   0.482   0.508
 6                                             0.461   0.485   0.511
 7                                                     0.487   0.512
 8                                                             0.523
Table 11.8: Average Ratios for Benchmark Set
Average Ratio at 6 clocks/result throughput
Ratio by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
 1     1.000   0.956   0.972   0.941   0.917   0.975   1.033   1.091
 2             0.643   0.600   0.561   0.514   0.539   0.571   0.603
 3                     0.493   0.447   0.401   0.424   0.449   0.473
 4                             0.422   0.390   0.394   0.416   0.439
 5                                     0.396   0.386   0.408   0.429
 6                                             0.379   0.400   0.421
 7                                                     0.422   0.444
 8                                                             0.451
Table 11.9: Average Ratios for Benchmark Set
Average Ratio at 10 clocks/result throughput
Ratio by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
 1     1.000   0.825   0.844   0.862   0.896   0.928   0.959   0.993
 2             0.616   0.571   0.553   0.558   0.576   0.565   0.589
 3                     0.501   0.458   0.445   0.444   0.450   0.460
 4                             0.446   0.406   0.397   0.398   0.413
 5                                     0.410   0.394   0.398   0.394
 6                                             0.392   0.388   0.400
 7                                                     0.408   0.413
 8                                                             0.416
Table 11.10: Average Ratios for Benchmark Set
Average Ratio at 20 clocks/result throughput
Ratio by Number of Contexts (c)
 i         1       2       3       4       5       6       7       8
 1     1.000   0.758   0.765   0.784   0.821   0.861   0.904   0.950
 2             0.581   0.518   0.500   0.510   0.525   0.544   0.566
 3                     0.448   0.425   0.418   0.419   0.427   0.442
 4                             0.400   0.376   0.369   0.372   0.380
 5                                     0.380   0.355   0.346   0.349
 6                                             0.364   0.356   0.337
 7                                                     0.358   0.343
 8                                                             0.355
Table 11.11: Average Ratios for Benchmark Set
11.4.4 Area for Improvement
As noted previously (Sections 10.5.2 and 10.5.3), netlist mapping is oblivious of the final
temporal implementations. The allocation of temporal and spatial pipeline stages is more rigid than
strictly necessary. As we noted above (Section 11.4.1), retiming LUTs are inserted in a stylized
fashion which is not likely to be optimal. Compatibility testing is stochastic and may declare many
compatible LUT groups incompatible. Consequently, tighter packing of LUTs is likely with more
sophisticated mapping tools.
11.5 Other Input Retiming Models
Register File: LSM, YSE, and VEGA all use a register file to hold data values between production
and consumption like most processors. These machines are also targeted at significantly more
sequentialization (1K-8K contexts). Consequently, they manage to use only a single port into
the register file. The register file organization has a more general access pattern since any value
can be written to any memory location and read from any location to any output. The generality
avoids packing compatibility restrictions, allowing data to be packed more tightly into memories.
However, the more general access is also significantly more expensive to support; LSM and YSE
replicate the entire memory bank, storing four copies of every data value, in order to achieve four
read ports. The restriction to a single write port is a simplification which these machines use in
order to make the register file implementation viable.
Time Matching: Instead of shifting data through a continually advancing shift register, we can
make each of the input registers take its value from the input line and load it at a specified time. In
this scheme, the input registers hold a finite number of values ( ), but are not limited to only the
last  values. Such a scheme would require a unit to match input times, making each input larger
than the iDPGA, but the increased range and packing density relaxes timing constraints on data
arrival, which is useful for simplifying the task of physical mapping. This is the scheme used by
TSFPGA and it will be explored more fully in the following chapter.
11.6 Summary
Typical tasks require two different kinds of data transport – spatial transport to move data
from the processing element that generated it to the ones which will consume it and temporal
transport to take data from the time when it is generated to the times when it is consumed. It is
inefﬁcient to tie up expensive, spatial transport resources such as wires and switches, to perform a
temporal transport task. Tasks such as circuit evaluation have sufﬁcient requirements for temporal
transport that input retiming registers are clearly a worthwhile architectural feature to include in
a multicontext device. Implementations with multiple retiming registers are more compact than
implementations with no additional retiming resources.
As with multiple contexts, the extent to which we can save area with deep input registers
depends on the area ratio between the active interconnect and the retiming registers. Here, we
assumed the ratio between active area and instruction area was 10:1 (800Kλ²:78Kλ²), as in the
previous chapter. We assumed the ratio between the active area and context area including both
instruction and retiming was roughly 8:1 (800Kλ²:104Kλ²). At these ratios, 4-5 context iDPGA
implementations were, on average, half to one-third the size of the single context alternative.
The best implementation varies with target throughput. At these size ratios, the c = 4,
i = 4 case is moderately good across throughput ranges. It is only worse than the single context
implementation at the highest throughput, and is within 20% of the best implementation at the
lowest throughput measured here.
11.7 Review
In the development since Chapter 7, we have seen that the area required to implement a
general-purpose computational task is composed of four parts:
1. Active interconnect area
2. Active computational processing element area
3. Task description (instruction storage) area
4. Intermediate value storage for temporal retiming
While traditional FPGA architectures have a one-to-one mapping between these components, this
resource ratio is neither necessary nor efficient. We further saw that active interconnect area is, by
far, the largest single component of this area, while task description and value storage areas are
small in comparison.
For a given computational task, we saw that the requirements for each of these four parts arise
from different sources. The number of instructions required to describe the task and number of
intermediates held during computation arise from the basic computational task, itself. The size
of the active interconnect and processing are dictated by the task’s target throughput. For the
highest possible throughput, the conventional FPGA strategy of allocating a single instruction to
each piece of active interconnect and processing is an efﬁcient allocation of resources. However,
as throughput requirements drop below this extreme, multicontext implementations compress the
implementation into less space by sharing and reusing a smaller number of active resources. This
sharing increases the ratio of instructions and intermediates to active resources. DPGAs are the
practical implementation of such a sharing scheme, assigning multiple instructions and multiple
intermediate values to each active resource.
Note that the amount of compressibility we achieve with DPGAs is critically dependent upon
how small we can make the non-active residue. That is, when we remove active interconnect and
processing elements, we are left with the instruction and the intermediate values. The amount
of area savings we can realize depends on how much smaller the space to hold instructions and
intermediates is than the space for the active area necessary to actually process the instruction and
its data. It is this effect which motivates our interests in reducing the number of bits used to describe
each instruction (Section 7.8) and in reducing the area required to store those bits (e.g. DRAM
context implementations in the DPGA prototype – Section 10.4).
It is also worthwhile to note that the style of compression used in the last two chapters (Chap-
ters 10 and 11), makes instructions and data readily accessible and is largely independent of task
structure. While densely encoded instructions need some decoding, each instruction is encoded
separately so that it can be stored locally and used immediately upon being read. If we are willing
to pay additional access latency and work with variable size encodings, block and structure-based
encoding schemes can be used, making it possible to compress the instruction requirements
further. Ultimately, the minimum task description area will depend on the descriptive complexity
of the task (See Section 8.4). Exploiting structure, such as data widths, operation commonality,
and task recurrence, requires more general instruction distribution datapaths and more sequential
decoding of task instructions. Nonetheless, variants on these techniques may be valuable in further
compressing instruction and data residues and hence reducing task implementation size.
12. Time-Switched Field Programmable Gate Arrays
We established in Chapter 7 that active interconnect area consumed most of the space on traditional,
single-context FPGAs. In Chapter 10, we saw that adding small, local, context memories allowed
us to reuse active area and achieve smaller task implementations. Even in these multicontext
devices, we saw that interconnect consumed most of the area (Section 10.4.2). In Chapter 11, we
added architectural registers for retiming and saw more clearly the way in which multiple context
evaluation saves area primarily by reducing the need for active interconnect. In this chapter,
we describe the Time-Switched Field Programmable Gate Array (TSFPGA), a multicontext device
designed explicitly around the idea of time-switching the internal interconnect in order to implement
more effective connectivity with less physical interconnect.
One issue which we have not addressed in the previous sections is the complexity of physical
mapping and, consequently, the time it takes to perform said mapping. Because of the computational
complexity, physical mapping time can often be the primary performance bottleneck in the
edit-compile-debug cycle. It can also be the primary obstacle to achieving acceptable mapping time for
large arrays and multi-chip systems.
In particular, when the physical routing network provides limited interconnectivity between
LUTs, it is necessary to carefully map logical LUTs to physical LUTs in accordance with both
netlist connectivity and interconnect connectivity. The number of ways we can map a set of logical
LUTs to a set of physical LUTs is exponential in the number of mapped LUTs, making the
search for an acceptable mapping which simultaneously satisfies the netlist connectivity constraints
and the limited physical interconnect constraints – i.e. physical place and route – computationally
difficult. Finding an optimal mapping is generally an NP-complete problem. Consequently, in
traditional FPGAs, this mapping time can be quite large. It often takes hours to place and route
designs with a couple of thousand LUTs. The computational complexity arises from two features
of the mapping problem:
1. As noted in Section 11.1, traditional FPGAs must have enough routing resources to physically
route all task connections simultaneously.
2. Since interconnect is the dominant area in FPGAs (Chapter 7), conventional FPGAs try to
use as little interconnect as feasible to provide high computational density.
The result is a large set of simultaneous constraints which must be satisﬁed during mapping,
making the task of physical mapping computationally intensive. TSFPGA virtually eliminates the
simultaneous constraint satisfaction required to successfully route a component, making it possible
to rapidly map tasks to the array. Simultaneous constraint satisfaction is still necessary to achieve
the highest performance mappings on TSFPGA, but is not necessary to achieve any mapping. This
gives the device user control over mapping time and quality.
This chapter details a complete TSFPGA design including:
1. Time-switched input register
2. Techniques used by TSFPGA to avoid constraints
3. Sample interconnect model for time-switched routing
4. Complete gate-array architecture built around:
(a) time-switched input register
(b) switched interconnect
(c) pipelined interconnect
5. Area and time estimates for TSFPGA building blocks
6. Experimental, quick mapping software
7. Mapped benchmark results using experimental software and a sample design point
TSFPGA was developed jointly by Derrick Chen and André DeHon. Derrick worked out VLSI
implementation and layout issues, while André developed the architecture and mapping tools.
12.1 Time-Switched Input Registers
As noted in Section 11.1, if all retiming can be done in input registers, only a single wire
is strictly needed to successfully route the task. The simple input register model used for the
previous chapter had limited temporal range and hence did not quite provide this generality. In this
section, we introduce an alternative input strategy which extends the temporal range on the inputs
without the linear increase in input retiming size which we saw with the shift-register based input
microarchitecture in the previous chapter.
The trick we employ here is to have each logical input load its value from the active interconnect
at just the right time. As we have seen, multicontext evaluation typically involves execution of a
series of microcycles. A subset of the task is evaluated on each microcycle, and only that subset
requires active resources in each microcycle. We call each microcycle a timestep and, conceptually
at least, number them from zero up to the total number of microcycles required to complete the task.
If we broadcast the current timestep, each input can simply load its value when its programmed
load time matches the current timestep.
Figure 12.1 shows a 4-LUT with this input arrangement which we will call the Time-Switched
Input Register. Each LUT input can load any value which appears on its input line in any of the
last cycles. The timestep value is log2 bits wide, as is the comparator. With this scheme, if
the entire computation completes in  timesteps, all retiming is accomplished by simply loading the
LUT inputs at the appropriate time – i.e. loading each input just when its source value has been
produced and spatially routed to the destination input. Since the hardware resources required for
this scheme are only logarithmic in the total number of timesteps, it may be reasonable to make
large enough to support most all desirable computations.
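The load-on-match behavior of the time-switched input register can be sketched behaviorally as follows. This is an illustration only: the class and function names are hypothetical, and we assume the comparator width is the ceiling of log2 of the number of timesteps.

```python
class TimeSwitchedInput:
    """Hypothetical model of one LUT input that loads at a programmed timestep."""

    def __init__(self, load_time):
        self.load_time = load_time  # programmed timestep at which to capture
        self.value = 0

    def clock(self, timestep, wire_value):
        # The comparator: capture only when the broadcast timestep matches.
        if timestep == self.load_time:
            self.value = wire_value


def timestep_bits(total_timesteps):
    """Bits needed to distinguish the timesteps (ceiling of log2, at least 1)."""
    return max(1, (total_timesteps - 1).bit_length())


# Four LUT inputs whose source values pass by on timesteps 0, 2, 2 and 5.
lut_inputs = [TimeSwitchedInput(t) for t in (0, 2, 2, 5)]
traffic = [
    (1, 0, 0, 0),  # timestep 0
    (0, 0, 0, 0),  # timestep 1
    (0, 1, 1, 0),  # timestep 2
    (0, 0, 0, 0),  # timestep 3
    (0, 0, 0, 0),  # timestep 4
    (0, 0, 0, 1),  # timestep 5
]
for timestep, wires in enumerate(traffic):
    for lut_input, value in zip(lut_inputs, wires):
        lut_input.clock(timestep, value)

captured = [lut_input.value for lut_input in lut_inputs]
comparator_width = timestep_bits(len(traffic))
```

Doubling the number of supported timesteps adds only one bit to each stored load time and comparator, which is the logarithmic growth noted above.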
With this input structure, logical LUT evaluation time is now decoupled from input arrival
time. This decoupling was not true in FPGAs, DPGAs, or even iDPGAs. With FPGAs, the LUT
is evaluated only while the inputs are held stable. With DPGAs, the LUT is evaluated only on the
microcycle when the inputs are delivered. With the iDPGA, the LUT must be evaluated on the
correct cycle relative to the arrival of the input, and the range of feasible cycles was limited by
the depth of the input shift register.
Further, with the time-switched input register, the inputs are stored, allowing the LUT result to be
produced on any microcycle, or microcycles, following the arrival of the final input. In the extreme
case of a single wire for interconnect, each LUT output would be produced and routed on a separate
microcycle. Strictly speaking, of course, with a single wire there need be only one physical LUT,
as well.
Figure 12.1: 4-LUT with Time-Switched Input Register
This decoupling of the input arrival time and LUT evaluation time allows us to remove the
simultaneous constraints which, when coupled with limited interconnectivity, made traditional
programmable gate array mapping difficult. We are left with a single constraint: schedule the entire
task within the available timesteps.
12.2 Switched Interconnect – Folding
Now that we no longer need to involve the physical interconnect in temporal transport, we are
free to reuse physical interconnect resources at their minimum operating time. This reuse allows us
to employ less physical interconnect than traditional FPGAs, while simultaneously providing more
connectivity.
12.2.1 Subarray Structure
Conceptually, let us consider array interconnect as composed of a series of fully interconnected
subarrays. That is, we arrange groups of LUTs in subarrays, as in the DPGA prototype (See
Section 10.4). Within a subarray, LUTs are fully interconnected with a monolithic crossbar. Also
feeding into and out of this subarray crossbar are connections to the other subarrays.
The subarray contains a number of LUTs, where we consider 64 as typical. Connecting into the
subarray are a number of inputs from outside. Similarly, a number of outputs connect out. The
numbers of external inputs and outputs are typically governed by Rent's Rule. With 64 LUTs, a
Rent multiplier of 4, and a Rent exponent between 0.5 and 0.7, we might expect between 32 and 74
external connections, and consider 64 inputs and 64 outputs typical and convenient.
Together, this suggests a crossbar with one input per LUT output and external input, and one
output per LUT input and external output, which is 128 × 320 for the typical values listed above.
This amounts to 640 switches per 4-LUT, which is about 2-3× the values used in conventional FPGA
architectures as we reviewed in Section 7.5. Conventional architectures, effectively, only populate
30-50% of the switches in such a block, relying on placement freedom to make up for the missing
switches. It is, of course, the complexity of the placement problem in light of this depopulation
which is largely responsible for the difficulty of place and route on conventional architectures.
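The arithmetic behind these typical values can be checked with a short sketch. It assumes the usual form of Rent's Rule (I/O count = multiplier × N^exponent) with the 4 read as the multiplier, and 4-input LUTs; the function names `rent_io` and `full_crossbar` are hypothetical.

```python
import math


def rent_io(n_luts: int, multiplier: float, exponent: float) -> float:
    """Rent's Rule estimate of external connections for a block of n_luts."""
    return multiplier * n_luts ** exponent


def full_crossbar(n_luts: int, k: int, n_in: int, n_out: int):
    """Unfolded subarray crossbar: every LUT output and external input is a
    crossbar input; every LUT input and external output is a crossbar output."""
    inputs = n_luts + n_in
    outputs = k * n_luts + n_out
    switches_per_lut = inputs * outputs / n_luts
    return inputs, outputs, switches_per_lut


low = rent_io(64, 4, 0.5)    # lower end of the Rent exponent range
high = rent_io(64, 4, 0.7)   # upper end of the Rent exponent range
print(round(low), math.ceil(high))
print(full_crossbar(64, 4, 64, 64))
```

With 64 LUTs, 64 external inputs and 64 external outputs this reproduces the 128 × 320 crossbar and the 640 switches per 4-LUT quoted above.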
We also need to interconnect these subarrays. For small arrays it may be possible to simply
interwire the connections between subarrays without substantial, additional switching. This is likely
the case for the 100-1000 LUT cases reviewed in Section 7.5. For larger arrays, more inter-array
switching will be required to provide reasonable connectivity. As we derived in Section 7.6, the
interconnect requirements will grow with the array size.
12.2.2 Interconnect Folding
With switched interconnect, we can realize a given level of connectivity without placing all of
the physical switches that such connectivity implies. Rather, with the ability to reuse the switches,
we effect the complete connectivity over a series of microcycles.
We can view this reuse as a folding of the interconnect in time. For instance, we could map pairs
of LUTs together such that they share input sets. This, coupled with cutting the number of array
outputs in half, will cut the number of crossbar outputs in half and hence halve the subarray
crossbar size. For full connectivity, it may now take us two microcycles to route the connections,
delivering the inputs to half the LUTs and half the array outputs in each cycle. In this particular
case we have performed output folding by sharing crossbar outputs (See Figure 12.2). Notice that
the time-switched input register allows us to get away with this folding by latching and holding
values on the correct microcycle. The input register also allows the non-local subarray outputs to
be transmitted over two cycles. In the most trivial case, the array outputs will be connected directly
to array inputs in some other array and, through the destination array's crossbar, they will, in turn,
be connected to LUT inputs where they can be latched on the appropriate microcycle as they arrive.
There is one additional feature worth noting about output folding. When two or more folded
LUTs share input values, all the LUTs can load the input when it arrives. For heavily output folded
scenarios, these shared inputs can be exploited by appropriate grouping to allow the task to be
routed in fewer microcycles than the total network sharing.
We can also perform input folding. With input folding, we pair LUTs so that they share a single
LUT output. Here we cut the number of array inputs in half, as well. The array crossbar
now requires only half as many inputs as before and is, consequently, also half as large in this case.
Again, the latched inputs allow us to load each LUT input value only on the microcycle on which
the associated value is actually being routed through the crossbar. For input folding, we must add
an effective pre-crossbar multiplexor so that we can select among the sources which share a single
crossbar input (See Figure 12.3).
It is also possible to fold together distinct functions. For example, we could perform an input
fold such that the 64 LUT outputs each shared a connection into the crossbar with the 64 array
inputs. Alternately, we could perform an output fold such that LUT inputs shared their connections
with array outputs.
Figure 12.2: Output Folding
Figure 12.3: Input Folding
Finally, note that we can perform input folding and output folding simultaneously (See
Figure 12.4). We can think of the DPGAs introduced in Chapter 10 as folded interconnect where we
folded both the network input and output by a factor equal to the number of contexts. Each DPGA
array element (See Figure 11.2) shared one set of logical LUT inputs per context on one set of
physical LUT inputs and shared one logical LUT output per context on a single LUT output.
Figure 12.5 shows how a two context DPGA results from a single input and output fold. In the
DPGA, the number of routing contexts equalled only the fold factor, while the total folding is the
square of that factor. To get away with this reduction in interconnect description, we had to
restrict routing to temporally adjacent contexts. As we saw in Chapter 10, this sometimes meant we
had to allocate LUTs for through routing when connections were needed between contexts.
Routing on these folded networks naturally proceeds in both space and time. This gives the
networks the familiar time-space-time routing characteristics pioneered in telephone switching
systems.
Figure 12.4: Input and Output Folding
Figure 12.5: Two-Context DPGA as Input and Output Fold
12.3 Architecture
In this section, we detail a complete TSFPGA architecture. The basic TSFPGA building block
is the subarray tile (See Figure 12.6) which contains a collection of LUTs and a central switching
crossbar. LUTs share output connections to the crossbar and input connections from the crossbar
in the folded manner described in the previous section. Communication within the subarray can
occur in one TSFPGA clock cycle. Non-local input and output lines to other subarrays also share
crossbar I/O’s to route signals anywhere in the device. Routes over long wires are pipelined to
maintain a high basic clock rate.
Array Element The TSFPGA array element is made up of a number of LUTs which share the
same crossbar outputs and input (See Figure 12.7). The LUT output into the crossbar is selected
based on the routing context programming. As shown, each array element shares its crossbar input
with several network inputs.
Figure 12.6: TSFPGA Subarray Composition (Subarray shown is smaller than typically used in
practice in order to avoid unnecessarily complicating the diagram.)
Figure 12.7: TSFPGA Array Element Composition
The LUT input values are stored in time-switched input registers. The inputs to the array
element are run to all LUT input registers. When the current timestep matches the programmed
load time, the input register is enabled to load the value on the array-element input. When multiple
LUTs in an array element take the same signal as input, they may be loaded simultaneously.
Unlike the DPGA architectures detailed in Chapters 10 and 11, the LUT multiplexor is replicated
for each logical LUT. As we saw in Section 10.4.2, the LUT mux is only a tiny portion of the
area in an array. Replicating it along with context memory avoids the need for final LUT input
multiplexors which would otherwise be in the critical path. When considering both the additional
input multiplexors and the requirements for selecting among the LUT programming memory, the
benefits of resource sharing at this level have been minimal in most of the implementations we have
examined.
Crossbar The primary switching element is the subarray crossbar. As shown in Figures 12.6
and 12.7, each crossbar input is selected from a collection of subarray network inputs and subarray
LUT outputs by a pre-crossbar multiplexor. Subarray inputs are registered prior to the pre-crossbar
multiplexor and outputs are registered immediately after the crossbar, either on the LUT
inputs or before traversing network wires. This pipelining makes the LUT evaluations and crossbar
traversal a single pipeline stage. Each registered crossbar output is routed in several directions to
provide connections to other subarrays or chip I/O.
Inter-subarray wire traversals are isolated into a separate pipeline stage between crossbars. As
we saw both in Section 7.1.3 and the DPGA prototype implementation (Section 10.4.2), wire and
switch traversals are responsible for most of the delay in programmable gate arrays. By pipelining
routes at the subarray level, we can achieve a smaller microcycle time and effectively extract higher
capacity from our interconnect.
Notice that the network in TSFPGA is folded such that the single subarray crossbar performs
all major switching roles:
1. output crossbar – routing data from LUT outputs to destinations or intermediate switching
crossbars
2. routing crossbar – routing data through the network between source and destination subarrays
3. input crossbar – receiving data from the network and routing it to the appropriate destination
LUT input
This sharing avoids dedicating specialized routing resources to any single function so that the
available resources can be deployed as needed by the task. Connections on TSFPGA are statically
routed in a distributed, multistage switching fashion.
Intra-Subarray Switching Communication within the subarray is simple and takes one clock
cycle per LUT evaluation and interconnect. Once a LUT has all of its inputs loaded, the LUT
output can be selected as an input to the crossbar, and the LUT’s consumers within the subarray
may be selected as crossbar outputs. At the end of the cycle, the LUT’s value is loaded into the
consumers’ input registers, making the value available for use on the next cycle.
Inter-Subarray Switching Figure 12.8 shows the way a subarray may be connected to other
subarrays on a component. A number of subarray outputs are run to each subarray in the same
row and column. For large designs, hierarchical connections may be used to keep the bandwidth
between subarrays reasonable while maintaining a limited crossbar size and allowing distant
connections. The hierarchical connections can give the logical effect of a three or four dimensional
network.
Routing data within the same row or column involves:
1. Route LUT output through crossbar to the outputs headed for the destination subarray.
2. Traverse the wire between subarrays.
3. Select network input with source value as a crossbar source and route through the crossbar to
the destination LUT input.
Figure 12.8: Sample Inter-Subarray Network Connections
When data needs to traverse both row and column:
1. Route LUT output to first dimension destination (row, column).
2. Traverse first dimension interconnect.
3. Switch output in second dimension (column, row).
4. Traverse second dimension interconnect.
5. Switch to LUT input and load.
Pipelining places each of these operations in a different clock cycle. Long wire connections may
merit multiple clock cycles for wire traversal – this is likely to be true for long, hierarchical
connections. Short wires, particularly the nearest neighbor connections, may not always merit a
separate pipeline stage for wire traversal.
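As a rough illustration of the pipelined route lengths listed above, the following sketch counts clock cycles for a route in a flat two-dimensional network. It assumes every subarray connects directly to all subarrays in its row and column, one cycle per crossbar pass, and a configurable number of cycles per wire traversal; the function name `route_cycles` and its parameters are hypothetical.

```python
def route_cycles(src, dst, wire_cycles=1):
    """Cycles from a LUT output to the load at the destination LUT input.

    src, dst: (row, column) coordinates of the source and destination
    subarrays. One crossbar pass is always needed; each dimension that
    differs adds one wire traversal plus one further crossbar pass.
    """
    dims_crossed = (src[0] != dst[0]) + (src[1] != dst[1])
    return 1 + dims_crossed * (wire_cycles + 1)


print(route_cycles((0, 0), (0, 0)))  # within one subarray
print(route_cycles((0, 0), (0, 3)))  # same row
print(route_cycles((0, 0), (2, 3)))  # row and column
```

Setting `wire_cycles` above 1 models the long, hierarchical connections which may merit multiple cycles, while short nearest-neighbor wires that share a stage with the crossbar would correspond to `wire_cycles=0`.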
I/O Connections I/O connections are treated like hierarchical network lines and are routed into
and out of the subarrays in a similar manner. Each input has an associated subarray through which
it may enter the switched network. Similarly, each output is associated with the crossbar output
of some subarray. Device outputs are composed of time-switched input registers and load values
from the network at designated timesteps like LUT inputs. Alternately, an output may look like
an inter-subarray pipeline register for routing in multichip systems.
Array Control Two “instruction” values are used to control the operation of each subarray on
a per clock cycle basis, timestep and routing context (shown in Figure 12.6). The routing context
serves as an instruction pointer into the subarray’s routing memory. It selects the conﬁguration
of the crossbar and pre-crossbar multiplexors on each cycle. timestep denotes time events and
indicates when values should be loaded from shared lines.
  Number of LUT inputs
  Maximum retiming depth
  LUTs in subarray
  External inputs to subarray
  External outputs to subarray
  Subarray crossbar inputs
  Subarray crossbar outputs
  Number of routing contexts
Table 12.1: TSFPGA Subarray Parameters
These two values are distinct in order to allow designs which take more microcycles to complete
than they actually require contexts. The evaluation of a function will take a certain number of
TSFPGA clock cycles. This time is dictated by the routed delay. For designs with large serial
paths, long critical paths, or poor locality of communications, the routed delay may be large without
consuming all the routing resources. For this reason, it is worthwhile to segregate the notion of
timestep from the notion of routing contexts. Each routing interconnect pattern, or context, may
be invoked multiple times at different timesteps during an evaluation. This allows us to have a
small number of routing contexts even when the design topology necessitates a large number of
timesteps.
As a trivial example, consider the case of an unfolded subarray. With full subarray interconnect
there may be enough physical interconnect in a single context to actually route the complete design.
However, since the design has a critical path composed of multiple LUT delays, it will take multiple
microcycles to evaluate. In this case, it is not necessary to allocate separate routing contexts for
each timestep as long as we segregate these two specifications into separate entities.
12.4 Architecture Parameters
The subarray composition effectively determines the makeup of a TSFPGA component.
Table 12.1 summarizes the base parameters characterizing a TSFPGA subarray implementation. From
these, we can calculate resource size and sharing:
(12.1)
(12.2)
crossbar pre-crossbar mux
(12.3)
Assuming we need to route through one intermediate subarray, on average, the number of
routing contexts needed for full connectivity is:
(12.4)
That is, we need one context to drive each crossbar source for each crossbar sink. When we have
the same number of contexts as we have total network sharing, we can guarantee to route anything.
Relation 12.4 assumes that the numbers of external subarray inputs and outputs are chosen
reasonably large to support traffic to, from, and through the array. If not, the sharing of
inter-subarray network lines will dominate strict crossbar sharing, and should be used instead.
In practice, we can generally get away with fewer contexts than implied by Relation 12.4 by
selectively routing outputs and inputs as needed. When LUT inputs sharing a crossbar output
also share input values or when the mapped design requires limited connectivity, less switching is
needed, and the routing tasks can be completed with fewer contexts. The freedom to specify which
output drives each crossbar input on a given cycle, as provided by the TSFPGA subarray, is not
strictly necessary. We could have a running counter which enables each source in turn. However,
with a fixed counter it would always take a full pass over the sources sharing a crossbar input to
pull out any particular source, despite the fact that only a few are typically needed at any time.
The effect is particularly acute when we think about levelized evaluation where we may be able to
simultaneously route all the LUTs whose results are currently ready and needed in a single cycle.
For this reason, TSFPGA provides independent control over the pre-crossbar input mux.
In total, the number of routing bits per LUT, then, is:
log2 log2 ( )
(12.5)
Additionally, each LUT has its own programmable function and matching inputs:
(12.6)
2 (12.7)
log2 ( ) (12.8)
12.5 TSFPGA Implementation Estimates
12.5.1 Area
The time-switched input register is the most novel building block in the TSFPGA design. A
prototype layout by Derrick Chen was:
  Time-Switched Input Register with comparators (retiming depth of 256)    32Kλ²
  LUT multiplexor with SRAM function memory                                32Kλ²
  Complete base LUT                                                       160Kλ²
Note that the complete base LUT contains the LUT multiplexor, LUT function memory, all 4 input
registers, their associated comparators, and the comparator programming.
The amortized area per LUT then is:
(12.9)
Using the 64-LUT subarray as a reference, a version with no folding has 35 routing bits and
640 switches per LUT, making the LUT area about 1800Kλ², which is about
2-3× the size of typical 4-LUTs. But, as we noted above, unfolded, we have 2-3× as many switches
as a conventional FPGA implementation. Also, unfolded, the expensive, matching input register is
not needed.
If we fold the input and output each once, giving a 64-input, 160-output crossbar, the number
of switches per LUT drops to 162. With four routing contexts, routing bits rise to
64 per LUT. The total area is 640Kλ², which is comparable in size with modern
FPGA implementations, while providing 2-3× the total connectivity.
Focus Design Point For the sake of evaluation, we settled on a single, highly folded design point
for close inspection. From our experience with the DPGA and other VLSI efforts, we chose to use a
16×16 crossbar as the base interconnect (16 crossbar inputs and 16 crossbar outputs), balancing the
desire to keep the crossbar compact and fast with the desire to perform as high radix switching as
feasible. Per LUT switches drop to an almost trivial level, 6. With 64 routing contexts, switching
bits rise to 112 per LUT. Along with the 16 bits for LUT function programming
and 32 bits for input match programming, this brings the total number of programming bits per
LUT up to 160, which is comparable to conventional FPGAs (See Table 7.2). The LUT area is
310Kλ², or about half the size of a conventional FPGA 4-LUT.
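The per-LUT figures quoted for the unfolded, once-folded, and focus design points can be reproduced with the sketch below. The resource model is an inference from those quoted numbers, not a transcription of Equations 12.1 through 12.9: it assumes a subarray of 64 4-LUTs with 64 external inputs and 64 external outputs in all three cases, counts a pre-crossbar mux only when crossbar inputs are shared, and uses fitted area costs of about 2.5Kλ² per switch and 1.2Kλ² per routing bit on top of the 160Kλ² base LUT. All function and parameter names are hypothetical.

```python
import math


def per_lut_resources(n_lut, k, n_in, n_out, xbar_in, xbar_out,
                      contexts, depth=256):
    """Per-LUT switch and programming-bit counts (assumed model)."""
    in_share = (n_lut + n_in) // xbar_in         # sources per crossbar input
    out_share = (k * n_lut + n_out) // xbar_out  # sinks per crossbar output
    # The pre-crossbar mux only exists when crossbar inputs are shared.
    mux_switches = xbar_in * in_share if in_share > 1 else 0
    switches = (xbar_in * xbar_out + mux_switches) / n_lut
    # Per context, each crossbar output selects one crossbar input and
    # each pre-crossbar mux selects one of its sources.
    bits_per_context = (xbar_out * math.log2(xbar_in)
                        + xbar_in * math.log2(in_share))
    return {
        "in_share": in_share,
        "out_share": out_share,
        "switches": switches,
        "routing_bits": contexts * bits_per_context / n_lut,
        "lut_bits": 2 ** k,
        "match_bits": k * math.log2(depth),
    }


def area_klambda2(res, base=160.0, per_switch=2.5, per_bit=1.2):
    """Amortized LUT area in K-lambda^2, using fitted (assumed) unit costs."""
    return base + per_switch * res["switches"] + per_bit * res["routing_bits"]


unfolded = per_lut_resources(64, 4, 64, 64, 128, 320, contexts=1)
folded = per_lut_resources(64, 4, 64, 64, 64, 160, contexts=4)
focus = per_lut_resources(64, 4, 64, 64, 16, 16, contexts=64)

for name, res in (("unfolded", unfolded), ("folded", folded), ("focus", focus)):
    print(name, res["switches"], res["routing_bits"], round(area_klambda2(res)))
```

Under these assumptions the focus design point has an 8:1 pre-crossbar mux, which agrees with the mux width listed in the timing discussion, and its area lands within about one percent of the quoted 310Kλ².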
At this size, each TSFPGA 4-LUT is effectively larger than the logical 4-LUT area in the
iDPGAs of the previous chapter. The added complexity and range (a retiming depth of 256) of the
time-switched input registers is largely responsible for the greater size. The time-switched input
register features, in turn, are what allow us to map designs without satisfying a large number of
simultaneous constraints.
12.5.2 Timing
Within the subarray, the critical path for the operating cycle of this design point contains:
1. clock to Q delay on the context address
2. Context memory read from 64-word deep memory
3. 8:1 pre-crossbar input mux
4. 16×16 crossbar traversal
5. setup time for the crossbar output flip-flops
For higher performance, the context read could be placed in its own pipeline stage. As noted, wire
traversal already operates as a separate pipeline stage of its own. When wire delays begin to exceed
the intra-subarray cycle delay, we can add additional pipelining to wire traversal. From simulations,
it looks feasible to run with a 200 MHz microcycle. This is roughly twice the microcycle frequency
for the DPGA design. The speedup here comes primarily from separating intra-subarray routing
and inter-subarray routing into separate pipeline stages.
12.6 TSFPGA Fast Circuit Mapping
Traditional logic and state-element netlists can be mapped to TSFPGA for levelized logic
evaluation similar to the DPGA mapping in the previous two chapters. Using this model, only the
ﬁnal place-and-route operation must be specialized to handle TSFPGA’s time-switched operation.
Of course, front-end netlist mapping which takes the TSFPGA architecture into account may be
able to better exploit the architecture, producing higher performance designs.
tspr, our first-pass place-and-route tool for TSFPGA, performs placement by min-cut
partitioning and routing by a greedy, list-scheduling heuristic. Both techniques are employed for their
simplicity and mapping-time efficiency rather than their quality or optimality. The availability of
adequate switching resources, expandable by allocating more configuration contexts, allows us to
obtain reasonable performance with these simple mapping heuristics. For the most part, the penalty
for poor quality placement and routing in TSFPGA is a slower design, not an unroutable design.
Timesteps and Contexts It is again worth noting that the number of timesteps and routing contexts
are dictated by different properties of the mapped network.
The topology of the circuit will determine the critical path length, or the number of logical LUT
delays between the inputs and outputs of the circuit. This critical path length is one lower bound on
the number of timesteps required to evaluate a circuit. However, once placed onto subarrays, there
is another, potentially longer, bound, the distance delay through the network. The distance delay
is the length of the longest path through the circuit including the cycles required for inter-subarray
routing. If all the LUTs directly along every critical path can be mapped to a single subarray, it is
possible that the distance delay is equal to the critical path length. However, in general, the placed
critical path crosses subarrays, resulting in a longer distance delay. The quality of the distance delay
is determined entirely during the placement phase.
The actual routed delay is generally larger than the distance delay because of contention. That
is, if the architecture does not provide enough physical resources to route all the connections in the
placed critical path simultaneously, or if the greedy routing algorithm allocates those resources
suboptimally, signals may take additional microcycles to actually be routed.
Placement Partitioning is based on the Fiduccia-Mattheyses min-cut heuristic [FM82]. Netlists
are recursively partitioned along TSFPGA dimension boundaries. That is, for a simple, two-
dimensional network topology, as shown in Figure 12.8, the design is ﬁrst partitioned for columns,
then columns are partitioned into subarrays. For larger networks, top-level row and column
partitioning would precede low-level row and column partitioning. The Fiduccia-Mattheyses
heuristic aims to minimize the size of the cut net, but does not, directly, minimize the effect of
cuts on circuit delay. As a consequence partitioning is useful in reducing the routing congestion
contribution to routed delay, but does not explicitly try to minimize the distance delay.
For the fastest mapping times, no sophisticated placement is done. Circuit netlists are packed
directly into subarrays as they are parsed from the netlists. Such oblivious placement may create
unnecessarily long paths by separating logically adjacent LUTs and may create unnecessary con-
gestion by not grouping tightly connected subgraphs. However, with enough routing contexts the
TSFPGA architecture allows us to succeed at routing with such poor placement.
Routing Routing is directed by the circuit netlist topology using a greedy, list-scheduling heuristic.
At the start, a ready list is initialized with all inputs and flip-flop outputs. Routing proceeds by
picking the output in the ready list which is farthest from the end set of primary outputs and flip-flop
inputs. Routing a signal implies reserving switch capacity in each context and timestep involved in
the route. If a route cannot be made starting at the current evaluation time, the starting timestep for
the route is incremented and the route search is repeated. Currently, only minimum distance routes
are considered. Assuming adequate context memory, every route will eventually succeed. Once a
route succeeds, any LUT outputs which are then ready for routing are added to the ready list. In
this manner, the routing algorithm works through the design netlist from inputs to outputs, placing
and routing each LUT as it is encountered.
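The shape of this greedy list scheduler can be sketched as follows. This is a heavily simplified illustration, not tspr itself: it assumes a single shared switching resource that can carry a fixed number of routes per timestep (`capacity`, which must be at least 1), one timestep per route, and broadcast of a signal to all of its consumers in one route. It ignores subarray placement, multi-hop routes, and routing contexts. The function name `list_schedule` and the netlist encoding are hypothetical.

```python
from collections import defaultdict


def list_schedule(fanins, capacity):
    """Greedy list scheduling of signal routes.

    fanins: dict mapping each node to the list of nodes driving it
            (primary inputs and flip-flop outputs map to []).
    Returns (routed_at, total): the timestep on which each routed signal
    is sent, and the number of timesteps until the last input arrives.
    """
    fanouts = defaultdict(list)
    for node, sources in fanins.items():
        for s in sources:
            fanouts[s].append(node)

    height = {}

    def distance_to_end(n):
        # Priority: LUT levels between this node and its farthest sink.
        if n not in height:
            height[n] = 1 + max((distance_to_end(c) for c in fanouts[n]),
                                default=0)
        return height[n]

    ready_at = {n: 0 for n, s in fanins.items() if not s}
    pending = {n: len(s) for n, s in fanins.items() if s}
    used = defaultdict(int)   # routes already reserved on each timestep
    routed_at = {}
    ready = list(ready_at)
    while ready:
        ready.sort(key=distance_to_end, reverse=True)
        node = ready.pop(0)
        if not fanouts[node]:
            continue          # final output: nothing further to route
        t = ready_at[node]
        while used[t] >= capacity:
            t += 1            # contention: retry starting one timestep later
        used[t] += 1
        routed_at[node] = t
        for consumer in fanouts[node]:
            ready_at[consumer] = max(ready_at.get(consumer, 0), t + 1)
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)
    return routed_at, max(ready_at.values(), default=0)


netlist = {"a": [], "b": [], "x": ["a", "b"], "y": ["a", "x"]}
print(list_schedule(netlist, capacity=2))
print(list_schedule(netlist, capacity=1))
```

With capacity for two routes per timestep the example finishes in two timesteps, matching its two-LUT critical path; with capacity for one, contention stretches the routed delay to three timesteps.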
Modulo Context Routing The total number of contexts is dictated by the amount of contention
for shared resources. Since some timesteps may route only a few connections, a routing context
may be used at multiple timesteps. In the simplest case, switches in a routing context not used
during one timestep may be allocated and used during another. In more complicated cases, a switch
allocated in one context can be reused with the same setting in another routing context. This is
particularly useful for the inter-subarray routing of patterns, but may be computationally difﬁcult
to exploit.
Our experimental mapping software can share contexts among routing timesteps by modulo
context assignment. That is, the context whose index is the timestep modulo the number of
available contexts is used to route on each timestep. As we will
see in the next section, this generally allows us to reduce the number of required contexts. Further
context reduction is possible when we are willing to increase the number of timesteps required for
evaluation. More sophisticated sharing schemes are likely to be capable of producing better results.
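A minimal sketch of modulo context assignment follows. It covers only the simplest sharing case described above, in which switches unused on one timestep are given to another timestep that maps to the same context. It abstracts each timestep's routing demand to a count of switch settings and each context to a fixed switch capacity, so it does not model which particular switches are needed. The function names and the example demand profile are hypothetical.

```python
def context_for(timestep, n_contexts):
    """Routing context used on a given timestep under modulo assignment."""
    return timestep % n_contexts


def fits(demand, n_contexts, capacity):
    """True if every context can hold the demand of all timesteps sharing it.

    demand[t] is the number of switch settings needed on timestep t.
    """
    load = [0] * n_contexts
    for t, needed in enumerate(demand):
        load[context_for(t, n_contexts)] += needed
    return all(total <= capacity for total in load)


def min_contexts(demand, capacity):
    """Smallest context count for which modulo assignment fits, else None."""
    for n in range(1, len(demand) + 1):
        if fits(demand, n, capacity):
            return n
    return None


demand = [3, 1, 0, 2, 1, 1]   # six timesteps with uneven routing demand
print(min_contexts(demand, capacity=4))
print(min_contexts(demand, capacity=3))
```

In this toy profile a capacity of four settings per context lets six timesteps share two contexts, whereas a capacity of three forces one context per timestep.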
          Netlist Size    Target Array          Quick Map                  Performance Map      Best Map
Design    LUTs    IOs     Tiles   LUTs   IOs    Time     Delays            Time     Delays
                          (SA)                  (sec.)   LUT  Dist  Rte    (sec.)   Dist  Rte   Rte
5xp1        46     17     2×1      128    32     0.05     11    14   19      0.67     14   19    18
9sym       123     10     2×1      128    32     0.18      8    15   29      4.02     15   25    23
9symml     108     10     2×1      128    32     0.15      9    17   27      9.48     13   24    21
C499        85     73     3×2      384    96     0.15     11    22   34      3.06     23   33    25
C880       176     86     3×2      384    96     0.34     22    44   48      9.67     31   36    32
alu2       169     16     2×2      256    64     0.28     20    43   45     10.00     43   47    34
apex6      248    234     4×4     1024   256     0.69     10    27   37     34.06     19   23    23
apex7       77     86     3×2      384    96     0.16      8    19   24      3.15     16   19    19
b9          46     62     2×2      256    64     0.08      8    15   21      0.74     12   14    14
clip       121     14     2×1      128    32     0.19     10    23   29      5.08     18   26    23
cordic     367     25     3×2      384    96     0.98     14    47   60     26.59     40   43    39
count       46     51     2×2      256    64     0.10     17    26   27      1.22     24   25    21
des       1267    501     6×6     2304   576     6.30     14    51   66    626.40     37   43    35
e64        230    130     3×3      576   144     0.63     10    29   40     18.90     32   33    26
f51m        45     16     2×1      128    32     0.07     18    21   22      0.05     21   22    22
misex1      20     15     1×1       64    16     0.03      7    10   16      0.02     10   13    13
misex2      38     43     2×2      256    64     0.07      9    15   18      0.95     15   16    15
rd73       105     10     2×1      128    32     0.14     11    18   27      4.35     14   22    21
rd84       150     12     2×2      256    64     0.24     10    30   35      4.88     26   30    24
rot        293    242     4×4     1024   256     0.76     17    44   45     21.14     28   31    31
sao2        73     14     2×1      128    32     0.11     10    14   22      1.79     13   20    18
vg2         60     33     2×2      256    64     0.11     10    17   23      1.07     14   19    19
z4ml         8     11     1×1       64    16     0.03      8    11   12      0.02     11   12    12
Run times given are in seconds on a SparcStation 20 Model 71 (rated at 125 SPECint92).
Table 12.2: TSFPGA Mappings for MCNC Circuit Benchmarks
12.7 Circuit Mapping
In this section we show the results of mapping the same MCNC benchmark circuit suite used
for the DPGA in the previous two chapters to TSFPGA. These benchmarks are mapped viewing
TSFPGA simply as an FPGA with time-switched interconnect, ignoring the way one might tailor
tasks to take full advantage of the architecture.
Table 12.2 shows the results of mapping the benchmark circuits to TSFPGA. The same area
mapped circuits from sis and Chortle used in Sections 10.5.3 and 11.4 were used for this
mapping. Each design was mapped to the smallest rectangular collection of subarray tiles which
supported both the design’s I/O and LUT requirements. Quick mapping does oblivious placement
while the performance mapping takes time to do partitioning. Both the quick and performance map
use the same, greedy routing algorithm. As noted in Section 12.6, fairly simple placement and
routing techniques are employed, so higher quality routing results are likely with more sophisticated
algorithms. Quick mapping can route designs in the order of seconds, while performance mapping
runs in minutes. The experimental mapping software implementation has not been optimized for
performance, so the times shown here are, at best, a loose upper bound on the potential mapping
time. The “Best map” results in Table 12.2 summarize the best results seen over several runs of the
“performance” map.
          Target Ratios     Quick Map        Performance Map    Best
Design    LUTs    IOs       Delay Ratios     Delay Ratios       Route
          (fraction used)   Dist    Route    Dist    Route      Ratio
5xp1      0.36    0.53      1.27    1.73     1.27    1.73       1.64
9sym      0.96    0.31      1.88    3.62     1.88    3.12       2.88
9symml    0.84    0.31      1.89    3.00     1.44    2.67       2.33
C499      0.22    0.76      2.00    3.09     2.09    3.00       2.27
C880      0.46    0.90      2.00    2.18     1.41    1.64       1.45
alu2      0.66    0.25      2.15    2.25     2.15    2.35       1.70
apex6     0.24    0.91      2.70    3.70     1.90    2.30       2.30
apex7     0.20    0.90      2.38    3.00     2.00    2.38       2.38
b9        0.18    0.97      1.88    2.62     1.50    1.75       1.75
clip      0.95    0.44      2.30    2.90     1.80    2.60       2.30
cordic    0.96    0.26      3.36    4.29     2.86    3.07       2.79
count     0.18    0.80      1.53    1.59     1.41    1.47       1.24
des       0.55    0.87      3.64    4.71     2.64    3.07       2.50
e64       0.40    0.90      2.90    4.00     3.20    3.30       2.60
f51m      0.35    0.50      1.17    1.22     1.17    1.22       1.22
misex1    0.31    0.94      1.43    2.29     1.43    1.86       1.86
misex2    0.15    0.67      1.67    2.00     1.67    1.78       1.67
rd73      0.82    0.31      1.64    2.45     1.27    2.00       1.91
rd84      0.59    0.19      3.00    3.50     2.60    3.00       2.40
rot       0.29    0.95      2.59    2.65     1.65    1.82       1.82
sao2      0.57    0.44      1.40    2.20     1.30    2.00       1.80
vg2       0.23    0.52      1.70    2.30     1.40    1.90       1.90
z4ml      0.12    0.69      1.38    1.50     1.38    1.50       1.50
Average   0.46    0.62      2.08    2.73     1.80    2.24       2.01
All delay ratios are (Dist, Route Delay) / (LUT Delay).
Table 12.3: TSFPGA Mappings for MCNC Circuit Benchmarks (Ratios)
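As a cross-check on how the ratio entries of Table 12.3 relate to the raw mapping results in Table 12.2, the sketch below recomputes them for two benchmark rows copied from Table 12.2. The dictionary layout and the function name `ratios` are hypothetical; utilization is netlist size over target array size, and each delay ratio is a mapped delay over the critical path LUT delay.

```python
# Two rows of Table 12.2: netlist size, target array size, critical path
# LUT delay, (distance, routed) delays for the quick and performance maps,
# and the best routed delay.
TABLE_12_2 = {
    "5xp1": dict(luts=46, ios=17, array_luts=128, array_ios=32,
                 lut_delay=11, quick=(14, 19), perf=(14, 19), best=18),
    "des": dict(luts=1267, ios=501, array_luts=2304, array_ios=576,
                lut_delay=14, quick=(51, 66), perf=(37, 43), best=35),
}


def ratios(row):
    """Utilization and delay ratios in the form used by Table 12.3."""
    lut = row["lut_delay"]
    return {
        "lut_use": round(row["luts"] / row["array_luts"], 2),
        "io_use": round(row["ios"] / row["array_ios"], 2),
        "quick": tuple(round(d / lut, 2) for d in row["quick"]),
        "perf": tuple(round(d / lut, 2) for d in row["perf"]),
        "best": round(row["best"] / lut, 2),
    }


for design, row in TABLE_12_2.items():
    print(design, ratios(row))
```

Both rows reproduce the corresponding Table 12.3 entries to two decimal places.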
Table 12.3 shows usage and time ratios derived from Table 12.2. All of the mapped delay ratios
are normalized to the number of LUT delays in the critical path. We see that the quick mapped
delays are almost 3× the critical path LUT delay, while the performance mapped delays are closer
to 2×. As we noted in Section 12.5.2, the basic microcycle on TSFPGA is half that on the DPGA,
suggesting that the performance mapped designs achieve roughly the same average latency as full,
levelized evaluation on the DPGA. We can see from the distance delay averages that placement
dictated delay is responsible for a larger percentage of the difference between critical path delay and
routed delay. However, since the routed delay is larger than the distance delay, network resource
contention and suboptimal routing are partially responsible for the overall routed delay time.
          Min Delay                              Min Contexts
Design    delay   # ctx   ctx saved   fraction   delay   # ctx   ctx saved   fraction
5xp1        19     15        4         0.21        28     12        7         0.37
9sym        25     21        4         0.16        38     17        8         0.32
9symml      24     20        4         0.17        41     19        5         0.21
C499        33     19       14         0.42        48     16       17         0.52
C880        36     29        7         0.19        54     19       17         0.47
alu2        47     43        4         0.09       107     28       19         0.40
apex6       23     19        4         0.17        34     16        7         0.30
apex7       19     14        5         0.26        25     11        8         0.42
b9          14     10        4         0.29        18      8        6         0.43
clip        26     22        4         0.15        42     19        7         0.27
cordic      43     39        4         0.09       106     34        9         0.21
count       25     21        4         0.16        31     14       11         0.44
des         43     39        4         0.09        67     35        8         0.19
e64         33     30        3         0.09        58     28        5         0.15
f51m        22     18        4         0.18        40     13        9         0.41
misex1      13     10        3         0.23        16      8        5         0.38
misex2      16     12        4         0.25        20      9        7         0.44
rd73        22     17        5         0.23        32     14        8         0.36
rd84        30     26        4         0.13        52     25        5         0.17
rot         31     27        4         0.13        48     21       10         0.32
sao2        20     16        4         0.20        33     15        5         0.25
vg2         19     15        4         0.21        29     14        5         0.26
z4ml        12      4        8         0.67        16      3        9         0.75
Table 12.4: Modulo Context Sharing for MCNC Benchmarks
Context Compression As noted in Section 12.6 we can use modulo context assignment to pack
designs into fewer routing contexts at the cost of potentially increasing the delay. Table 12.4 shows
the number of contexts into which each of the designs in Table 12.2 can be packed both with
and without expanding their delay. Figure 12.9 shows how routed delay of several benchmarks
increases as the designs are packed into fewer routing contexts.
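The ratio column of Table 12.4 can be checked with a few lines of arithmetic. The sketch below is only our reading of the flattened table: we assume that, within each group, the second value is the packed context count, the third is the number of contexts saved, and the fourth is the saved count divided by their sum. The function name `context_savings` is hypothetical.

```python
def context_savings(packed_contexts, contexts_saved):
    """Return (original context count, fraction of contexts saved).

    Assumption: original count = packed count + contexts saved,
    which is our interpretation of the Table 12.4 columns.
    """
    original = packed_contexts + contexts_saved
    return original, round(contexts_saved / original, 2)


# Rows copied from the "Min Delay" group of Table 12.4:
# (design, packed contexts, contexts saved).
min_delay_rows = [("5xp1", 15, 4), ("C499", 19, 14), ("alu2", 43, 4)]

for design, packed, saved in min_delay_rows:
    original, fraction = context_savings(packed, saved)
    print(f"{design}: {original} -> {packed} contexts, fraction saved {fraction}")
```

Under this reading the computed fractions agree with the tabulated values for the rows shown (0.21, 0.42, and 0.09).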
[Figure: plot of routed delay in Time Steps (vertical axis, 10 to 110) against Number of Contexts (horizontal axis, 10 to 50) for C499, alu2, count, and C880]
Figure 12.9: Sample Delay Increases with Context Packing
12.8 Related Work
Dharma [BCK93] time-switched two monolithic crossbars. It made the same basic reduction
as the DPGA – that is, rather than having to simultaneously route all connections in the task, one
only needed to route all connections on a single logical evaluation level. To make mapping fast,
Dharma used a single monolithic crossbar. For arrays of decent size, the full crossbar connecting
all LUTs at a single level can still be prohibitively large. Further, Dharma had a rigid assignment
of LUT evaluation and hence routing resources to levels. As we see in TSFPGA, it is not always
worth dedicating all of one’s routing resources, an entire routing context, to a single evaluation
timestep. Dharma deals with retiming using a separate ﬂow-through crossbar. While the ﬂow
through retiming buffers are cheaper than full LUTs, they still consume active routing area which
is expensive. As noted in Chapter 11, it is more area efﬁcient to perform retiming, or temporal
transport, in registers than to consume active interconnect.
VEGA [JL95], noted in Sections 10.2 and 11.5, uses a 1024 deep context memory, essentially
eliminating the spatial switching network, and uses a register ﬁle to retime intermediate data.
The VEGA architecture allows similar partitioning and greedy routing heuristics to be used for
mapping. However, the heavy multicontexting and trivial network in VEGA mean that it achieves
its simplified mapping only at the cost of a 1024× reduction in active peak capacity and a 100×
penalty in typical throughput and latency over traditional FPGA architectures.
PLASMA [ACC+96] was built for fast mapping of logic in reconfigurable computing tasks.
It uses a hierarchical series of heavily populated crossbars for routing on chip. The existence of
rich, hierarchical routing makes simple partitioning adequate to satisfy interconnect constraints.
The heavily populated crossbars make the routing task simple. The basic logic primitive in
PLASMA is the PALE, which is roughly a 2-output 6-LUT. The PLASMA IC packs 256 PALEs
into 16.2mm × 16.2mm in a 0.8µm CMOS process, or roughly 1.6Gλ². This comes to 6.4Mλ²
per PALE. If we generously assume each PALE is equivalent to 8 4-LUTs, the area per 4-LUT
is 800Kλ², which is commensurate with conventional FPGA implementations, or about 2-2.5×
the size for the TSFPGA design point described above. In practice, the TSFPGA LUT density
will be even greater since it is not always the case that 8 4-LUTs can be packed into each PALE.
PLASMA’s size is a direct result of the fact that it builds all of its rich interconnect as active switches
and wires. Routing is not pipelined on PLASMA, and critical paths often cross chip boundaries. As
a result, typical PLASMA designs run with high latency and a moderately slow system clock rate
(1-2 MHz). This suggests a time-switched device with a smaller amount of physical interconnect,
such as TSFPGA, could provide the same level of mapping speed and mapped performance in
substantially less area.
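The PLASMA area figures follow from simple unit conversion. The sketch below reproduces them under one assumption that is not restated in this section: λ is taken as half the drawn feature size, so λ = 0.4µm for a 0.8µm process. The function name and parameter names are hypothetical.

```python
def plasma_area_in_lambda(die_side_mm=16.2, feature_um=0.8,
                          pales=256, luts_per_pale=8):
    """Return (die area, area per PALE, area per 4-LUT) in units of lambda^2."""
    lambda_um = feature_um / 2.0                 # assumption: lambda = feature / 2
    side_in_lambda = die_side_mm * 1000.0 / lambda_um
    die_area = side_in_lambda ** 2
    per_pale = die_area / pales
    per_lut = per_pale / luts_per_pale           # generous 8 4-LUTs per PALE
    return die_area, per_pale, per_lut


die, pale, lut = plasma_area_in_lambda()
print(f"die  ~ {die / 1e9:.2f} G lambda^2")
print(f"PALE ~ {pale / 1e6:.1f} M lambda^2")
print(f"LUT  ~ {lut / 1e3:.0f} K lambda^2")
```

With these inputs the three values round to about 1.6Gλ², 6.4Mλ², and 800Kλ², matching the numbers quoted above.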
Virtual Wires [BTA93] employs time-multiplexing to extract higher capacity out of the I/O
and inter-chip network connections only. Virtual Wires folds the FPGA and switching resources
together, using FPGAs for inter-FPGA switching as well as logic. Since Virtual Wires is primarily a
technique used on top of conventional FPGAs, it does not accelerate the task of routing each individual
FPGA or provide any greater use of on-chip active switching area.
Li and Cheng’s Dynamic FPID [LC95] is a time-switched Field-Programmable Interconnect
Device (FPID) for use in partial-crossbar interconnection of FPGAs. Similarly, they increase
switching capacity, and hence routability and switch density, by dynamically switching a dedicated
FPID.
UCSB researchers [cLCWMS96] consider adding a second routing context to a conventional
FPGA routing architecture. Using a similar circuit benchmark suite, they find that the second
context reduces wire and switching requirements by 30%. Since they otherwise use a conventional
FPGA architecture, there is no reduction in mapping complexity for their architecture.
12.9 Conclusions
We have developed a new, programmable gate-array architecture model based around time-
switching a modest amount of physical interconnect. The model deﬁnes a family of arrays with
varying amounts of active interconnect. The key, enabling feature in TSFPGA is an input register
which performs a wide range of signal retiming, freeing the active interconnect from performing
data retiming or conveying data at rigidly deﬁned times. Coupling the ﬂexible retiming with
reusable interconnect, we remove most of the constraints which make the place and route task
difﬁcult on conventional FPGA architectures. Consequently, even large designs can be mapped
onto a TSFPGA array in seconds. More sophisticated mapping can be used, at the cost of longer
mapping times, to achieve the lowest delay and best resource utilization. We demonstrated the
viability of this fast mapping scheme by developing experimental mapping software for one design
point and mapping traditional benchmark circuits onto these arrays. At the heavily time-switched
design point which we explored in detail, the basic LUT size is half that of a conventional FPGA
LUT while mapped design latency is comparable to the latency on fully levelized DPGAs.
12.10 Open Issues
At this point, we have left a number of interesting issues associated with TSFPGA unanswered.
Performance using traditional place and route strategies – The fast mapping which we used
above employs fast heuristics which are purposely limited to linear mapping complexity.
Traditional mapping software uses different techniques, such as simulated annealing, which
will consider simultaneous constraints to minimize resource usage and routed path length. It
will be worthwhile to understand how well tasks can be mapped to members of the TSFPGA
architecture when we are willing to take the time to perform a quality mapping job. In normal
usage, one might use the fast mapping during design development and debug, then use the
slower, higher quality mapping once a design becomes stable.
Explore deﬁned architectural space – We have focussed on a single point in the deﬁned
architectural space. It will be worthwhile to map tasks across various architectural points to
determine the level of connectivity required to meet typical throughput and latency require-
ments, and to determine the most area efﬁcient implementation points.
Multichip extension – The speciﬁcs explored here focus on single-chip implementations, but
there is a natural extension to multiple chip systems. The dimensional routing organization
used on-chip should extend between chips when an array of TSFPGA components is em-
ployed. Partitioning, placement, and routing amongst components will be very similar to
partitioning, placement, and routing amongst the subarrays on a single TSFPGA component.
The boundary i/o will provide a more severe bottleneck between subarrays on distinct chips,
requiring heavier time-multiplexing. Inter-TSFPGA routes will require more pipeline stages
than inter-subarray routes.
13. MATRIX
Throughout this work, we have seen the central role which instructions play in general-purpose
computing architectures. In Section 8.6, we saw a large architectural space characterized by the
number of distinct control streams, datapath granularities, and instruction depth. In Chapters 4,
8, and 9, we reviewed this rich architectural space for general-purpose computing devices. We
saw that the choices made in these parameters are what distinguish conventional general-purpose
architectures, and we saw that it is these choices that deﬁne the circumstances under which a given
general-purpose architecture is most efﬁcient. In Section 9.5, we saw that even limiting ourselves
to datapath granularity and instruction depth, it is not possible to select a single pair of these
parameters which yields a robust architecture – that is, there is no single selection point whose
area requirement stays within a bounded factor of the optimal selection of these two parameters
for every task.
Every conventional general-purpose architecture reviewed in Chapter 4 and summarized in
Table 8.1 takes a stand on instruction resources by selecting:
1. control stream to instruction ratio
2. local instruction depth
3. instruction to datapath element ratio
These selections are made and ﬁxed at fabrication time and characterize the device for its entire
lifetime. Unfortunately, most real computations are neither purely regular nor irregular, and real
computations do not work on data elements of a single data size. Typical computing tasks spend
most of their time in a very small portion of the code. In the kernel where most of the computational
time is spent, the same computation is heavily repeated, making it very regular such that a shallow
instruction store is appropriate. The rest of the code is used infrequently, making it irregular such that
it is suited to a deep instruction store. Further, in systems, a general-purpose computational device
is typically called upon to run many applications with differing requirements for datapath size,
regularity, and control streams. This broad range of application requirements makes it difﬁcult, if
not impossible, to achieve robust and efficient performance across entire applications or application
sets by selecting a single computational device which has a rigidly selected instruction organization.
In this chapter, we introduce MATRIX, a novel, general-purpose computing architecture which
does not take a pre-fabrication stand on the assignment of space, distribution, and control for
instructions. Rather, MATRIX allows the user or application to determine the actual organiza-
tion and deployment of resources as needed. Post-fabrication the user can allocate instruction
stores, instruction distribution, control elements, datapaths, data stores, dedicated and ﬁxed data
interconnect, and the interaction between datastreams and instruction streams.
We introduce MATRIX and the concepts behind it. We ground the abstract concepts behind the
MATRIX architecture with:
a concrete microarchitecture
an illustrative application example
model estimates and prototype implementation highlights
architecture efﬁciencies for sample image processing tasks
MATRIX was developed jointly by Ethan Mirsky and André DeHon. André oversaw the architecture
and guided the architectural definition, while Ethan defined the detailed microarchitecture and
developed the VLSI implementation. MATRIX was first described publicly in [MD96] and portions
of this chapter are taken from that description. Ethan details the MATRIX microarchitecture in his
thesis [Mir96].
13.1 MATRIX Concepts
MATRIX is designed to maintain ﬂexibility in instruction control. Primary instruction distri-
bution paths are not deﬁned at fabrication time. Instruction memories are not dedicated to datapath
elements. Datapath widths are not fully predetermined. MATRIX neither binds control elements
to datapaths nor predetermines elements that can only serve as control elements.
To provide this level of ﬂexibility, MATRIX is based on a uniform array of primitive elements
and interconnect which can serve instruction, control, and data functions. A single network is shared
by both instruction and data distribution. A single integrated memory and computing element can
serve as an instruction store, data store, datapath element, or control element. MATRIX’s primitive
resources are, therefore, deployable, in that the primitives may be deployed on a per-application
basis to serve the role of instruction distribution, instruction control, and datapath elements as
appropriate to the application. This allows tasks to have just as much regularity, dynamic control,
or dedicated datapaths as needed. Datapaths can be composed efﬁciently from primitives since
instructions are not prededicated to datapath elements, but rather delivered through the uniform
interconnection network.
The key to providing this ﬂexibility is a multilevel conﬁguration scheme which allows the
device to control the way it will deliver conﬁguration information. To ﬁrst order, MATRIX uses
a two level conﬁguration scheme. Traditional “instructions” direct the behavior of datapath and
network elements on a cycle-by-cycle basis. Metaconfiguration data configures the device behavior
at a more primitive level, defining the architectural organization for a computation. Metaconfiguration
data can be used to define the traditional architectural characteristics, such as instruction
distribution paths, control assignment, and datapath width. The metaconfiguration “wires up”
configuration elements which do not change from cycle to cycle, including “wiring” instruction sources
for elements whose configuration does change from cycle to cycle.
[Figure: BFU block diagram. Labeled elements: BFU core with memory block (A_ADR, B_ADR, A PORT, B PORT, MODE, DATA, WE) and ALU (A_in, B_in, C_in, C_out, F_sel, Out); Address/Data A and B ports; ALU function port (Fa) and memory function port (Fm); Floating Port 1 (FP1) and Floating Port 2 (FP2); Network Switch 1 (N1) and Network Switch 2 (N2) fed by incoming network lines (L1, L2, L3); control logic with carry in and carry out; L3 control lines; Level 1 and Level 2, 3 network drivers]
Figure 13.1: MATRIX BFU
13.2 MATRIX Architecture Overview
In this section we ground the more abstract concepts of the previous section with a concrete
MATRIX microarchitecture. This concrete microarchitecture will be the focus of the remainder of
the chapter. The concrete microarchitecture is based around an array of identical, 8-bit primitive
datapath elements overlayed with a conﬁgurable network. Each datapath element or functional
unit contains a 256×8-bit memory, an 8-bit ALU and multiply unit, and reduction control logic
including a 20×8 NOR plane. The network is hierarchical, supporting three levels of interconnect.
Functional unit port inputs and non-local network lines can be statically conﬁgured or dynamically
switched.
13.2.1 BFU
The Basic Functional Unit (BFU) is shown in Figure 13.1. The BFU contains three major
components:
256×8 memory – the memory can function either as a single 256-byte memory or as a
dual-ported, 128×8-bit memory in register-file mode. In register-file mode the memory supports
two reads and one write operation on each cycle.
8-bit ALU – the ALU supports the standard set of arithmetic and logic functions including
NAND, NOR, XOR, shift, and add. With optional input inversion, this extends to include
OR, AND, XNOR, and subtract. A configurable carry chain between adjacent ALUs allows
cascading of ALUs to perform wide-word operations. The ALU also includes an 8×8
multiply-add-add operation; the multiply operation takes two operating cycles to deliver its
[Figure: BFU control logic. Labeled elements: Comp/Reduce I, Comp/Reduce II, and NOR Plane (1/2 PLA), with inputs from the BFU output, the neighborhood, and Floating Ports I and II, producing the Control Bit and Control Byte]
Figure 13.2: BFU Control Logic
results over the 8-bit BFU output, delivering the low 8 bits of the product on the ﬁrst cycle
and the high 8 bits on the second cycle.
Control Logic – the control logic is composed of: (1) a local pattern matcher for generating
local control from the ALU output (Figure 13.2 Left), (2) a reduction network for generating
local control (Figure 13.2 Middle), and (3) a 20-input, 8-output NOR block which can serve
as half of a PLA (Figure 13.2 Right). The local pattern matcher is used to reduce the datapath
value to a condition bit such as zero detect, positive or negative test, or carry detect. The
control bit produced by the reduction network is used to select among control contexts, which
we describe in Section 13.2.4. The NOR plane allows us to perform programmable, bit-wise
logic functions. This is the primary place where the datapath can be broken down to bits
or composed from bits. Since the NOR plane acts on the bit level, it can be used to permute
the bits in a byte or perform extract or deposit operations between two bytes.
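Two of the ALU behaviors described above, carry chaining between adjacent 8-bit ALUs and delivery of the 16-bit product over two cycles, can be modeled in a few lines. This is only a behavioral sketch: the function names are hypothetical, the add portions of the multiply-add-add operation are omitted, and no claim is made about how the hardware sequences these steps.

```python
def alu_add(a, b, carry_in=0):
    """One 8-bit ALU add; returns (8-bit sum, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8


def add16(x, y):
    """16-bit add composed from two 8-bit ALUs linked by the carry chain."""
    low, carry = alu_add(x & 0xFF, y & 0xFF)
    high, carry = alu_add((x >> 8) & 0xFF, (y >> 8) & 0xFF, carry)
    return (high << 8) | low, carry


def multiply_cycles(a, b):
    """8x8 multiply: low product byte on the first cycle, high byte on the second."""
    product = (a & 0xFF) * (b & 0xFF)
    return [product & 0xFF, product >> 8]


print(add16(0x01FF, 0x0001))      # carry ripples from the low ALU into the high ALU
print(multiply_cycles(200, 100))  # two bytes, low first
```

The same cascading idea extends to wider words by chaining more `alu_add` stages, one per BFU.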
MATRIX operation is pipelined at the BFU level with a pipeline register at each BFU input
port. A single pipeline stage includes:
1. Memory read
2. ALU operation
3. Memory write and local interconnect traversal – these two operations proceed in parallel
The BFU can serve in any of several roles:
I-store – Instruction memory for controlling ALU, memory, or interconnect functions
Data memory – Read/Write memory for storage and retiming of data
RF+ALU slice – Byte slice of a register-ﬁle-ALU combination
ALU function – Independent ALU function
The BFU’s versatility allows each unit to be deployed as part of a computational datapath or as part
of the memory or control circuitry in a design.
[Figure: two panels. Left: Nearest Neighbor Interconnect. Right: Length Four Bypass Interconnect]
Figure 13.3: MATRIX Network
13.2.2 Network
The MATRIX network is a hierarchical collection of 8-bit busses. The interconnect distribution
resembles traditional FPGA interconnect. Unlike traditional FPGA interconnect, MATRIX has the
option to dynamically switch network connections. The network includes:
1. Nearest Neighbor Connection (Figure 13.3 Left) – A direct network connection is pro-
vided between the BFUs within two manhattan grid squares. Results transmitted over local
interconnect are available for consumption on the following clock cycle.
2. Length Four Bypass Connection (Figure 13.3 Right) – Each BFU supports two connections
into the level two network. The level two network allows corner turns, local fanout, medium
distance interconnect, and some data shifting and retiming. Travel on the level two network
may add as few as one pipeline delay stage between producer and consumer for every three
level two switches included in the path. Each level two switch may add a pipeline delay
stage if necessary for data retiming.
3. Global Lines – Every row and column supports four interconnect lines which span the
entire row or column. Travel on a global line adds one pipeline stage between producer and
consumer.
Notice that the same network resources deliver instructions, data, addresses, and control to the
BFU ports. All of the eight BFU input ports (Figure 13.1) are connected to this same network, and
all BFU outputs are routed through this network.
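To make the nearest-neighbor rule concrete, the sketch below enumerates the BFUs reachable over the level-one network from a given array position. It assumes that "within two manhattan grid squares" means Manhattan distance of at most 2, that the BFU itself is not counted, and that the array does not wrap at its edges; none of these details is stated explicitly in the text, and the function name is hypothetical.

```python
def nearest_neighbors(row, col, rows, cols, radius=2):
    """List (row, col) positions within the given Manhattan radius of a BFU."""
    neighbors = []
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            distance = abs(dr) + abs(dc)
            if 0 < distance <= radius:              # skip the BFU itself
                r, c = row + dr, col + dc
                if 0 <= r < rows and 0 <= c < cols:  # assumption: no wraparound
                    neighbors.append((r, c))
    return neighbors


print(len(nearest_neighbors(4, 4, 8, 8)))  # interior BFU
print(len(nearest_neighbors(0, 0, 8, 8)))  # corner BFU
```

Under these assumptions an interior BFU has 12 level-one neighbors, and BFUs near the array boundary have fewer.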
13.2.3 Port Architecture
[Figure: BFU port architecture. Labeled elements: Network Inputs, Local Output, FPout, Configuration Word A and Configuration Word B selected by the Control Bit, Control Byte, a register present on the A and B ports only, and outputs to the BFU (A, B) and Network Drivers (N1, N2)]
Figure 13.4: BFU Port Architecture
The MATRIX port conﬁguration is one of the keys to the architecture’s ﬂexibility. The input
ports are the primary source of MATRIX’s metaconﬁguration. Figure 13.4 shows the composition
of the BFU network and data ports. Each port can be conﬁgured in one of three major modes:
1. Static Value Mode – The value stored in the port conﬁguration word is used as a static value
driven into the port. This is useful for driving constant data or instructions into a BFU. BFUs
conﬁgured simply as I-Stores or memories will have their ALU function port statically set
to pass memory output data. BFUs operating in a systolic array might also have their ALU
function port set to the desired operation. For regular operations a BFU may be dedicated to
that function and, in so doing, requires no instruction memory be allocated for control.
2. Static Source Mode – The value stored in the port conﬁguration word is used to statically
select the network bus providing data for the appropriate port. This configuration is useful in
wiring static control or datapaths. Static port configuration is typical of FPGA interconnect.
3. Dynamic Source Mode – The value stored in the port configuration word is ignored. Instead
the output of the associated ﬂoating port (see Figure 13.1) controls the input source on a
cycle-by-cycle basis. This is useful when datapath sources need to switch during normal
operation. For example, during a relaxation algorithm, a BFU might need to alternately take
input from each of its neighbors.
The ﬂoating port and function ports are conﬁgured similarly, but only support the static value and
static source modes.
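The three port modes can be summarized as a small selection function. The sketch below is a behavioral model only: the class and function names are hypothetical, network buses are represented as a Python list indexed by bus number, and the encoding of the configuration word is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class PortMode(Enum):
    STATIC_VALUE = "static value"
    STATIC_SOURCE = "static source"
    DYNAMIC_SOURCE = "dynamic source"


@dataclass
class PortConfig:
    mode: PortMode
    word: int = 0  # port configuration word (hypothetical encoding)


def port_input(config, network_buses, floating_port_value=0):
    """Return the 8-bit value presented to a BFU port on this cycle."""
    if config.mode is PortMode.STATIC_VALUE:
        return config.word & 0xFF             # the word itself is the data
    if config.mode is PortMode.STATIC_SOURCE:
        return network_buses[config.word]     # the word selects a fixed bus
    # Dynamic source: the word is ignored; the floating port picks the bus.
    return network_buses[floating_port_value]


buses = [11, 22, 33]
print(port_input(PortConfig(PortMode.STATIC_VALUE, 7), buses))
print(port_input(PortConfig(PortMode.STATIC_SOURCE, 2), buses))
print(port_input(PortConfig(PortMode.DYNAMIC_SOURCE, 2), buses, floating_port_value=0))
```

In the relaxation example above, the floating port value would step through the bus numbers of the neighboring BFUs from cycle to cycle.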
13.2.4 Port Contexts
MATRIX metaconfiguration information is also multicontext in two ways.
Control As shown in Figure 13.4, each port actually has two configuration words selected by a
control bit. This control bit is generated by the NOR plane or reduction network (Comp/Reduce II)
in the control portion of the BFU (Figure 13.2). This arrangement allows control data to locally
affect each BFU’s operation.
One common use of this control function is in a BFU which operates as the program counter.
A typical program counter holds its value (PC) on the BFU output. In normal operation, the BFU
simply increments its current value (PC=PC+1). When a branch test succeeds, the program counter
BFU loads its value from its own memory (PC=mem[PC]) rather than incrementing. To arrange
this, control logic is set to route the “take branch” condition on the control bit. One control context
is used for the not taken branch case and simply configures the BFU to increment the PC. The other
control context is used for the taken branch condition and configures the BFU to use the current PC
as an address into memory for a read operation.
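The program-counter arrangement can be sketched as a two-context step function, with the control bit choosing between the two behaviors. This is an illustrative model, not the microarchitecture: the function names are hypothetical, and the 8-bit wraparound is an assumption based on the 8-bit BFU output.

```python
def pc_step(pc, branch_memory, take_branch):
    """Advance the program counter BFU by one cycle."""
    if take_branch:
        # Control context for a taken branch: PC = mem[PC].
        return branch_memory[pc] & 0xFF
    # Control context for a not-taken branch: PC = PC + 1.
    return (pc + 1) & 0xFF


def run_pc(branch_memory, branch_conditions, start=0):
    """Return the sequence of PC values for a sequence of control-bit values."""
    trace = [start]
    for taken in branch_conditions:
        trace.append(pc_step(trace[-1], branch_memory, taken))
    return trace


memory = [0] * 256
memory[3] = 1  # a taken branch at PC = 3 jumps back to address 1
print(run_pc(memory, [False, False, False, True, False]))
```

Storing the branch target at the address of the branching instruction is what lets a single BFU memory serve as the branch table.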
Since the control bit can come from the NOR plane, it can be slaved to any bit on any bus
distributed to the BFU. This allows a controller to use a BFU or collection of BFUs as two context
devices. A single datapath byte can control up to eight such BFUs independently if each BFU is
conﬁgured to select a distinct bit from the control byte.
Global Additionally, the entire metaconfiguration data is replicated multiple times and controlled
by a single, array-wide context select similar to the DPGA (Chapter 10). In our current
microarchitecture we have four global contexts, two of which are hardwired and two of which are
programmable. The hardwired contexts are intended for bootstrapping and device programming.
They configure the entire device into a known configuration of datapaths so that metaconfiguration,
configuration, and initial data can be loaded into the array. The two programmable contexts allow:
1. background loading of metaconﬁguration data
2. assembly of new global context data without affecting the current, operating context
3. atomic swap between assembled configurations
The global contexts can also be used to provide DPGA-style multicontext swapping between
conﬁgurations. Coupling the two programmable contexts with the two control contexts, the entire
array can be treated as a four context device without dedicating BFU memory for context data.
13.2.5 Metaconﬁguration Conﬁguration
The metaconﬁguration data for each BFU can be written by a BFU write operation. The
metaconﬁguration data is in a different address space from the BFU local memory. Access to the
metaconfiguration data versus the normal BFU memory is controlled by the instruction issued
to the BFU memory function port (Figure 13.1). This arrangement allows the metaconfiguration to
be loaded in one of several ways:
1. A hardwired context can make the entire device look like a memory so that an external device
can perform memory-mapped writes to configure metaconfiguration and configuration data.
2. A hardwired context can set up the device to bootstrap load a configuration from a slave
memory.
3. A controller can be configured on the array which can write metaconfiguration to other BFUs.
There can be any number of controllers controlling any subsets of the array limited only by
raw resource availability. More controllers can increase reconﬁguration bandwidth at the
expense of taking BFU resources away from datapath computations.
4. A BFU may write to its own metaconﬁguration. This usually requires some assist from
other BFUs. However, with the control contexts, there are useful conﬁgurations where a
single BFU may reconﬁgure portions of itself. For example, a BFU in a systolic datapath
could be conﬁgured primarily to perform one, ﬁxed ALU operation. When a control event is
signaled, its second control context could reload the ALU function port conﬁguration from
its local memory. When it returns to datapath operation, the BFU now performs the new
operation. This basic reconfiguration scheme allows MATRIX to efficiently handle a variety
of quasistatic instruction streams.
Note that the existence of two programmable, global contexts is useful for providing atomic,
coordinated, array-wide context swaps. In typical use, the array would operate in one context while
writing new configuration data into an unused, programmable configuration context. Once that
context was fully programmed, the global context select would change, effecting the array-wide
switch.
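The typical use just described is a form of double buffering. The sketch below models only the two programmable global contexts (the hardwired contexts are left out), with a hypothetical `GlobalContexts` class and a dictionary standing in for each context's metaconfiguration storage.

```python
class GlobalContexts:
    """Two programmable contexts: one active, one loaded in the background."""

    def __init__(self):
        self.contexts = [{}, {}]  # address -> configuration word
        self.active = 0           # array-wide context select

    def background_write(self, address, word):
        """Write into the inactive context without disturbing operation."""
        self.contexts[1 - self.active][address] = word

    def swap(self):
        """Flip the context select; every BFU sees the change at once."""
        self.active = 1 - self.active

    def read_active(self, address):
        return self.contexts[self.active].get(address)


array = GlobalContexts()
array.background_write(0x10, 0xAB)
print(array.read_active(0x10))  # the operating context is unaffected
array.swap()
print(array.read_active(0x10))  # the newly assembled context is now active
```

Because all writes go to the inactive copy, a partially assembled configuration is never visible to the running computation.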
13.2.6 Time-Switching
MATRIX ports can also operate in a time-switched mode, inspired by the time-switched input
register (Section 12.1). In Chapter 12, we saw that the ability to latch and hold input values at
designated microcycles, along with switched interconnect, allowed us to minimize the constraints
required during design mapping and thereby perform physical mapping quickly. Each MATRIX
port has a time matching unit, as does memory writeback. When metaconfiguration sets a BFU into
time-switched mode, each input is loaded only on its programmed microcycle as with TSFPGA.
The timestep for MATRIX is broadcast along a designated global line. In time-switching mode, the
metaconfiguration dedicates these global lines and provides for the proper distribution of a timestep
value. Typically, the remaining global lines will be dynamically switched to provide the necessary
interconnect between BFUs. In situations where light multiplexing is all that is required, the control
contexts may provide sufﬁcient switched routing. For more heavily shared switching resources,
global and bypass lines can be time-switched, with each getting its own BFU instruction store to
control its operation. Time-switched routing will, of course, slow down MATRIX operation. This
mode is intended primarily for fast, hands-off, automatic mapping during early development.
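A time-switched input can be modeled as a register that loads only when the broadcast timestep matches its programmed value and otherwise holds. The class below is a minimal sketch with hypothetical names; it ignores how the timestep is distributed and how many microcycles make up a full evaluation.

```python
class TimeSwitchedInput:
    """Input register that latches the bus only on its programmed microcycle."""

    def __init__(self, load_timestep):
        self.load_timestep = load_timestep
        self.value = None  # nothing latched yet

    def clock(self, timestep, bus_value):
        if timestep == self.load_timestep:
            self.value = bus_value  # time match: capture the bus
        return self.value           # otherwise hold the previous value


port = TimeSwitchedInput(load_timestep=2)
for timestep, bus_value in enumerate([10, 20, 30, 40]):
    print(timestep, port.clock(timestep, bus_value))
```

The held value stays available to the BFU after the shared bus has moved on to carry other signals, which is what frees the interconnect from retiming duties.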
13.2.7 Resource Deployment Granularity
The primitives in the architecture do deﬁne a granularity at which resources must be deployed.
Datapaths and non-local control paths can only come in 8-bit multiples. Context memories come in
256 instruction deep chunks. Compute elements come as 8-bit ALUs with 128-word register files.
Due to the ﬂexible instruction distribution introduced above and discussed further in Sec-
tion 13.4, MATRIX’s granularity does not have the same kind of effects as conventional architec-
tures (Chapter 9). For task requirements below 8-bits, the datapath suffers similar to traditional
architectures. For task requirements above 8 bits, at most 7 bits of the datapath ever go wasted, and
MATRIX does not waste space on instruction stores holding redundant data as would conventional
8-bit architectures.
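The bound of at most 7 wasted bits follows from rounding a datapath width up to the next multiple of 8. A short sketch, with a hypothetical function name:

```python
def bfus_for_datapath(width_bits, bfu_bits=8):
    """Return (BFUs needed, datapath bits left unused) for a given width."""
    bfus = -(-width_bits // bfu_bits)  # ceiling division
    wasted = bfus * bfu_bits - width_bits
    return bfus, wasted


for width in (9, 12, 16, 24):
    print(width, bfus_for_datapath(width))
```

A 9-bit task is the worst case above 8 bits: it needs two BFUs and leaves 7 bits unused, while widths that are exact multiples of 8 waste nothing.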
13.2.8 Additional Information
For additional detail on the MATRIX microarchitecture see [Mir96].
[Figure: systolic array of eight filter tap stages built from cells labeled Pass, Mult, and Add, with input x_i (8 bit) and output y_i (16 bit)]
Figure 13.5: Systolic Convolution Implementation
13.3 Usage Example: Finite-Impulse Response Filter
In this section we present a range of implementation options for a single task, convolution,
in order to illustrate MATRIX usage and further ground the features of this architecture. The
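convolution task is as follows: Given a set of $k$ weights $\{w_1, w_2, \ldots, w_k\}$ and a sequence of
samples $\{x_1, x_2, \ldots\}$, compute a sequence of results $\{y_1, y_2, \ldots\}$ according to:
$$y_i = w_1 \cdot x_i + w_2 \cdot x_{i+1} + \cdots + w_k \cdot x_{i+k-1} \qquad (13.1)$$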
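As a software reference for the task, the sketch below computes the results directly, reading Equation 13.1 as $y_i = \sum_{j=1}^{k} w_j \cdot x_{i+j-1}$. That reading is our reconstruction of the formula, the function name is hypothetical, and the code uses zero-based Python lists rather than the one-based indices of the equation.

```python
def convolve(weights, samples):
    """Return every result y_i for which all k samples are available."""
    k = len(weights)
    return [
        sum(w * x for w, x in zip(weights, samples[i:i + k]))
        for i in range(len(samples) - k + 1)
    ]


print(convolve([1, 2], [1, 2, 3]))
```

Each result costs $k$ multiplies and $k-1$ adds, which is the work that the implementations below distribute across BFUs in different ways.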
Systolic Figure 13.5 shows an eight-weight ($k = 8$) convolution of 8-bit samples accumulating
a 16-bit result value. The top row simply carries sample values through the systolic pipeline. The
middle row performs an 8×8 multiply against the constant weights, the $w$’s, producing a 16-bit result.
The multiply operation is the rate limiter in this task, requiring two cycles to produce each 16-bit
result. The lower two rows accumulate results. In this case, all datapaths (shown with arrows in
the diagram) are wired using static source mode (Figure 13.4). The constant weights are configured
as static value sources to the multiplier cells. Add operations are configured for carry chaining to
perform the required 16-bit add operation. For a $k$-weight filter, this arrangement requires $4k$ cells
and produces one result every 2 cycles, completing, on average, $k/2$ 8×8 multiplies and $k/2$ 16-bit
adds per cycle.
In practice, we can:
1. Use the horizontal level-two bypass lines for pipelining the inputs, removing the need for the
top row of BFUs simply to carry sample values through the pipeline.
2. Use both the horizontal and vertical level-two bypass lines to retime the data flowing through
the add pipeline so that only a single BFU adder is needed per filter tap stage.
3. Use three I-stores and a program counter (PC) to control the operation of the multiply and
add BFUs, as well as the advance of samples along the sample pipeline.
The $k$-weight filter can be implemented with only $2k+4$ cells in practice.
[Figure: microcoded datapath with a single ALU BFU, a PC BFU, and I-stores labeled src I (two), a I, b I, mf I, and alu I; input x_i (8 bit), output y_i (16 bits output over 2 cycles)]
Label ALU Op PC
newsample Rxp Rxp + 1 ; Match 1 (6 bits) BNE xpcont1
Rxp new (pipelined branch slot)
Rxp 65
xpcont1 Rxp new
Rs Rxp
Rwp 1
Rw Rwp
Rs Rs Rw
Rw -continue
Rl Rs; Match false BNE enterloop
Rh Rw (pipelined branch slot)
innerloop Rs Rs Rw
Rw -continue
Rl Rs + Rl
Rh Rw +-continue Rh
enterloop Rxp Rxp + 1 ; Match 1 (6 bits) BNE xpcont2
Rs Rxp (pipelined branch slot)
Rxp 65
Rs Rxp
xpcont2 Rwp Rwp + 1 ; Match 1 (6 bits) BNE innerloop
Rw Rwp (pipelined branch slot)
last read Rl ; Match false BNE newsample
read Rh (pipelined branch slot)
Figure 13.6: Microcoded Convolution Implementation
Microcoded Figure 13.6 shows a microcoded convolution implementation. The coefficient
weights are stored in the ALU register-file memory in registers 1 through $k$ and the last $k$ samples
are stored in a ring buffer constructed from registers 65 through $64+k$. Six other memory locations
(Rs, Rsp, Rw, Rwp, Rl, and Rh) are used to hold values during the computation. The ALU’s A and
B ports are set to dynamic source mode. I-store memories are used to drive the values controlling
the source of the A and B input (two memories), the values fed into the A and B inputs ($I_a$, $I_b$),
the memory function ($I_{mf}$) and the ALU function ($I_{alu}$). The PC is a BFU set up to increment its
output value or load an address from its associated memory as described in Section 13.2.4.
The implementation requires 8 BFUs and produces a new 16-bit result every $8k+9$ cycles.
The result is output over two cycles on the ALU’s output bus. The number of weights supported
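[Figure: custom VLIW datapath with Xptr and Wptr pointer BFUs, a multiply BFU (X), an add BFU (+), a PC BFU, and I-stores labeled w I, x I, src I, alu I (two), and Ia; input x_i (8 bit), output y_i (16 bits output over 2 cycles)]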
Label Xptr unit Wptr unit PC MPY unit +-unit
ﬁrstsample Xptr 64 Wptr 0
output Xptr output Wptr Xptr new
nextsample Xptr++ MOD 64 Wptr++ Xptr Wptr
output Xptr output Wptr -continue Rlow MPY-result
Xptr++ MOD 64 Wptr++ Xptr Wptr Rhigh MPY-result
output Xptr output Wptr -continue Rlow Rlow + MPY-result
innerloop Xptr++ MOD 64 Wptr++; Match BNE innerloop Xptr Wptr Rhigh Rhigh + MPY-result
output Xptr output Wptr (pipelined branch slot) -continue Rlow Rlow + MPY-result
last output Xptr output Wptr Xptr Wptr Rhigh Rhigh + MPY-result
Xptr++ MOD 64 Wptr 0; Match false BNE nextsample -continue Rlow Rlow + MPY-result
output Xptr output Wptr (pipeline branch slot) Xptr new Rhigh Rhigh + MPY-result
Boxed values in last are the pair of output bytes at the end of each convolution.
Figure 13.7: Custom VLIW Convolution Implementation
is limited to $k \le 61$ by the space in the ALU’s memory. Longer convolutions (larger $k$) can be
supported by deploying additional memories to hold sample and coefficient values.
Custom VLIW (Horizontal Microcode) Figure 13.7 shows a VLIW-style implementation of the
convolution operation that includes application-specific dataflow. The sample pointer (Xptr) and
the coefficient pointer (Wptr) are each given a BFU, and separate ALUs are used for the multiply
operation and the summing add operation. This configuration allows the inner loop to consist of only
two operations, the two-cycle multiply in parallel with the low and high byte additions. Pointer
increments are also performed in parallel. Conventional digital signal processors are generally
designed to handle this kind of ﬁltering problem well, and, not coincidentally, the datapath used
here is quite similar to modern DSP architectures. Most of the I-stores used in this design only
contain a couple of distinct instructions. With clever use of the control PLA and conﬁguration
words, the number of I-stores can be cut in half making this implementation no more costly than
the microcoded implementation.
As shown, the implementationrequires 11 BFUs and produces a new 16-bit resultevery 2 1
cycles. As in the microcoded example the result is output over two cycles on the ALU output bus.
The number of weights supported is limited to 64 by the space in the ALU’s memory.
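As a rough illustration of the throughput gap between the microcoded and custom VLIW versions, the sketch below tabulates cycles per output sample. It assumes the per-result costs are 8k+9 and 2k+1 cycles for a k-weight convolution, a reading that agrees with the 80ns/TAP and 20ns/TAP entries of Table 13.7 at a 10ns cycle; the function names are illustrative and not part of MATRIX.

```python
def microcoded_cycles_per_result(k: int) -> int:
    """Cycles per 16-bit result, microcoded version (assumed 8k + 9)."""
    return 8 * k + 9


def vliw_cycles_per_result(k: int) -> int:
    """Cycles per 16-bit result, custom VLIW version (assumed 2k + 1)."""
    return 2 * k + 1


def result_period_ns(cycles: int, cycle_ns: float = 10.0) -> float:
    """Time between results at the 10 ns cycle estimated for the prototype."""
    return cycles * cycle_ns


if __name__ == "__main__":
    for k in (8, 16, 61):
        micro = microcoded_cycles_per_result(k)
        vliw = vliw_cycles_per_result(k)
        # The speedup approaches 4x as k grows, paid for with 11 BFUs versus 8.
        print(k, micro, vliw, round(micro / vliw, 2))
```

Under these assumptions the VLIW version trades three extra BFUs for nearly a fourfold reduction in cycles per result on long filters.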
VLIW/MSIMD Figure 13.8showsaMultiple-SIMD/VLIWhybridimplementationbasedon the
controlstructurefromtheVLIW implementation. Asshownin theﬁgure,sixseparateconvolutions
are performed simultaneously sharing the same VLIW control developed to perform a single
convolution, amortizing the cost of the control overhead. To exploit shared control in this manner,
270Xptr
w I
x I
src I alu I PC
Wptr
alu I Ia
X
+
X X X X X
+ + + + +
i x1
y1 i
i x2
y2 i y3 y4 i y6
x6 i x4 i x3 i
i i y5 i
x5 i
Figure 13.8: VLIW/MSIMD Convolution Implementation
the sample data streams must receive data at the same rate in lock step.
Whensampleratesdiffer, separatecontrolmayberequiredforeachdifferentrate. Thisamounts
to replicating the VLIW control section for each data stream. In the extreme of one control unit
per data stream, we would have a VLIW/MIMD implementation. Between the two extremes, we
have VLIW/MSIMD hybrids with varying numbers of control streams according to the application
requirements.
Comments Of course, many variations on these themes are possible. The power of the MATRIX
architecture is its ability to deploy resources for control based on application regularity, throughput
requirements, and space available. In contrast, traditional microprocessors, VLIW, or SIMD
machines ﬁx the assignment of control resources, memory, and datapath ﬂow at fabrication time,
while traditional programmable logic does not support the high-speed reuse of functional units to
perform different functions.
13.4 Flexible Instruction Distribution
MATRIX supports flexible allocation of instruction control resources as a consequence of the
BFU, network, and port architecture described in Section 13.2.
Instruction Depth We can directly select an instruction depth of 1 or 256. If the instruction
does not need to change, we can directly conﬁgure it via static value in the port metaconﬁguration.
If the instruction changes between only two values, we can use the control context. In certain
situations, we can use the global contexts to support up to four contexts using the metaconfiguration
contexts. When instructions need to change among more than a few values, we can allocate a BFU
as an instruction store and use static source mode to configure said distribution. If more than 256
instructions are needed, we can use dynamic source mode to expand the selection to one of several
different memory sources, allowing a large instruction space.
Note that conventional FPGAs are characterized by an instruction depth of one, while an
instruction depth of 256-1024 is typical for conventional processor architectures.
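The selection rule described above can be summarized as a small decision function. This is only a sketch of the prose: the thresholds (1, 2, 4, 256) come from the text, while the function name and the returned labels are hypothetical.

```python
def instruction_source(distinct_instructions: int) -> str:
    """Pick the cheapest MATRIX mechanism for a given instruction depth.

    Mirrors the prose: static value, control context, global contexts,
    a BFU used as an instruction store, or several stores selected
    through dynamic source mode.
    """
    if distinct_instructions < 1:
        raise ValueError("at least one instruction is required")
    if distinct_instructions == 1:
        return "static value in port metaconfiguration"
    if distinct_instructions == 2:
        return "control context"
    if distinct_instructions <= 4:
        # The text limits this option to "certain situations".
        return "global metaconfiguration contexts"
    if distinct_instructions <= 256:
        return "BFU instruction store, static source mode"
    return "multiple instruction stores, dynamic source mode"


if __name__ == "__main__":
    for depth in (1, 2, 4, 200, 1024):
        print(depth, "->", instruction_source(depth))
```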
Datapath Granularity We can control the datapath granularity to any multiple of 8-bits. The
network allows fanout at all three levels of the hierarchy. To build a datapath whose width is a
multiple of 8 bits, we need only configure the BFUs used as datapath elements to take their
instructions from the same instruction memories (See Figure 13.9). Notice that this is not the same
as having a conventional microarchitecture with a datapath width of 8, as introduced in Chapters 8
and 9. In a conventional case, each 8-bit
datapath element would have its own instruction memory, whereas, for MATRIX, we get to use a
single instruction memory for all datapath elements (See Figure 13.10).
Notice also that the ability to assign instruction memories to composed datapaths is different
from the segmentable datapaths in modern multimedia processors (Section 4.7), multigauge SIMD
architectures (e.g. [Sny85] [BSV+95]), or the Kartashev dynamic architecture [KK79]. In these
architectures, all the bit processing elements in a predefined datapath perform the same operation.
These generally exhibit SIMD instruction control for the datapath, but can be dynamically or
quasistatically reconfigured to treat the full datapath as several narrower words, for certain restricted
word widths. MATRIX does not have to perform the same ALU function across all datapath
segments like these architectures.
Instruction Streams The number of instruction streams on a MATRIX component is limited
only by the availability of resources. If the entire operation is efficiently handled by a systolic
architecture, no resources, BFUs or interconnect, need be sacrificed to control. For highly regular
operations where SIMD control is effective, MATRIX need only dedicate a single set of BFUs
to broadcast the instructions to the rest of the array. As the application needs more independent
instruction streams, more BFUs can be allocated to provide separate instruction streams. Like
MSIMD (e.g. [Bri90, Nut77]) or MIMD multigauge [Sny85] designs, the array can be broken into
units operating on different instructions. Synchronization between the separate functions can be
lock-step VLIW or completely orthogonal depending on the application. Unlike traditional MSIMD
or multigauge MIMD designs, the control processors and array processors are built out of the same
building block resources and networking. Consequently, more array resources are available as fewer
control resources are used (See Figure 13.11).

[Figure 13.9: Configurable Datapaths. Here we show 8-, 16-, 24-, and 32-bit datapaths built on top of MATRIX. Configurable instruction distribution allows multiple datapath BFUs to share a single set of instruction stores.]
Control Streams Similarly, MATRIX can handle any number of independent control streams.
Each can have their own program counter realizing a MIMD architecture, or they can all be slaved
to a single program counter realizing a VLIW architecture (See Figure 13.12). Between these
extremes, any number of instruction streams may be associated with each program counter in the
same way that any number of datapath elements can be slaved to a single instruction stream. Again,
as we noted for datapath granularity and instruction streams, control resources come from the same
pool of resources as datapath elements – as an application can be described with fewer control
streams, more BFUs are available to serve as datapath elements.
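The trade between control and datapath resources can be made concrete with a small accounting sketch. It assumes, purely for illustration, that each independent stream consumes four BFUs (a PC plus three instruction stores, as suggested by the PC, alu I, IA, IB groups in the figures); the true cost depends on the application, and the function name is hypothetical.

```python
def datapath_bfus(total_bfus: int, streams: int,
                  control_bfus_per_stream: int = 4) -> int:
    """BFUs left for datapath use after allocating control for each stream.

    control_bfus_per_stream = 4 is an assumption (PC + three I-stores);
    a purely systolic design corresponds to streams = 0.
    """
    used = streams * control_bfus_per_stream
    if streams < 0 or used > total_bfus:
        raise ValueError("not enough BFUs for the requested streams")
    return total_bfus - used


if __name__ == "__main__":
    for streams in (0, 1, 2, 4, 8):
        print(streams, "streams ->", datapath_bfus(100, streams), "datapath BFUs")
```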
[Figure 13.10: Datapath Composition: MATRIX versus Conventional 8-bit Architecture. While using an 8-bit primitive datapath element, the MATRIX microarchitecture is very different from a conventional architecture with a datapath width of 8. Conventional architectures rigidly bind instructions and control with datapaths, whereas MATRIX deploys the resources separately. Consequently, MATRIX can share control and instruction memory across a composed datapath, whereas conventional architectures do not allow such sharing.]
[Figure 13.11: Configurable Instruction Streams. Here we show one, two, three, and four instruction streams controlling a set of 16-bit datapaths. Given a fixed array size, as the number of independent instruction streams decreases, more array resources can be dedicated to SIMD datapaths.]
[Figure 13.12: Configurable Control Streams. Top: MIMD control with six 16-bit data streams, each with independent control. Bottom: VLIW control with seven 16-bit data streams directed by a single control unit.]
[Figure 13.13: MATRIX BFU Composition. The layout shows the main memory, ALU, multiplier, registers, control logic with OR plane, configuration memory, network switches, and network drivers.]
Technology   0.5µ CMOS
BFU Size     1.5mm × 1.2mm (1.8mm², 29Mλ²)
Data Width   8-bit
Memory       256×8
Cycle        10 ns (estimate)
Unit                              Fraction
Main Memory                       26%
ALU+multiplier                    7%
Switches and Drivers              42%
Registers including time-switch   6%
Port Config                       9%
Control (with config)             10%
Table 13.1: Area Breakdown for Prototype MATRIX BFU Implementation
13.5 MATRIX Implementation
Figure 13.13 shows the composition of the prototype BFU developed by Ethan Mirsky [Mir96],
along with its size and projected performance. Table 13.1 shows the area breakdown from the
prototype implementation. As described in Section 13.2, MATRIX operation is pipelined at the
BFU level, allowing high-speed implementation. With only a small memory read, an ALU operation,
and local network distribution, the basic cycle time can be quite small, at least comparable to
microprocessor cycle times. 100 MHz operation is the target for the prototype design. At 1.8mm²,
100 BFUs fit on a 17mm × 14mm die. A 100 BFU MATRIX device operating at 100MHz has a
peak performance of 10^10 8-bit operations per second (10 Gop/s).
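The peak-performance figure follows directly from the BFU count and clock rate. A minimal sketch of that arithmetic, using only the numbers quoted above (constant and function names are illustrative):

```python
BFU_AREA_MM2 = 1.8        # prototype BFU area
BFUS = 100                # BFUs on the example die
CLOCK_HZ = 100e6          # 100 MHz target
DIE_MM = (17.0, 14.0)     # example die dimensions


def peak_ops_per_second(bfus: int, clock_hz: float) -> float:
    """One 8-bit operation per BFU per cycle."""
    return bfus * clock_hz


def bfu_area_fraction(bfus: int, bfu_area_mm2: float, die_mm) -> float:
    """Fraction of the die occupied by BFUs (the rest is pads, routing, etc.)."""
    return bfus * bfu_area_mm2 / (die_mm[0] * die_mm[1])


if __name__ == "__main__":
    print(peak_ops_per_second(BFUS, CLOCK_HZ))                 # 8-bit ops per second
    print(round(bfu_area_fraction(BFUS, BFU_AREA_MM2, DIE_MM), 3))
```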
Unit             Elements                    Element Size   Total Size   Fraction
Port Switching   30×8×8 = 1920               2.5Kλ²         4.8Mλ²       46%
Main Memory      256×8 = 2048                1.2Kλ²         2.5Mλ²       24%
Config. Memory   135×8 = 1080                1.2Kλ²         1.3Mλ²       12%
Control          (NOR) 20×8
                 (match) +9×1
                 (control bit) +21×1 = 190   3Kλ²           0.6Mλ²       6%
ALU              8b                          20Kλ²          0.2Mλ²       2%
MPY              8×8 = 64                    7Kλ²           0.5Mλ²       5%
Registers        8×8+16 = 80                 4Kλ²           0.3Mλ²       3%
TS               9                           16Kλ²          0.15Mλ²      1%
Sum                                                         10-11Mλ²
Table 13.2: MATRIX BFU Composition Estimate
MATRIX is sufficiently different from conventional architectures that our model from Chapter 9
does not quite apply. We can, however, account for the specific composition of our microarchitecture.
Table 13.2 summarizes the constituent elements of the MATRIX BFU along with estimated areas.
The MATRIX size estimate is about one-third the size of the prototype implementation, suggesting
there is considerable room for improvement relative to the prototype design. The prototype is
a first-generation, one-student, university prototype of a novel architecture. As such, it is not
surprising that it is not the most compact design.
Nevertheless, both area views agree on rough area proportions. Switches and drivers occupy
roughly 45% of the area. The main BFU memory accounts for 25% of BFU area. Metaconfiguration
makes up roughly 10% of the BFU. The ALU and multiplier compose only 7% of the area.
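The totals in Table 13.2 can be rechecked from the element counts and per-element areas. The sketch below does so under the assumption that each row's total is simply count × element area, with areas in λ²; the dictionary layout and names are illustrative.

```python
# (element count, area per element in lambda^2), taken from Table 13.2
ELEMENTS = {
    "Port Switching": (30 * 8 * 8, 2.5e3),
    "Main Memory":    (256 * 8, 1.2e3),
    "Config. Memory": (135 * 8, 1.2e3),
    "Control":        (20 * 8 + 9 * 1 + 21 * 1, 3e3),   # NOR + match + control bit
    "ALU":            (8, 20e3),
    "MPY":            (8 * 8, 7e3),
    "Registers":      (8 * 8 + 16, 4e3),
    "TS":             (9, 16e3),
}


def unit_totals(elements):
    """Total area per unit, in lambda^2."""
    return {name: count * area for name, (count, area) in elements.items()}


def bfu_estimate(elements) -> float:
    """Estimated BFU area as the sum of all unit totals."""
    return sum(unit_totals(elements).values())


if __name__ == "__main__":
    total = bfu_estimate(ELEMENTS)
    for name, area in unit_totals(ELEMENTS).items():
        print(f"{name:15s} {area / 1e6:5.2f}M  {100 * area / total:4.1f}%")
    print(f"Sum {total / 1e6:.1f}M lambda^2")
```

The computed sum lands within the 10-11Mλ² range quoted in the table, roughly one-third of the 28.8Mλ² prototype.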
13.6 Building Block Efficiency
The MATRIX BFU serves several roles. It is interesting to consider its efﬁciency in each of
these roles.
13.6.1 Memory
MATRIX packs 2048 RAM bits into 28.8Mλ² in the prototype or, perhaps, 10Mλ² in an
optimized design. If we only use the BFU for its memory array, each memory bit cell is effectively
14Kλ², or 5Kλ², respectively. Of course, the MATRIX memory only comes in 256×8 blocks and
will, therefore, be less dense when smaller memories, or memories which are not even multiples of this
size, are needed.
Versus Custom Memory Since memory only accounts for 25% of the MATRIX array, MATRIX
memory is only one-fourth as dense as a custom memory of the same size.
Versus Gate Array Memory A RAM cell implemented in a gate-array process is roughly 6Kλ²
(e.g. [Fos96]). This size is comparable to the amortized bit area according to the model (5Kλ²) and
is half of the size of the amortized bit cell area in the current prototype (14Kλ²).
Versus Xilinx 4K Memory The Xilinx 4K series [Xil94b] can use each CLB as 32 bits of RAM.
From Table 4.13, we know a Xilinx 4K CLB is 1.25Mλ², making each memory bit roughly 39Kλ²,
which is 3-7× larger than the amortized MATRIX RAM bit area.
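The amortized bit areas quoted in this subsection are simple quotients of block area over bits stored. A short sketch of that calculation, with areas in λ² and illustrative names:

```python
def amortized_bit_area(block_area: float, bits: int) -> float:
    """Area charged to each memory bit when the whole block is used as RAM."""
    return block_area / bits


MATRIX_PROTOTYPE = amortized_bit_area(28.8e6, 256 * 8)   # prototype BFU
MATRIX_MODEL = amortized_bit_area(10e6, 256 * 8)         # optimized estimate
XILINX_4K = amortized_bit_area(1.25e6, 32)               # one CLB as 32 bits
GATE_ARRAY = 6e3                                         # quoted from [Fos96]

if __name__ == "__main__":
    print(round(MATRIX_PROTOTYPE), round(MATRIX_MODEL), round(XILINX_4K))
    print(round(XILINX_4K / MATRIX_PROTOTYPE, 1), round(XILINX_4K / MATRIX_MODEL, 1))
```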
13.6.2 Datapath Elements
Versus Hardwired ALU The ALU and multiplier make up only 7% of the BFU area. This
suggests a datapath of BFUs could be a good 10-20× less dense than a full custom implementation
of the same task.
Versus Systolic, Reconfigurable ALUs In systolic dataflow applications the ALU may be used
as a functional unit without the memory. The 25% of the BFU area which is in the memory sits idle,
as well as the 1/8 × 45% ≈ 6% for the extra function port. The control logic constitutes another
6-10% of the BFU area. Consequently, the MATRIX BFU is roughly 1.5× larger than we might
see in a pure mesh of reconfigurable ALUs, or perhaps in an architecture where the memory and
ALU were independent resources.
ALU Bit Ops In Table 4.24, we have already noted that the prototype MATRIX achieves 28 ALU
Bit Ops/λ²s, which is roughly 3× the computational density of processors (See Table 4.2). At the
same time this is 4× lower than the peak computational density offered by single-context FPGAs
(See Table 4.13). If we can realize the compaction suggested by the model, MATRIX can achieve
80 ALU Bit Ops/λ²s, bringing its peak density almost comparable to FPGAs.
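The two density figures follow from dividing the 8 ALU bits delivered per cycle by the BFU's area-time product. A minimal sketch, assuming areas in λ² and the 10ns cycle estimate (names are illustrative):

```python
def alu_bit_ops_density(bits: int, area: float, cycle_s: float) -> float:
    """ALU bit operations per lambda^2-second for one functional unit."""
    return bits / (area * cycle_s)


PROTOTYPE = alu_bit_ops_density(8, 28.8e6, 10e-9)   # prototype BFU
MODEL = alu_bit_ops_density(8, 10e6, 10e-9)         # compacted estimate

if __name__ == "__main__":
    print(round(PROTOTYPE), round(MODEL))
```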
Adder A cascaded 16-bit add can occur in one cycle on 2 MATRIX BFUs. Assuming the BFUs
are used only for the add, this consumes a capacity of 28.8Mλ² × 2 × 10ns = 0.58λ²s. An XC4005-5
performs a 16-bit add in 20.5ns on 9 CLBs, taking a capacity of 0.23λ²s, which is only 2× more
efficient than the MATRIX prototype.
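The adder comparison is an area-time product on each side. The sketch below reproduces it from the quoted figures, taking the 1.25Mλ² CLB area from the memory comparison above; the helper name is illustrative.

```python
def capacity(units: int, unit_area: float, time_s: float) -> float:
    """Area-time capacity in lambda^2-seconds."""
    return units * unit_area * time_s


MATRIX_ADD = capacity(2, 28.8e6, 10e-9)     # two BFUs, one 10 ns cycle
XC4005_ADD = capacity(9, 1.25e6, 20.5e-9)   # nine CLBs, 20.5 ns

if __name__ == "__main__":
    print(round(MATRIX_ADD, 2), round(XC4005_ADD, 2), round(MATRIX_ADD / XC4005_ADD, 1))
```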
Multiply In Chapter 5 we reviewed custom and programmable multiply implementations. Even
with the large prototype area, MATRIX achieved comparable multiply density to the best programmable
devices (See Table 5.3). MATRIX is 10-100× less computationally dense at multiplication
than full custom multipliers, which is consistent with the fact that the multiplier occupies only
3% of the BFU area. At the same time the hardwired multiplier makes the MATRIX prototype
3-10× more computationally dense than FPGAs on multiply operations.
13.7 Image Processing Examples
To get a concrete view of MATRIX application performance, we will examine several image
processing primitives implemented in custom and semi-custom silicon and compare them to MATRIX,
FPGA, and microprocessor implementations of the same task. LSI's real-time DSP chip set
[Rue89] is used to define the tasks and provide the custom implementations. The real-time chip set
includes:
1. Variable Shift Register (VSR)
2. Rank Value Filter (RVF)
3. Binary Filter and Template Matcher (BFIR)
4. Multibit FIR Filter (MFIR)
We use area and timing from the prototype for the purposes of conservative MATRIX comparisons.
13.7.1 VSR
LSI's variable shift register takes in byte-wide data and delays it a specified number of clock
cycles. It provides eight equidistant outputs. The maximum delay supported by the LSI component
is 8 × 512 = 4096 clock cycles. That is, given a sequence of inputs:
    x_1, x_2, ...
On the cycle when x_i arrives, the VSR outputs eight values:
    y_1 = x_{i - 1(4n + 8)}
    y_2 = x_{i - 2(4n + 8)}
    . . .
    y_8 = x_{i - 8(4n + 8)}
Here n is a value between 0 and 126. LSI implements their VSR in 64mm² in a 1.5µ CMOS process
(114Mλ²) using a semicustom standard cell methodology. The LSI VSR runs at a 26 MHz clock
rate (38.5ns clock).
A MATRIX implementation providing the full, worst-case functionality of the VSR requires
two BFUs to implement each 512-byte tap and two BFUs to implement a 9-bit modulo counter, for a
total of 18 BFUs (See Figure 13.14). The memory BFUs implement the shift register by alternately
two instructions, read and write. The counter counts on every cycle from zero to 4 2 1.
The low bit of the counter is selected as the control bit on the memories while the high 8 bits serve
as the memory address. The match unit on the counter is set to look for 4 2 1. When
a match occurs, the counter executes a load zero control context instead of the normal increment
context. The 18 BFUs take 28.8Mλ² × 18 = 518.4Mλ². Operating on the two-clock macrocycle,
the MATRIX VSR can run at 50MHz (20ns macroclock).
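To make the VSR's function concrete, the sketch below gives a behavioral reference model: eight outputs, each a fixed multiple of a programmable tap delay behind the input. It models behavior only, not the MATRIX read/write memory scheme; the class name, the zero-filled initial history, and the BFU-count helper are assumptions made for illustration.

```python
from collections import deque


class VariableShiftRegister:
    """Behavioral model: output j is the sample from j * tap_delay cycles ago."""

    def __init__(self, tap_delay: int, taps: int = 8):
        if not 1 <= tap_delay <= 512:
            raise ValueError("tap delay must be between 1 and 512 cycles")
        self.tap_delay = tap_delay
        self.taps = taps
        depth = tap_delay * taps
        # history[0] is the most recent earlier sample; unseen samples read as 0.
        self.history = deque([0] * depth, maxlen=depth)

    def step(self, sample: int) -> list:
        """Consume one input sample and return the delayed outputs for this cycle."""
        outputs = [self.history[j * self.tap_delay - 1]
                   for j in range(1, self.taps + 1)]
        self.history.appendleft(sample)   # the oldest entry falls off the far end
        return outputs


def matrix_vsr_bfus(taps: int = 8, bfus_per_tap: int = 2, counter_bfus: int = 2) -> int:
    """BFU count for the full-size MATRIX VSR described in the text."""
    return taps * bfus_per_tap + counter_bfus


if __name__ == "__main__":
    vsr = VariableShiftRegister(tap_delay=2, taps=3)
    for x in range(1, 8):
        print(x, vsr.step(x))
    print(matrix_vsr_bfus(), "BFUs")
```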
[Figure 13.14: MATRIX Implementation of Full 8-Tap, 4096 Shift, VSR. Sixteen memory BFUs (Mem) and two counter BFUs (Cnt) take the input x and produce the outputs y1 through y8.]
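innerloop   addi r1,#1,r1           // advance the buffer index
            ld   BUFF[r1],r2        // read the delayed sample
            st   OUTPUT,r2          // emit it
            ld   INPUT,r2           // fetch the new sample
            st   BUFF[r1],r2        // store it in the slot just read
            blt  r1,r3,innerloop    // continue until the index reaches r3
            move r0,r1              // wrap the index back to r0
            bne  r0,r0,innerloop
Figure 13.15: Processor Implementation of VSR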
A typical processor implementation of VSR (See Figure 13.15) takes 6 instructions per tap in a
tight loop. For the full 8 tap VSR, the processor implementation requires 48 instructions. MIPS-X
[HHC+87], one of the highest capacity processors we reviewed in Table 4.2, is 68Mλ². With a
50ns clock cycle, the 48 instructions will dictate, at least, a 2400ns macroclock.
An FPGA implementation would be dominated by data memory. A pure 4-LUT design would
require up to 4096 × 8 = 32K cells. At 600Kλ², a low-end estimate for 4-LUT size (See Table 7.1),
this is 19.7Gλ². Exploiting the memory in an XC4000 part, we can pack 16 × 2 bits per CLB,
requiring 256 × 4 = 1K CLBs or 1.28Gλ². The full shift register approach is trivial and should be
very fast, so we will assume 100MHz operation. Exploiting the XC4000 memories will require both
a read and a write operation, as with MATRIX, so we will assume it can achieve 50MHz operation.
Implementation   LSI       MATRIX     MIPS-X    4-LUT     XC4K
Area             114Mλ²    518Mλ²     68Mλ²     19.7Gλ²   1.28Gλ²
Cycle            38.5ns    20ns       2400ns    10ns      20ns
Capacity         4.4λ²s    10.4λ²s    163λ²s    197λ²s    25.6λ²s
Ratio            1.0       2.4        37        45        5.9
Table 13.3: VSR Implementation Comparison
Table 13.3 compares the VSR implementations. The MATRIX implementation is 2.4× larger
than the semicustom LSI implementation, 2.5× smaller than the XC4000 implementation, and
16× smaller than the processor implementation. If the shift register requires fewer than 2048 delay
slots, MATRIX can implement each tap with a single BFU and use a single counter. This cuts
the implementation area and capacity in half, bringing it within 20% of the capacity of the LSI
implementation. Smaller shift registers with fewer taps will allow further reduction in BFUs for
the MATRIX implementation. Capacity requirements for the FPGA implementations similarly
reduce with total shift register length. The capacity required for the processor implementation will
decrease with the number of taps.
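The capacity and ratio rows of Table 13.3 are area-time products normalized to the LSI part. The following sketch recomputes them from the area and cycle rows; small differences from the printed ratios are rounding effects, and the names are illustrative.

```python
# (area in lambda^2, cycle time in seconds) from Table 13.3
IMPLEMENTATIONS = {
    "LSI":    (114e6, 38.5e-9),
    "MATRIX": (518e6, 20e-9),
    "MIPS-X": (68e6, 2400e-9),
    "4-LUT":  (19.7e9, 10e-9),
    "XC4K":   (1.28e9, 20e-9),
}


def capacities(table):
    """Area-time capacity (lambda^2-seconds) per implementation."""
    return {name: area * cycle for name, (area, cycle) in table.items()}


def ratios(table, baseline="LSI"):
    """Capacity relative to the baseline implementation."""
    caps = capacities(table)
    return {name: cap / caps[baseline] for name, cap in caps.items()}


if __name__ == "__main__":
    caps, rel = capacities(IMPLEMENTATIONS), ratios(IMPLEMENTATIONS)
    for name in IMPLEMENTATIONS:
        print(f"{name:7s} {caps[name]:7.1f}  {rel[name]:5.1f}")
```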
Implementation   LSI       MATRIX      MIPS-X
Area             235Mλ²    11117Mλ²    68Mλ²
Cycle            37ns      20ns        32450ns
Capacity         8.7λ²s    222λ²s      2221λ²s
Ratio            1.0       26          260
Table 13.4: RVF Implementation Comparison
13.7.2 RVF
LSI's rank value filter selects the r-th largest 12-bit value within a 64 sample window. That is,
on each cycle, the component takes in a new 12-bit sample, x_i. It looks at the previous 64 values
(x_i, x_{i-1}, ..., x_{i-63}), and selects the r-th largest, which it outputs as y_i. If r = 1, it implements a
maximum filter; if r = 64, it implements a minimum filter; and if r = 32, it implements a median
filter. The LSI implementation occupies 132mm² in a 1.5µ CMOS process (235Mλ²) using an
array design methodology. The RVF runs at a 27 MHz clock rate (37ns clock).
The MATRIX implementation of RVF maintains a completely ordered list of the 64 window
values using a systolic priority queue scheme similar to [Lei79]. The systolic priority queue allows
it to compute incremental updates to the list ordering rather than recalculating the entire ordering
on each cycle. To simulate the 64 tap window scheme, the systolic queue supports both an insert
and a delete operation. Each macrocycle requires two microcycles – one in which the old value is
deleted and one in which the new value is inserted. A ﬁxed delay register scheme like the VSR is
used to retime the old value for deletion 64 macrocycles later.
Using this style, an n-tap, w-bit wide MATRIX RVF implementation requires 3n⌈w/8⌉ + 2
BFUs, or 386 BFUs for the 64 tap, 12-bit case as implemented in the LSI filter. Each tap requires
two active data swap registers (A and B) and a comparator, each of which needs to be as wide as
the sample data. Figure 13.16 shows the basic array structure for the 12-bit sample case where
two BFUs are required for each register and comparator. The additional two BFUs are used for
the retiming memory and its associated counter. Figure 13.17 shows details of the datapath for a
tap slice and its adjacent elements. The registers are used to propagate insert and delete values
while the registers are used to hold sorted values. values propagate away from the th item
and values propagate toward it. By inserting data at the th value location, we obtain an update
latency of only one macrocycle or two primitive MATRIX cycles. The logic for a datapath slice
is described in Figure 13.17. Note that the logic and datapath shown are for a tap position below
the r-th position in the array. The logic and flow are reversed for tap positions above it.
Figure 13.18 shows the control setup used to implement the datapath logic providing single cycle
throughput for each comparison and swap operation.
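The insert-and-delete idea behind the RVF can be illustrated with a sequential reference model. This is a software sketch, not the systolic array: it keeps the window in arrival order (the retiming role) alongside a sorted list updated incrementally, one delete and one insert per sample. The class name, the warm-up behavior, and the BFU-count helper (which assumes the count 3n⌈w/8⌉ + 2, consistent with the 386 BFUs quoted for 64 taps of 12 bits) are illustrative.

```python
import bisect
from collections import deque


class RankValueFilter:
    """Returns the r-th largest value among the most recent `taps` samples."""

    def __init__(self, taps: int = 64, rank: int = 32):
        if not 1 <= rank <= taps:
            raise ValueError("rank must lie between 1 and the number of taps")
        self.taps = taps
        self.rank = rank
        self.window = deque()     # samples in arrival order
        self.sorted_vals = []     # the same samples, ascending

    def step(self, sample: int) -> int:
        if len(self.window) == self.taps:
            old = self.window.popleft()                                # delete phase
            del self.sorted_vals[bisect.bisect_left(self.sorted_vals, old)]
        self.window.append(sample)                                     # insert phase
        bisect.insort(self.sorted_vals, sample)
        # While the window is still filling, clamp to the smallest value seen.
        index = max(len(self.sorted_vals) - self.rank, 0)
        return self.sorted_vals[index]


def matrix_rvf_bfus(taps: int, width_bits: int) -> int:
    """Assumed BFU count: three 8-bit-sliced units per tap plus two for retiming."""
    slices = -(-width_bits // 8)      # ceiling division
    return 3 * taps * slices + 2


if __name__ == "__main__":
    maxf = RankValueFilter(taps=4, rank=1)
    print([maxf.step(x) for x in [3, 1, 4, 1, 5, 9, 2, 6]])
    print(matrix_rvf_bfus(64, 12))
```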
We use a similar insert and delete structure for the processor RVF implementation, which is
shown in Figure 13.19. For any width less than the processor word size, the processor implementation
requires 10n + 9 instructions for n taps in a tight loop. For the full 64 tap RVF, the processor
implementation requires 649 instructions. Again, using the MIPS-X processor this requires 68Mλ²
and 649 × 50ns = 32450ns.
[Figure 13.16: MATRIX RVF Array. Each tap row contains two comparator BFUs (Cmp) and the high and low halves of the A and B registers (Ahigh, Alow, Bhigh, Blow); the new/old input enters at the row holding the r-th value. Internal datapath connections omitted; see Figure 13.17.]
Table 13.4 compares the RVF implementations. The MATRIX implementation is 26× larger
than the custom implementation and 10× smaller than the processor implementation. If fewer taps
are required, both the MATRIX and the processor implementations decrease linearly in the number
of taps. For 8-bit or smaller sample values, the MATRIX implementation will halve its datapath
requirements. If one only wants to filter for the maximum or minimum value, a straightforward
shift and compare reduce scheme will only require 2 BFUs per tap and operate at 100MHz throughput.
For a maximum or minimum filter, the MATRIX implementation requires less capacity than the
LSI RVF for 8-bit filters with fewer than 16 taps or 12-bit filters with fewer than 8 taps.
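[Figure 13.17: RVF Dataslice and Logic for Cells Below r-th Position. The dataslice connects the A and B registers of a tap to those of its neighbors through a comparator (Cmp). The per-cell logic branches on whether the incoming value is an old sample and on the comparison result; one branch selects MIN-VALUE.]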
Cmp   Control Bit         old bit
B     Control Bit         Compare unit's match output
      NOR computes        old (Select Source) old (Select Source)
      Control Context 0   Statically Route from
      Control Context 1   NOR plane specification selects input
A     Control Bit         Compare unit's match output
      NOR computes        old (Select MIN-VALUE Source) old (Select Source)
      Control Context 0   Statically Route from
      Control Context 1   NOR plane specification selects input
Figure 13.18: Control for MATRIX RVF for Cells Below r-th Position
// r3   - number of taps
// r5   - delay ring buffer head
// OLD  - ring buffer head
// BUFF - sorted result
new         ld   new,r4
            mov  r0,r1
            ld   OLD[r5],r6
            st   OLD[r5],r4
            beq  r5,r3,resetoldp
            addi r5,#1,r5
findloop    ld   BUFF[r1],r2
            ble  r2,r4,insert
            addi r1,#1,r1
            blt  r1,r3,findloop
insert      st   BUFF[r2],r4
            addi r1,#1,r1
            ld   BUFF[r1],r4
            blt  r1,r3,insert
findold     mov  r3,r1
            ld   BUFF[r1],r4
            addi r1,#1,r1
removeloop  ld   BUFF[r1],r2
            st   BUFF[r1],r4
            beq  r2,r6,done
            mov  r4,r2
            addi r1,#1,r1
            blt  r1,r3,removeloop
            b    next
resetoldp   mov  r0,r5
            b    findloop
Figure 13.19: Processor Implementation of RVF
[Figure 13.20: MATRIX BFIR Datapath. A row of shift BFUs feeds a row of match/count (match cnt) BFUs, whose outputs are combined by a tree of adders (+).]
13.7.3 BFIR
LSI's binary filter and template matcher performs binary template matching across a 1024 bit
template. That is:
    y_i = \sum_{j=0}^{1023} C_j AND (x_{i-j} XOR M_j)        (13.2)
Here M is a vector of 1024 bit match values and C is a mask indicating which positions are "don't
care" values and should be ignored. LSI implements their BFIR in 88mm² in a 1.5µ CMOS process
(156Mλ²) using a full custom design methodology. The LSI BFIR runs at a 27 MHz clock rate
(37ns clock).
The MATRIX implementation comes in three parts shown in Figure 13.20. A set of shift
registers provide the bit level samples. A set of BFUs use their memories to perform matching and
counting, starting with 8 bits of input and producing a 4-bit sum of the number of matches. Finally,
an adder tree reduces the partial sums to a single result. To handle the 1024 tap problem, MATRIX
requires 1024/8 = 128 BFUs for bitwise shifting and another 128 BFUs for matching. The sum tree
is 7 stages deep. Since the final two stages add 9- and 10-bit sums, they each require 2 BFUs per
addition, while each of the others requires a single BFU per sum, making for a total of 130 BFUs
in the adder tree. Together, the MATRIX implementation requires 386 BFUs (11.1Gλ²) and can
operate at the full 100MHz basic cycle rate.
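The BFU count above can be rederived by walking the adder tree stage by stage. The sketch assumes the number of taps is 8 times a power of two, that match BFUs emit 4-bit counts, that operand width grows by one bit per stage, and that an addition needs one BFU per 8 bits of operand width; the function name is illustrative.

```python
def matrix_bfir_bfus(taps: int = 1024):
    """Return (shift, match, adder_tree) BFU counts for a binary template match."""
    shift = taps // 8            # 8 one-bit taps per shifting BFU
    match = taps // 8            # 8 bits matched and counted per BFU
    tree = 0
    operands = match             # partial sums entering the current stage
    operand_width = 4            # each match BFU produces a 4-bit count
    while operands > 1:
        operands //= 2                                   # additions in this stage
        tree += operands * -(-operand_width // 8)        # BFUs per addition
        operand_width += 1                               # sums grow by one bit
    return shift, match, tree


if __name__ == "__main__":
    shift, match, tree = matrix_bfir_bfus(1024)
    total = shift + match + tree
    print(shift, match, tree, total, round(total * 28.8e6 / 1e9, 1))
```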
The processor implementation shown in Figure 13.21 stores and masks data in 32-bit units to
exploit its datapath. It also uses a programmed lookup table to count ones. The processor only
counts ones a byte at a time so that the count-ones lookup table can fit in a reasonably sized
data cache. The main loop takes 25 instructions per word. For a 1024 tap problem, this makes
(1024/32) × 25 = 32 × 25 = 800 total instructions. The MIPS-X processor implementation then is
68Mλ² and 800 × 50ns = 40000ns.
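The word-at-a-time strategy can be sketched in a few lines: XOR the data with the template, AND with the care mask, and count ones through a 256-entry byte table. This mirrors the structure of the processor loop rather than reproducing it; the function names are illustrative, and the shifting of new samples into the buffer is left out.

```python
# 256-entry lookup table: number of one bits in each byte value.
CNTONES = [bin(byte).count("1") for byte in range(256)]


def bfir_word(data: int, mask: int, care: int) -> int:
    """Count cared-about positions where a 32-bit data word differs from the template."""
    diff = (data ^ mask) & care & 0xFFFFFFFF
    total = 0
    for _ in range(4):                 # four byte-wide table lookups per word
        total += CNTONES[diff & 0xFF]
        diff >>= 8
    return total


def bfir(data_words, mask_words, care_words) -> int:
    """Sum the per-word counts across the whole template."""
    return sum(bfir_word(d, m, c)
               for d, m, c in zip(data_words, mask_words, care_words))


if __name__ == "__main__":
    words = 1024 // 32
    print(bfir([0xFFFFFFFF] * words, [0] * words, [0xFFFFFFFF] * words))
    print(words * 25, "instructions,", words * 25 * 50, "ns at 50 ns per instruction")
```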
// r14     - number of taps
// r15     - byte mask
// BUFF    - stored input
// CARE    - mask bits to check
// MASK    - values to check
// CNTONES - lookup table to count ones in a byte
new         mov  r0,r1
            ld   BUFF[r1],r2
            addi r0,TAPS,r14
            addi r0,#0xff,r15
            addi r0,r0,r6
top         ld   BUFF[r1],r3
            sh   r2,r3,r2,#1
            ld   MASK[r1],r4
            xor  r4,r2,r5
            ld   CARE[r1],r4
            and  r4,r5,r5
            and  r5,r15,r4
            ld   CNTONES[r4],r4
            add  r4,r6,r6
            asr  r5,r5,#8
            and  r5,r15,r4
            ld   CNTONES[r4],r4
            add  r4,r6,r6
            asr  r5,r5,#8
            and  r5,r15,r4
            ld   CNTONES[r4],r4
            add  r4,r6,r6
            asr  r5,r5,#8
            and  r5,r15,r4
            ld   CNTONES[r4],r4
            add  r4,r6,r6
            st   BUFF[r1],r2
            move r3,r2
            addi r1,#1,r1
            ble  r1,r14,top
Figure 13.21: Processor Implementation of BFIR
Implementation   LSI       MATRIX     MIPS-X     XC4K
Area             156Mλ²    11.1Gλ²    68Mλ²      2.32Gλ²
Cycle            37ns      10ns       40000ns    10ns
Capacity         5.8λ²s    111λ²s     2720λ²s    23λ²s
Ratio            1.0       19         470        4.0
Table 13.5: BFIR Implementation Comparison
An FPGA BFIR could take a similar form to the MATRIX implementation. 1024 LUTs would
compose the shift register. (1024/3) × 2 ≈ 684 4-LUTs compose the match and initial reduce. The
sum tree requires slightly over 1000 full adder bits: 1000 XC4K CLBs or 2000 4-LUTs. In
total, an XC4K implementation would require 1850+ CLBs, or 2.3Gλ². Using the fast carry on
the XC4K, and pipelining the adder stages, the basic cycle could be as low as 10ns assuming an
optimal physical layout.
Table 13.5 compares the BFIR implementations. The MATRIX implementation is 19× larger
than the custom implementation, 4.8× larger than the Xilinx implementation, and 24× smaller
than the MIPS-X implementation. If the "care" region is sparse, the FPGA implementation can
easily take advantage of it, using fewer match and sum reduce units (e.g. [VSCZ96]). If the sparsity
is in 8-bit chunks, MATRIX can similarly exploit the sparseness. The processor implementation
can exploit sparseness, as well, but requires even larger chunks for it to be beneficial. Resource
requirements for all the programmable implementations are proportional to the template size, so
their areas decrease with the number of binary taps.
Architecture      Reference                 Area and Time          Filter TAPs / λ²s
16b DSP           [WDW+85]                  100ns/TAP              0.65
                  [vMWvW+86]                125ns/TAP              0.090
                  [KNK+87]                  50ns/TAP               0.057
                  [CBBF87]                  60ns/TAP               0.082
                  [PML+89]                  100ns/TAP              0.051
                  [SKYH92]                  50ns/TAP               0.072
                  [USO+93]                  47ns/TAP               0.041
32b RISC MSTEP    MIPS-X [HHC+87]           10+ cycles/TAP         0.029
32b RISC/DSP      [NHK95]                   40ns/TAP               0.022
64b RISC          Alpha [GBB+96]            2.3ns/TAP              0.064
MATRIX            Section 13.3 (systolic)   3 BFUs, 20ns/TAP       0.56
                  (VLIW)                    12 BFUs, 20ns/TAP      0.14
                  (microcoded)              8 BFUs, 90ns/TAP       0.048
Full Custom       LSI [Rue89]               45ns/64 TAPs           3.6
Table 13.6: FIR Survey: 8×8 multiply, 24-bit Accumulate
13.7.4 MFIR
The LSI multibit finite-impulse response filter is a 64-tap, 8-bit FIR filter:
    y_i = \sum_{j=0}^{63} w_j x_{i-j}
The MFIR is implemented in 225mm² in a 1.5µ CMOS process (400Mλ²) using a full custom design
methodology. The LSI MFIR runs at a 22 MHz clock rate (45ns clock).
In Section 13.3, we have already seen several MATRIX FIR implementations. To handle the
same generality as the LSI MFIR, we need to handle a 24-bit accumulate instead of the 16-bit
accumulate used in the examples shown in Section 13.3. This adds one cycle per tap to the
microcoded implementation, one BFU to the VLIW implementation, and one BFU per tap to the
systolic implementation. Table 13.6 compares the LSI and MATRIX implementations along with
processor and DSP implementations. For the table, we use an application-specific metric and report
the area-time capacity required per TAP in each of the implementations.
The systolic MATRIX implementation is 6× larger than the full-custom LSI implementation,
20× smaller than the MIPS-X processor implementation, and 9× smaller than the Alpha implementation.
Note also that the VLIW MATRIX implementation, which resembles modern DSP
architectures, is 2× smaller than modern DSPs. The systolic version is 8× smaller than the DSPs.
The capacity requirements for the processors, DSPs, and MATRIX will decrease with the number of
taps, while the LSI implementation is fixed. At 10 filter taps, the systolic MATRIX implementation
uses less capacity than the LSI MFIR.
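The per-tap metric and the 10-tap crossover can be checked from the BFU counts and cycle times in Table 13.6. The sketch below assumes the 28.8Mλ² prototype BFU area and treats the LSI part as a fixed 64-tap cost; computed values may differ from the printed table in the last digit, and the names are illustrative.

```python
BFU_AREA = 28.8e6   # prototype BFU area in lambda^2


def taps_per_capacity(area: float, time_s: float, taps: int = 1) -> float:
    """Filter taps delivered per lambda^2-second of capacity."""
    return taps / (area * time_s)


SYSTOLIC = taps_per_capacity(3 * BFU_AREA, 20e-9)
VLIW = taps_per_capacity(12 * BFU_AREA, 20e-9)
MICROCODED = taps_per_capacity(8 * BFU_AREA, 90e-9)
LSI = taps_per_capacity(400e6, 45e-9, taps=64)


def systolic_break_even_taps() -> int:
    """Largest tap count at which systolic MATRIX uses less capacity than the LSI part."""
    lsi_capacity = 400e6 * 45e-9            # fixed, regardless of taps used
    per_tap = 3 * BFU_AREA * 20e-9          # MATRIX capacity scales with taps
    return int(lsi_capacity / per_tap)


if __name__ == "__main__":
    print(round(SYSTOLIC, 2), round(VLIW, 2), round(MICROCODED, 3), round(LSI, 1))
    print(systolic_break_even_taps())
```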
Table 13.7 provides an expanded table for FIRs with 16-bit accumulates. Here, we see more
clearly that the systolic MATRIX implementation is on par with reconfigurable implementations
such as PADDI and FPGAs. The MATRIX VLIW is comparable to DSPs. The MATRIX microcoded
implementation yields performance comparable to microprocessor implementations. It is this versatility
to efficiently span such a wide range of raw performance requirements which makes MATRIX an
interesting and powerful general-purpose architecture.
13.7.5 Image Processing Summary
Across the four tasks, we see that the MATRIX implementation is roughly an order of magnitude
larger than the custom implementation (6×, 19×, 26×, and 2.4×). Since it remains general-purpose,
MATRIX retains the ability to deploy resources to adapt to the problem size. For many instances
of problems the area-time penalty will be much less.
At the same time, we saw that MATRIX provided an order of magnitude smaller implementations
than conventional processors (16×, 10×, 24×, 20×). The variation in the benefits is somewhat
telling. The one task where MATRIX only had a 10× advantage is the one task which required
a 16-bit datapath, while all the others essentially used 8-bit datapaths. Combining that observation
with our earlier observation that MATRIX has 3× the raw computational density of modern
processors, we can decompose MATRIX's capacity advantage over processors roughly as:
3× raw computational capacity
4× versus 8-bit, 2× versus 16-bit: granularity (datapath deployability)
1.5-2× elimination of overhead operations
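Multiplying the three factors gives the range of advantage this decomposition predicts, which can be compared with the observed 16×, 24×, and 20× for the 8-bit tasks and 10× for the 16-bit task. A minimal sketch, treating the factors as independent multipliers (an assumption implied by, but not stated in, the list above):

```python
RAW_DENSITY = 3.0                      # raw computational density advantage
GRANULARITY = {8: 4.0, 16: 2.0}        # benefit of matching datapath width
OVERHEAD = (1.5, 2.0)                  # range for eliminated overhead operations


def advantage_range(datapath_bits: int):
    """Predicted (low, high) capacity advantage over a 32-bit processor."""
    g = GRANULARITY[datapath_bits]
    return RAW_DENSITY * g * OVERHEAD[0], RAW_DENSITY * g * OVERHEAD[1]


if __name__ == "__main__":
    print(advantage_range(8))
    print(advantage_range(16))
```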
For the highest throughput implementations of these tasks, aggressive FPGA or DPGA implementations
may approach the MATRIX implementation. We saw cases where MATRIX was
2-10× smaller than optimistic FPGA implementations. We also saw naturally bit-level tasks where
MATRIX might be 4-5× worse than an FPGA implementation.
Architecture          Reference          Area and time              Filter TAPs/λ²s
16b DSP               [WDW 85]           100ns/TAP                  0.65
                      [vMWvW 86]         125ns/TAP                  0.090
                      [KNK 87]           50ns/TAP                   0.057
                      [CBBF87]           60ns/TAP                   0.082
                      [PML 89]           100ns/TAP                  0.051
                      [SKYH92]           50ns/TAP                   0.072
                      [USO 93]           47ns/TAP                   0.041
32b RISC MSTEP        MIPS-X [HHC 87]    10+ cycles/TAP             0.029
32b RISC/DSP          [NHK95]            40ns/TAP                   0.022
64b RISC              Alpha [GBB 96]     2.3ns/TAP                  0.064
XC4K                  [GN94]             67 CLBs, 184ns/16-TAPs     1.0
                      [CME93]            400 CLBs, 100ns/4-TAPs     0.080
PADDI2                [YR95]             5 EXUs, 20ns/TAP           0.93
MATRIX (systolic)     Section 13.3       2 BFUs, 20ns/TAP           0.87
MATRIX (VLIW)         Section 13.3       11 BFUs, 20ns/TAP          0.16
MATRIX (microcoded)   Section 13.3       8 BFUs, 80ns/TAP           0.054
Gate Array,
  fixed coefficient   [YJY 90]           10ns/64 TAPs               21
Full Custom           [Rue89]            45ns/64 TAPs               3.6
                      [CLRA90]           25ns/4 TAPs                0.68
                      [GNC 90]           33ns/16 TAPs               3.5
                      [RK92]             50ns/10 TAPs               2.4
Full Custom,
  fixed coefficient   [LS92]             6.7ns/43 TAPs              57
– symmetric filter only
– 24-bit accumulate
– 16×16 architecture
Table 13.7: FIR Survey – 8×8 multiply, 16-bit Accumulate
13.8 Summary
All conventional, general-purpose computing architectures set the resources for instruction
distribution and control and bind datapaths to instructions at fabrication time. This, in turn, deﬁnes
the efﬁciency of the architecture at handling tasks with a given wordsize, throughput, and control
structure. Large applications typically work with data items of multiple sizes and subtasks with
varying amounts of regularity. Application sets have an even wider range of computational task
characteristics. Consequently, no single, ﬁxed, general-purpose architectural point can provide
robust performance across the wide range of application requirements.
To efﬁciently handle the wide range of application characteristics seen in general-purpose
computing, we developed MATRIX, a novel general-purpose architecture which uses multilevel
conﬁguration and a single pool of network and datapath elements to defer until application run
time:
1. binding of primitive elements to roles such as data memories, instruction stores, datapath
elements, or control units
2. binding of datapaths to instructions
3. interconnection of primitive elements
Using metaconfiguration, MATRIX can deploy primitive resources and interconnect to various roles
as best suits the application. In this manner, MATRIX can provide as much dynamic instruction
control, instruction sharing, static dataﬂow, or independent control ﬂow as required by the task.
MATRIX’s post-fabrication conﬁgurability of instruction organization is unique, differentiating it
from all previous general-purpose computing architectures.
An ongoing prototyping effort shows promising results. While the VLSI implementation
has considerable room for improvement, the prototype has 3× the raw computational density of
conventional processors and achieves 10× the yielded computational density on regular, byte-level
computing tasks. At the same time, the prototype holds its own on less regular tasks, achieving
performance comparable to conventional processors.
13.9 Area for Improvement
The concrete microarchitecture presented here has been our initial vehicle for studying the
basic concepts behind MATRIX and providing a concrete grounding for them. In these respects
the concrete microarchitecture has been very successful. However, this microarchitecture fails to
achieve the full breadth of performance robustness promised by the MATRIX architectural style.
Figure 13.22 shows the efficiency of the MATRIX microarchitecture at handling tasks with
various instruction depths and datapath widths. Shown alongside MATRIX is the efficiency for
a conventional architecture with fixed instruction distribution. These graphs are similar to the one
shown in Section 9.5. The efficiency is the ratio between the size of the implementation in the
target architecture versus the size of the conventional architecture with the instruction depth and
datapath width perfectly matched to the task. We assume here that MATRIX must deploy eight
BFU instruction stores per independent datapath for control. That is, we assume all eight MATRIX
ports must be fed with dynamic instructions.
It is not surprising that MATRIX does not have the peak performance of the ﬁxed architecture
at its optimal design point. However, the poor efﬁciency across such a broad range of space is
disappointing. We can identify several effects from the graph:
• The performance cliff between the path length of two and four arises since we can handle two
  contexts with the control contexts, but four or more require that we deploy BFU instruction
  stores. For datapaths of eight bits or less in width, we go from one BFU per datapath element
  to nine as the task goes from two instructions to four.
• At large path lengths (≥256), and small datapaths, we asymptotically approach 25% efficiency.
  We noted earlier (Section 9.2) that the instruction memory dominates the interconnect
  and compute area in this region. We also noted that the MATRIX memory makes up 25% of
  the BFU area (Section 13.13), so we are seeing the implementation efficiency being dictated
  by the instruction memory efficiency.
• Unlike traditional architectures, MATRIX implementations continue to become more efficient
  with larger task datapath width. As we saw in Section 13.4, MATRIX does not need to
  deploy additional instruction memories to handle larger datapaths. As the datapath grows,
  the instruction memory overhead is amortized over a greater number of elements, improving
  the overall implementation efficiency.
One thing which may be unfair in the comparison in Figure 13.22 is the interconnect Rent parameter,
p. The MATRIX microarchitecture under discussion has fully populated input switches. Also note
that this comparison is strictly based on varying instruction depth and datapath width and does
not account for variations in control requirements. The fixed architecture will suffer more than
MATRIX as the number of natural task control streams varies.
Also shown in Figure 13.22 is a modified MATRIX architecture which lessens the BFU overhead
penalty for cases with a path length between 2 and 256. The modified architecture assumes that it
can use each BFU memory as two 128×8 instruction stores, bringing both memory read ports out to
routed lines and allowing path lengths less than 256 to use only four BFUs per datapath. It also
assumes the addition of two more control contexts.
These graphs suggest:
[Figure 13.22: three surface plots of Efficiency (0.2 to 1.0) against Path Length (1 to 1024) and Design w (1 to 128).]
Top Left – 2, 2, 0.70, 8, 64, 0, 2048
Top Right – MATRIX model
Bottom – MATRIX – 4 control contexts, use BFU I-store memory as two 128×8 stores
Figure 13.22: Efficiency for MATRIX and Fixed 8-bit Architecture (p = 0.70)
• The microarchitecture may be too coarse-grained in its deployment of resources. Context
  memory deployment suffers particularly from the large chunk size for memory deployment.
• Metaconfiguration and interconnect overheads are particularly large compared to the memory
  size.
Part V
Review and Extrapolation
14. Reconfigurable Processing Architecture Review
Special-Purpose Computing: We build computing devices to algorithmically transform raw input
data into results. Special-purpose computing devices are designed with one particular
transformation embedded into their architecture and implementation. Each such device can solve
only the particular transformation problem, and that problem is set prior to device fabrication.
Conventional fabrication techniques require long turn-around (weeks to months) to produce devices,
high up front costs for setup, and large volume sales to amortize out fixed costs for design,
tooling, and equipment.
Many of the characteristics which come with special-purpose computing devices are undesirable
or untenable in numerous situations.
• Device dedicated to a single function
  – Device can be quickly obsolesced as functional requirements often change, transformations
    are tuned, algorithms advance, and missions and tasks evolve.
  – When the function needed by a task is time or data dependent, the special-purpose
    devices for functions which are not needed at some point in time sit idle and cannot be
    used for any other function which may be required by the task.
  – When lower throughput is required from the device than its native capability, the device
    has spare capacity which cannot be put to productive use.
• High up front cost
• Long delay from concept to delivery
• Economical only in volume
General-Purpose Computing: General-purpose devices are our alternative to these fixed function
devices. Here, we build computing devices which can be configured to solve a variety of computing
problems. Instead of building a device with exactly the computational units and hardwired dataflow
necessary to solve a single problem, we build a device with a set of primitive computational
elements interconnected via a flexible interconnect. Post-fabrication, we control the behavior of
the device with instructions, extra inputs which tell the device what computations to perform and
how to route data during the computation. As a result, we:
• Make a single device appealing for a wide range of tasks. While each individual task may
  lack the volume required for a dedicated device to be economical, the general applicability
  across many tasks provides the volume necessary to make the general-purpose device
  economical.
• Eliminate the fabrication delay necessary to put a new computational task into use.
• Eliminate the up front cost associated with producing custom hardware for a task.
• Make it possible to customize a single device to perform any of a large number of different
  computing tasks, allowing the device to adapt to changes in requirements, or share its capacity
  among a variety of computing tasks.
The RP-space deﬁned here models a large domain of reconﬁgurable architectures within the
general-purpose architecture space.
Reconfigurable Computing Costs: Reconfigurable devices gain their breadth of use at a cost in
computational density. Reconﬁgurable devices must add:
1. Flexible interconnect or data ﬂow
2. Instructions to control compute units and data ﬂow
Additionally, the computational units in these devices must be more general than in the special-
purpose devices where each compute unit may perform a single, focussed computation.
Replacing ﬁxed interconnect with ﬂexible interconnect is the most costly single addition for
reconﬁgurable architectures. A decent amount of programmable interconnect may add two orders
of magnitude in size to the reconﬁgurable implementation compared to the fully special-purpose
implementation of the same task.
Instructions: In contrast, the area required to hold a single, device-wide configuration is, itself,
an order of magnitude smaller than the interconnect. That is, the area taken by a single instruction
is generally an order of magnitude smaller than the active interconnect which it controls. However,
if we allocate space to hold tens of instructions per active compute element, the total instruction
memory area can easily equal the active compute and interconnect area. By the time we add
hundreds of instructions, the instruction memory area can dominate even the flexible interconnect.
With this additional order of magnitude in overhead, such a reconfigurable device can easily be
three orders of magnitude larger per computational element than its special-purpose counterpart.
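The scaling described above can be sketched with a simple linear area model. The constants below are hypothetical round numbers chosen only to match the stated orders of magnitude (flexible interconnect at about 100× a special-purpose element, one instruction at about a tenth of that). They are not measurements from this report.

```python
# Areas in units of one special-purpose compute element (hypothetical round numbers).
ACTIVE_AREA = 100   # active compute + flexible interconnect: about two orders of magnitude
INSTR_AREA = 10     # one stored instruction: about an order of magnitude below ACTIVE_AREA

def element_area(n_instructions):
    """Total area of one reconfigurable compute element with n stored instructions."""
    return ACTIVE_AREA + n_instructions * INSTR_AREA

def memory_share(n_instructions):
    """Fraction of the element's area spent on instruction memory."""
    return n_instructions * INSTR_AREA / element_area(n_instructions)

for n in (1, 10, 100):
    print(n, element_area(n), round(memory_share(n), 2))
```

Under these assumptions, ten instructions already match the active area, and one hundred instructions push the element to roughly 1000× its special-purpose counterpart.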
Since instruction area can quickly come to dominate even the flexible interconnect, when
building reconfigurable computing architectures we often look for structure in typical
computational problems which will allow us to reduce the instruction size. One common technique
is to control several pieces of interconnect and computational elements with a single instruction.
That is, we assemble wide datapaths which are controlled together. This reduces the size of the
configuration by reducing the number of instructions required to specify device behavior at any
point in time.
Consequently, when we build a reconfigurable computing device, we must make decisions
about:
• How many primitive computational elements are directed by each instruction?
• How many instructions are controlled by each controller?
• How many instructions are stored on chip?
• How rapidly can the instructions change, chip-wide?
The answers to these questions place a particular reconfigurable device in the RP-space. The
answers to each of these questions also determine the size of the reconfigurable device and its
efficiency on various tasks.
• If the task has data elements wider than the datapath width controlled by each of the
  architecture's instructions, the architecture provides finer instruction control than necessary
  and pays an overhead for redundant instruction memory.
• If the task has data elements narrower than the architecture's datapath width, the architecture
  does not allow control over the compute element at the fine granularity of the task, and
  computational capacity in the architecture goes to waste.
• If the task needs to cycle through only a few different instructions, but the architecture
  provides large instruction memories, the reconfigurable device is unnecessarily large for the
  task, wasting area in unused memories.
• If the task needs to cycle through a large number of different instructions at different times
  but the architecture provides small instruction memories, the reconfigurable device will not
  be able to store all the instructions logically associated with each computational element.
  Extra computational elements will be required simply to hold all of the task's instructions,
  but these extra computational elements will effectively sit idle during computation.
• If the task requires more independent control of computing resources than provided by the
  architecture, either resources will go unused since they cannot be controlled or memory
  requirements will increase greatly to compensate for the lack of control independence.
• If the task requires less independent control than the architecture supplies, the additional
  controllers and resources are redundant and add to device overhead.
• If the task requires rapidly changing instructions, but the architecture does not meet the
  required bandwidth, computational resources sit idle, paced by task description bandwidth
  rather than by the availability of computing resources.
• If the task can handle slowly changing instructions, but the architecture dedicates significant
  area to providing high instruction delivery bandwidth, much of the dedicated area is overhead,
  making the device larger than necessary for the task.
Interconnect: In devices where the ratio between instructions and compute elements is low,
ﬂexible interconnect will remain the dominant area feature in reconﬁgurable devices. Here, a
device must decide how richly to interconnect the compute elements. Rich interconnect makes
the routing area even greater, while inadequate interconnect can make it impossible to make use
of the available computing elements. The choice in interconnect richness determines where the
architecture will be most efﬁcient.
• If the interconnect is richer than needed by the task, the device will be larger than necessary.
• If the interconnect is not as rich as required by the task, the task must be laid out sparsely
  on the architecture. Portions of the interconnect and compute resources are wasted as they
  cannot be used.
In all computing devices there are two components associated with routing data between
producers and consumers:
1. Spatially routing intermediates from the compute element which produced them to those
which consume them
2. Retiming the intermediates for the time when the consumer is ready to use them
Particularly, in reconﬁgurable devices with expensive, ﬂexible interconnect, memories can hold
values for retiming more cheaply than active interconnect.
Degrees of Generality and Reconfigurability: There are, of course, degrees of “generality”
between fully special-purpose devices and general-purpose devices. Some special-purpose devices
are given limited configurability to broaden their use – e.g. a typical UART can be configured to
handle different data sizes, data rates, and parities. Some devices are targeted at being “general”
within very specific domains. Digital signal processors are one of our most familiar examples of a
general-purpose, domain-optimized device. The domain may dictate the typical data element size
or desirable instruction depth. Further, the domain may allow a more structured programmable
interconnect to suffice. Nonetheless, to the extent that we have post-fabrication control over the
computations which a device performs, the device will have some form of instructions and will
generally have some level of flexible interconnect. With these features it exhibits reconfigurable
characteristics, and many of the architectural characteristics, relations, and issues we have
identified in our, more ideal, RP-space.
FPGAs: Conventional FPGAs fall at a moderately extreme point in our RP-space with single bit
wide datapaths and single instruction deep instruction memories. At this point, they are efficient
on the highest throughput, fine-grained computing tasks and their efficiency drops rapidly as the
task throughput requirements diminish and the word size increases.
Beyond FPGAs in the Reconfigurable Computing Space: Beyond FPGAs there is a rich
reconfigurable architecture space. Our DPGA represents one different point in this architectural
space (see Figure 14.1). The DPGA retains the bit-level granularity of FPGAs, but instead of
holding a single instruction per active array element, the DPGA stores several instructions per
array element. The memory necessary to hold each instruction is small compared to the area
comprising the array element and interconnect which the instruction controls. Consequently, adding
a small number of on-chip instructions does not substantially increase die size or decrease
computational density. The addition does, however, substantially increase the device's ability to
efficiently handle lower throughput, more irregular computational tasks. At the same time, a large
number of on-chip instructions is not as clearly beneficial. While the instructions are small, their
size is not trivial – supporting a large number of instructions per array element (e.g. tens to
hundreds) would cause a substantial increase in die area, decreasing the device efficiency on
regular tasks. Consequently, we see that we can achieve a design point which is moderately robust
across a wide range of throughput variations by balancing the instruction memory area with the
fixed area for interconnect and computational units.
[Figure 14.1: two surface plots, FPGA (left) and DPGA (right), of Efficiency (0.2 to 1.0) against Path Length (1 to 1024) and Design w (1 to 128).]
FPGA DPGA
1 16
1, 4, 2, 0.5, 0, 16384
Figure 14.1: FPGA and DPGA efficiency in RP-space
The importance of efficiently supporting retiming of intermediates was most clearly
demonstrated in the context of the DPGA design. Here, we saw that the benefits of deeper instruction
memories were substantially reduced if we forced retiming to occur on active interconnect.
However, when we provided architectural registers so that retiming could take place in registers,
DPGAs were able to realize typical computing tasks in one-third the area required by conventional
FPGAs.
While we did not detail them in this thesis, multiple context components with moderate datapaths
also come down essentially in this reconfigurable architectural space. Pilkington's VDSP [Cla95]
has an 8-bit datapath and space for four instructions per datapath element. UC Berkeley's PADDI
[CR92] and PADDI-II [YR95] have a 16-bit datapath and eight instructions per datapath element.
All of these architectures were originally developed for signal processing applications and can
handle semi-regular tasks on small datapaths very efficiently. Here, too, the instructions are small
compared to the active datapath computing elements, so including 4-8 instructions per datapath
substantially increases device efficiency on irregular applications and robustness to throughput
variations with minimal impact on die area.
Flexible Deployment of Instruction Resources: While architectures such as these are often
superior to the conventional extremes of FPGAs, any architecture with a fixed datapath width,
on-chip instruction depth, and instruction distribution area will always be less efficient than the
architecture whose datapath width, local instruction depth, and instruction distribution bandwidth
exactly matches the needs of a particular application. Unfortunately, since the space of allocations
is large and the requirements change from application to application, it will never make sense to
produce every such architecture and, even if we did, a single system would have to choose one of
them. Flexible, post fabrication, assembly of datapaths and assignment of routing channels and
memories to instruction distribution enables a single component to deploy its resources efficiently,
allowing the device to realize the architecture best suited for each application. Our MATRIX
design represents the first architecture to provide this kind of flexible instruction distribution and
deployable resources. Using an array of 8-bit ALU and register-file building blocks interconnected
via a byte-wide network, our focus MATRIX design point has 3× the raw computational density
of processors and can yield 10× the computational density of conventional processors on high
throughput tasks.
15. Projections
In Parts III and IV, and Chapter 14, we focussed on reconﬁgurable, general-purpose computing
devices roughly characterized by RP-space. In that focussed domain, we were able to look closely
at area costs, computational density, and efﬁciencies. General-purpose devices, more broadly,
also share many of the characteristics (e.g. instruction depth and width, interconnect richness,
data retiming) which we identiﬁed as key architectural parameters in RP-space and in the more
detailed architectural studies. In this chapter, we speculate more broadly on what the relationships
developed while focusing on reconfigurable devices in RP-space might tell us about general-purpose
architectures, in general. We emphasize that these extrapolations may overly trivialize important
architectural aspects which did not arise in RP-space, and we attempt to identify those aspects
during the discussion.
15.1 Role of Memory in Computational Devices
In our computing architectures, we have seen memory show up in two roles:
1. instruction storage
2. data retiming
Neither appears to be really fundamental for computing, but both are of pragmatic value as they
facilitate resource sharing and reuse, which allows us to implement computing functions in less area
when throughput requirements are limited. In special-purpose computing architectures we did not
need instructions. For ease of construction, we often use clocked registers to tolerate variable
delays through primitive blocks, but otherwise memory for retiming arises primarily from
serialization and reuse of common resources.
15.1.1 Memory for Instructions
Instruction memories reduce hardware requirements in two ways by allowing:
1. a fabricated resource to perform any of several functions
2. a resource to be shared among several different functions during a single computation
Select Function: In our general-purpose devices, a single resource can perform any of a number
of different functions. This allows us not to have a single, dedicated piece of hardware for every
possible function ever desired. For an application or device requiring primitive component
computations, this realizes an important compression from “all possible computing functions made
of primitives” to “all primitive computing functions required by this application.” Here, each
primitive computing element needs a configuration memory to tell it what computation to perform
and where its inputs are produced among the computing elements. The per computing element
overhead we pay for this reduction is high, mostly in terms of flexible interconnect, but it is
quickly balanced by the exponential reduction realized by only having to implement the functions
required by this task.
We can return to our pedagogical 4-LUTs to see this reduction more concretely. There are
2(24) different functions which can be implemented with , four-input gates. So, even
with the 100× area overhead per gate required to support flexible interconnect, our programmable,
4-LUT device is significantly smaller than implementing all possible input functions for anything
other than trivial values of .
Shared Function: Our general-purpose devices also allow us to share each piece of hardware
among multiple functions within a single computing task. This aspect allows us to compress area
requirements further from “all primitive computing functions required by this application” to “all
computing functions required at one point in time in order for this application to achieve the
requisite computational throughput.” Here, we take advantage of the fact that the configuration
memory to describe a computing function is smaller than the active area required to route its inputs
and compute the result. In the extreme, this allows us to reduce the area required for a computation
from the area required for programmable compute primitives and their associated interconnect
to the area required to store the description of the computation and interconnect performed by
programmable compute elements.
What we trade for this reduction is computational throughput. With fewer active computing
functions than the task has primitive computing functions, we require at least as many cycles as
the ratio of the two counts to perform the computation of the original task. Sometimes, the original
task already had a dependency structure such that this reduction comes for free or at minor cost.
Other times, we are trading increased evaluation time for reduced implementation area. In the
limit, where we have a single computing element with an instruction memory holding every
instruction in the task, the task can take as many cycles to evaluate as it has primitive computing
functions.
We often talk about virtualizing hardware resources. The virtualization really substitutes a less
expensive resource (e.g. an instruction in memory, state in memory, cheaper forms of memory) for
a more expensive one (e.g. a piece of hardware to actually perform a function, fast access memory).
Behind all of these virtualizations, we must ultimately have some form of physical memory to hold
the description of the virtualized resources and their state.
Notice that we can continue to push technology and structure in order to reduce this last limit,
but it cannot be avoided. We can apply aggressive memory technology, such as DRAM or flash
memory, to reduce storage cell size. We can store data on different media, such as magnetic disks
or tape. We can exploit structure in the task description to compress the number of bits required
down to the Kolmogorov complexity limit. In the limit, however, we ultimately require sufficient
area to store the description of the computing task and no further reduction is possible.
We noted in Section 4.4 that memory can be used as a general-purpose computing element. That
role of memory is a special case of the role of memory as instructions. The memory contents act as
an instruction which configures the memory array to provide the desired computational
transformation between the address inputs and the data outputs. In Section 4.5, we saw that the
computational portion of conventional FPGAs, the LUTs, were programmed in exactly this way.
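A minimal sketch of this idea follows: a 4-input LUT is modeled as a 16-entry, 1-bit-wide memory whose contents select the function. The helper names `make_lut` and `lut_eval` are hypothetical and chosen for illustration only.

```python
def make_lut(func, k=4):
    """Fill a 2**k-entry, 1-bit-wide memory with the truth table of func.

    Address bit i carries input i, so the memory contents are the "instruction"
    that selects which k-input function the array computes.
    """
    return [func(*[(addr >> i) & 1 for i in range(k)]) & 1 for addr in range(2 ** k)]

def lut_eval(memory, *inputs):
    """Evaluate the LUT: the inputs form the address, the stored bit is the output."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return memory[addr]

# The same 16-bit memory array becomes a different gate under different contents.
xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
and4 = make_lut(lambda a, b, c, d: a & b & c & d)

print(lut_eval(xor4, 1, 0, 1, 1), lut_eval(and4, 1, 1, 1, 1))
```

Since the memory holds 16 bits, it can be loaded with any of 2^16 distinct contents, one for each possible four-input function.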
15.1.2 Memory for Retiming of Intermediate Data
Once we begin to reuse primitive compute functions for different roles at different times, we
introduce the need to assure that the right data arrives at the inputs of the function at the right time.
This need is particularly acute when we serialize execution and use a single primitive to perform
multiple different functions, but it also appears when we reuse a primitive to perform exactly the
same function on logically different data. Since programmable interconnect is expensive, we use
memories as an inexpensive way to provide the temporal retiming necessary for correct execution.
The use of memory for retiming is pragmatic. We could get away with little more than pipeline
registers on interconnect. However, it is cheaper to transport data forward in time through memory
than over interconnect. If we do not take advantage of this, much of the area savings potentially
associated with serializing execution and sharing primitive compute elements cannot be realized.
The requirements for data retiming depend on the interconnect structure of the problem, not
the number of compute elements in the task. The amount of retiming does depend on the amount
of serialization. With more parallelism, more data can be consumed as soon as it has been
spatially routed, avoiding the need for retiming. As we compress size requirements by converting
task compute primitives into instructions sharing a small number of physical compute elements,
we must ultimately have space to store all computation intermediates at the widest point in the
computation flow. That is, we ultimately need space for all the live intermediates in a computation.
The number of such intermediates depends on the task and its mapping. The mapping should try to
minimize the number of such intermediates.
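To make the dependence on the mapping concrete, the following sketch counts the peak number of live intermediates for a fully serialized schedule of a small dataflow graph. The graph, the operation names, and the function `peak_live_values` are hypothetical examples. The count assumes a value is live from the step that produces it until the step of its last consumer, and that values with no consumer are task outputs which stay live to the end.

```python
def peak_live_values(schedule, consumers):
    """Peak count of produced-but-not-yet-fully-consumed values.

    schedule:  list of operation names in serial execution order.
    consumers: dict mapping an operation to the operations that read its result.
    """
    position = {op: t for t, op in enumerate(schedule)}
    end = len(schedule)
    # Step at which each value is read for the last time (outputs live to the end).
    last_use = {
        op: max((position[c] for c in consumers.get(op, [])), default=end)
        for op in schedule
    }
    peak = 0
    for t in range(end):
        live = sum(1 for op in schedule[: t + 1] if last_use[op] > t)
        peak = max(peak, live)
    return peak

# Four loads feeding a two-level adder tree.
consumers = {"l1": ["s1"], "l2": ["s1"], "l3": ["s2"], "l4": ["s2"],
             "s1": ["out"], "s2": ["out"]}
loads_first = ["l1", "l2", "l3", "l4", "s1", "s2", "out"]
interleaved = ["l1", "l2", "s1", "l3", "l4", "s2", "out"]
print(peak_live_values(loads_first, consumers), peak_live_values(interleaved, consumers))
```

In this example, issuing all loads first needs storage for four live values, while interleaving loads with the additions that consume them needs only three.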
Note that all non-instruction uses of memory fall into this category.
• Register File – We have already seen that register files perform the same functions as our
  input retiming registers, transporting results in time from the point of production to the point
  of consumption.
• Main memory, including data caches – All the data results stored in memories are being
  transported between the point of production and consumption.
• Buffers and FIFOs – These are explicitly retiming the arrival of data to a time when a portion
  of the system is ready to consume the data.
If we had not sequentialized execution and shared computational resources among multiple tasks,
we would not need these memories.
Even special-purpose devices often sequentialize their processing of data so that a few, ﬁxed
compute elements can serve to process data with nominally different roles. The most common
example of this is in audio, video, or image processing. Rather than dedicating a separate compu-
tational unit to each pixel in a frame, many pixels are processed on the same computational unit.
The pixel data stream is serialized into and out of the special-purpose device. The pixels within the
frame often need to be retimed so that the right pixel values are presented to the compute elements
at the right time. For example, when pixels are fed in by rows, it is often necessary to perform
row-wise retiming on data so that the compute element can calculate column-wise interactions
between pixel elements. If all the data necessary for the computation were presented simultaneously
and all of the output was produced at once, this retiming would not be necessary. However,
serialization and reuse is often necessary to make the amount of hardware resources, including
component input and output bandwidth, tractable. The serialization allows us to share all of the
hardware resources, but requires that we provide unique storage space for intermediate data so that
we perform the correct computation on the shared resources.
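The row-wise retiming described above can be sketched as a one-row line buffer. In this hypothetical example a single shared compute element subtracts each pixel from the pixel directly above it while the frame arrives as a row-major stream. The function name and the choice of a vertical difference are illustrative assumptions, not an operation taken from this report.

```python
from collections import deque

def vertical_differences(pixels, width):
    """Consume a row-major pixel stream; emit pixel minus pixel-above for rows >= 1.

    The line buffer holds exactly one row of pixels, retiming each pixel by
    `width` steps so it meets the pixel below it at the shared compute element.
    """
    line_buffer = deque(maxlen=width)
    out = []
    for p in pixels:
        if len(line_buffer) == width:
            above = line_buffer[0]      # the pixel that arrived `width` steps ago
            out.append(p - above)
        line_buffer.append(p)           # the oldest entry is dropped automatically
    return out

# A 3x3 frame streamed by rows.
frame = [1, 2, 3,
         4, 5, 6,
         7, 8, 10]
print(vertical_differences(frame, width=3))
```

The storage cost here is one row of pixels, which is the unique intermediate storage the shared compute element needs in this sketch.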
15.1.3 Implications
There are two important ideas to take away from these observations on the role of memory:
1. Memories in computer architectures facilitate the sharing and reuse of expensive resources.
It is the pragmatic fact that the memory necessary to hold an intermediate or an instruction
is smaller in conventional technologies than the active computing and interconnect elements
which process the data according to the instruction which makes it worthwhile to use memory
to reduce implementation area requirements.
2. As we go to heavier sharing, each doubling of our sharing factor does not result in a halving
of implementation area because we always leave behind a memory residue composed of (1)
instructions and (2) intermediate data. In the limit, the size of our computing element for a
task is dictated by the area to hold the instructions to describe the task and the intermediate,
live data which must be stored as the task computes.
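The second point can be illustrated with a toy area model. This is only a sketch under stated assumptions: the linear model, the function name `implementation_area`, and the numbers (1024 operations, 100 area units per active element, 2 area units of instruction and intermediate storage per operation) are hypothetical and are not the report's area model.

```python
def implementation_area(n_ops, sharing, active_area, memory_area):
    """Toy model: active area shrinks with sharing, the memory residue does not.

    n_ops       -- operations in the task (each needs an instruction and data slot)
    sharing     -- number of operations time-multiplexed onto one active element
    active_area -- assumed area of one active compute/interconnect element
    memory_area -- assumed area to store one instruction plus its live data
    """
    active_elements = -(-n_ops // sharing)   # ceiling division
    residue = n_ops * memory_area            # stays fixed however heavily we share
    return active_elements * active_area + residue


# Hypothetical parameters, in arbitrary area units.
N_OPS, ACTIVE, MEMORY = 1024, 100, 2
areas = {s: implementation_area(N_OPS, s, ACTIVE, MEMORY) for s in (1, 2, 512, 1024)}
```

With these assumed numbers, doubling the sharing factor from 1 to 2 nearly halves the area (104448 to 53248), while doubling it from 512 to 1024 saves under 5% (2248 to 2148), because the 2048-unit memory residue now dominates.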
15.2 Reconfiguration: A Technique for the Computer Architect
Device architects are often faced with the dilemma of balancing semantic expressiveness with
instruction distribution bandwidth. In processors, only a few bits are allocated to instruction
speciﬁcation limiting (1) the number of different computations which can be selected and (2) the
number of different sources which can be expressed. The latter manifests itself as limited address
space and limited size register files, while the former is often taken for granted. Architects are
reluctant to increase instruction width because it entails added costs in (1) on- and off-chip storage
space for all instructions, (2) distribution bandwidth, and (3) power for instruction distribution.
However, limited semantic expressiveness can force the processor to issue a large number of
instructions to perform the desired computation, resulting in even greater losses in time and power
efﬁciency.
Conventional processors generally support an ALU which performs basic operations on 2 or 3
word-wide data inputs. Today we see typical word sizes of 32 and 64 bits. Conventional processors
further limit their instruction size to the word size to limit instruction bandwidth requirements.
As a consequence of this limitation, it can often take a large number of instructions to specify an
operation which is not inherently difﬁcult for the active silicon to perform.
To appreciate the magnitude of the semantic disparity here, we notice that there are:
$$2^{w \cdot 2^{2w}}$$
functions from two $w$-bit wide inputs to one $w$-bit wide output. If we limit the specification of
our function to $w$ bits, we can only address $2^{w}$ functions with this instruction. Thus, if all of the
$2^{w \cdot 2^{2w}}$ functions were equally likely, on average, it would take at least $2^{2w}$ cycles to compute a function.
In practice, a good fraction of the bits are dedicated to operand selection, increasing the
severity of the instruction limitation. While all operations are not equally likely, in practice, this
disparity demonstrates that conventional processor design makes an early binding, pre-fabrication
time, decision on the effective cost of basic operations. Many applications cannot use the active
silicon area on conventional processors efﬁciently since they cannot directly issue the instructions
native to the task.
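The counting argument can be checked numerically. The sketch below assumes a word width `w` and treats a function from two `w`-bit inputs to one `w`-bit output as a truth table with one `w`-bit entry per input combination; the function name `instruction_disparity` and the sample widths are illustrative choices.

```python
def instruction_disparity(w):
    """Return (spec_bits, addressable_functions, min_cycles) for word width w.

    A function from two w-bit inputs to one w-bit output is a truth table
    with 2**(2*w) rows, each holding a w-bit result, so naming an arbitrary
    function takes w * 2**(2*w) bits and there are 2**(w * 2**(2*w)) functions.
    """
    table_rows = 2 ** (2 * w)        # one row per pair of w-bit input values
    spec_bits = w * table_rows       # log2 of the number of distinct functions
    addressable = 2 ** w             # functions a single w-bit instruction can name
    min_cycles = spec_bits // w      # cycles needed at w specification bits per cycle
    return spec_bits, addressable, min_cycles


small = instruction_disparity(2)     # tiny case that can be checked by hand
word32 = instruction_disparity(32)   # a 32-bit word size
```

For `w = 2` the table has 16 rows and needs 32 specification bits, so a 2-bit instruction stream needs at least 16 cycles to name an arbitrary function. For `w = 32` the same bound is `2**64` cycles.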
Reconfiguration is a technique which allows us to find a resolution to this dilemma. Reconfiguration
allows us the semantic expressiveness of very large instructions without paying commensurate
bandwidth and deep storage costs for these powerful instructions. What we give up in this solution
is the ability to change the entire instruction on every cycle. Rather, the rate of change of the full
instruction is determined by the instruction bandwidth we are willing to expend. The distinction
between instruction bandwidth which delivers all the semantic content on every cycle and instruction
bandwidth that can be used to load a larger semantic instruction is an important one because
configured instruction bits which can be used for many operational cycles do not require additional
instruction bandwidth once loaded. Returning to our simple calculation above, it may take us $2^{2w}$
cycles to load a specification for an instruction the first time it is encountered. However, if this value
is loaded into configuration memory, subsequent uses can operate using the loaded data, avoiding
the time required to redundantly specify the operation. An architecture without configuration would
require the $2^{2w}$ cycles each time the computation is required. Reconfiguration thus allows us to
compress instruction distribution requirements in cases where the instruction changes slowly or
infrequently.
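A small cycle count makes the amortization concrete. This is a sketch only: it assumes a word width `w`, a load cost of `2**(2*w)` cycles as in the calculation above, and (as an added simplification) that each use of an already configured instruction costs one cycle. Both function names are hypothetical.

```python
def cycles_without_configuration(w, uses):
    """Every use must re-specify the operation: 2**(2*w) cycles per use."""
    return uses * 2 ** (2 * w)


def cycles_with_configuration(w, uses):
    """Load the specification once, then assume one cycle per subsequent use."""
    load_cycles = 2 ** (2 * w)
    return load_cycles + uses


# Hypothetical workload: the same operation is used 1000 times at w = 2.
unconfigured = cycles_without_configuration(2, 1000)
configured = cycles_with_configuration(2, 1000)
```

Under these assumptions the configured case takes 1016 cycles against 16000 without configuration. The gap grows with the number of uses between instruction changes, which is the slowly changing case the paragraph describes.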
Reconfiguration opens a middle ground, or an intermediate binding time, between ‘behavior
which is hardwired at fabrication time’ and ‘behavior which is specified on a cycle by cycle
basis.’ This middle ground is useful to consider in the design of any kind of computing device,
not just conventional FPGAs. When designing a device with any general-purpose capabilities,
the architect’s decision can extend beyond what expressiveness to include or omit based solely
on instruction size and bandwidth. Rather, the architect should consider the expressiveness which
may be required for efﬁcient task implementations and the rates at which various parts of the task
description change. Characteristics of the task which change infrequently can be conﬁgured rather
than broadcast.
15.3 Projecting General-Purpose Computing onto RP-space
Our RP-space model articulated in Chapter 9 provided architecture implementation area estimates
based on a few major parameters. Instruction depth, data width, interconnect richness,
and intermediate data retiming support have been the focus of our discussion in Parts III
and IV. More broadly, these parameters have rough analogs in all general-purpose architectures.
One can, thus, generally project a general-purpose architecture into a point in RP-space by identifying
these parameters and abstracting away architecture characteristics not covered in the RP-space
model.
15.3.1 General Hazards
The more general projection to RP-space may be hazardous as it ignores many detailed characteristics
of real architectures in the broader general-purpose architecture space, such as:
No special-purpose capacity – we explicitly assumed only general-purpose building blocks
for RP-space. Most nominally “general-purpose” architectures include blocks of special-
purpose logic. The special-purpose blocks do not provide general-purpose capacity, but
can provide high density to applications when the specialized structures match the task
requirements of the application. The multiply example reviewed in Chapter 5 is the most
common instance of a special-purpose block added to general-purpose architectures.
Homogeneous processing arrays – we explicitly assumed homogeneous arrays. Because of
the mixed processing requirements in most computing tasks, a hybrid array which mixes
processing blocks with different parameters may be quite interesting. Extending the model
to reasonably encompass mixed architectures is an interesting direction for future work.
No boundary effects – we assumed single chip implementations for all of our comparisons,
in effect, assuming that full task implementations for all alternatives ﬁt onto a single die.
Since we are looking at 10’s to 100’s of G$\lambda^2$ of silicon area in the near future, a large class of
computing tasks or primary subtasks can be reasonably placed on a single die such that the
assumption seems reasonable. However, we also saw that inefficient design points can easily
be two orders of magnitude larger than efﬁcient points. With this much area variation, it is
not really reasonable to assume that both implementations are single die implementations.
The larger implementation is likely to require a multiple-chip solution and will suffer further
degradation in latency and bandwidth due to chip crossings.
Another consequence of ignoring boundary effects is that the model trivializes limited device
i/o effects between different components that might make up the core of a general-purpose
processing system. Notably, systems have traditionally placed bulk memory on different ICs
from the processing. As a result, care must be taken to prevent the limited boundary i/o
between compute and memory devices from being the performance limiting bottleneck. This
care often shows up as additional mechanism and memories on the processing chips to make
most effective use of the limited interchip i/o bandwidth and high interchip i/o latency.
15.3.2 Processors, FPGAs, and RP-space
For years, microprocessors have been our canonical example of single-chip, general-purpose
computing devices. It is tempting to try to understand the relation between processors, FPGAs, and
RP-space. In Part II, we took a broad, empirical look at these devices and made a few, high-level
observations on their relative efﬁciencies. In this section, we revisit this comparison projecting
both architectures into RP-space.
Conventional processors have:
1. moderately wide datapaths which have been growing larger over time (e.g. 16, 32, 64 bits)
2. support for large on-chip instruction caches which have also been growing larger over time
and can now hold hundreds to thousands of instructions (contexts)
3. high bandwidth instruction distribution so that one or several instructions may be issued per
cycle at the cost of dedicating considerable die area for instruction distribution
4. a single thread of computation control
As a consequence these devices are efﬁcient on wide word data and irregular tasks – i.e., those
tasks which need to perform a large number of distinct operations on each datapath processing
element. On tasks with narrow data items, the active computing resources are underutilized,
wasting computing potential. Processors pay overhead for their deep instruction memories. On
very regular computational tasks, the on-chip space to hold a large sequence of instructions goes
largely unused. Processors exploit wide datapaths to reduce the cost per instruction, but even
so, with instruction stores typically supporting thousands of instructions, instruction and retiming
memories dominate, leaving their peak general-purpose computational density three orders of
magnitude lower than special-purpose devices and one order of magnitude below FPGAs.
Looking at modern scalar, superscalar, and VLIW processors, then, we might abstract a modern
processor as: 2, 64, 1024. Processors use ALU bit-slices in lieu of lookup tables.
Each ALU bit-slice takes in two data inputs and a carry bit. As such, they provide less than a
full 2- or 3-LUT’s capacity per ALU bit, in general, but can provide an add, subtract, or compare
operation per bit which would require a pair of 3-LUTs. Processors also include:
special-purpose capacity (e.g. multipliers, ﬂoating-point units)
complicated ﬂow control (e.g. branch prediction, bypassing)
memory controllers to deal with boundary bottlenecks between compute and bulk memory
components (e.g. cache-controllers, TLBs)
These items tend to make the area of a processor larger than that predicted by the model in Chapter 9.
As we have seen in Table 4.1 and Section 4.1, when performing traditional ALU ops, processors
generally provide fewer bit operations per ALU bit than a small LUT. These effects will tend to make
the RP-space projection of the processor optimistic in terms of area; that is, the real processor will
be larger and provide less computational capacity per unit area. On the other hand, the specialized
capacity in processors allows them to handle fixed and floating point arithmetic operations more
efﬁciently than would be predicted by the RP-space projection.
We have already seen that conventional FPGAs have:
narrow datapath (e.g. almost always one bit)
on-chip space for only one instruction per compute element – i.e. the single instruction which
tells the FPGA array cell what function to perform and how to route its inputs and outputs
minimal die area dedicated to instruction distribution such that it takes hundreds of thousands
of compute cycles to change the active set of array instructions
[Figure 15.1: Comparing efficiency of FPGA and Processor idealizations in RP-space. Panels "FPGA" and "RP-space mapped Processor" plot Efficiency (0.2 to 1.0) against Path Length (1 to 1024) and Design w (1 to 128).]
As a consequence these devices are efﬁcient on bit-level data and regular tasks – i.e., those tasks
which need to repeatedly perform the same collection of operations on data from cycle to cycle.
On tasks with large data elements, these ﬁne-grain devices pay excessive area for interconnect and
instruction storage versus a coarser-grain device. On very irregular computational tasks, active
computing elements are underutilized – either the array holds all subcomputations required by a
task, but only a small subset of the array elements are used at any point in time, or the array holds
only the subcomputation needed at each point in time, but must sit idle for long periods of time
between computational subtasks while the next subtask’s array instructions are being reloaded.
The peak computational density for FPGAs is two orders of magnitude lower than special-purpose
devices because they pay overhead primarily for the ﬂexible interconnect.
Figure 15.1 shows graphically this idealized comparison projected into RP-space in the style
used in Section 9.5. As noted before, the FPGA is less than 1% efﬁcient at the cross point of wide
task data words and long path lengths. Similarly, the modeled processor is less than 1% efﬁcient
processing single bit data items at a path length of one. Certainly, if the processor needs to perform
bit operations that do not match its special-purpose support, the inefﬁciency will be at least this
large – and may be greater due to the effects noted above which make the real processor larger than
the model.
15.3.3 General-Purpose Computing Space
We have already noted that RP-space is large such that we can see two or more orders of
magnitude in efficiency loss when the application requirements are mismatched with the architecture
structure for fixed instruction architectures. Our comparison in the previous sections underscores
that the general-purpose architectural space is even larger, making it paramount that one understand
the realm of efficiency for each “general-purpose” computing device when selecting a device for an
application. It underscores the room for intermediate architectures such as the DPGA, PADDI,
or VDSP to cover parts of the space which are not covered well by either conventional extreme,
processors or FPGAs. It also underscores the desirability of architectures like MATRIX which
allow some run-time reallocation of resources to provide more robust yielded performance across
the computational space.
Hybrid Since many tasks have a mix of irregular and regular computing components and a mix
of native data sizes, a hybrid architecture which tightly couples arrays of mixed datapath sizes and
instruction depths along with flexible control may be able to provide the most robust performance
across the entire application. While this thesis focussed on characterizing the implications of each
pure architectural point, it should be clear from the development here how a hybrid architecture
might be better suited to the mix of datasizes and regularities seen in real applications. In the
simplest case, such an architecture might couple an FPGA or DPGA array into a conventional
processor, allocating the regular, ﬁne-grained tasks to the array, and the irregular, coarse-grained
tasks to the conventional processor. Such coupled architectures are now being studied by several
groups (e.g. [DeH94] [Raz94] [WC96]).
15.4 Trends and Implications for Conventional Architectures
In summary, we see that conventional, general-purpose device architectures, both microprocessors
and FPGAs, live far apart in a rich architectural space. As feature sizes shrink and the available
computing die real-estate grows, microprocessors have traditionally gone to wider datapaths and
deeper instruction and data caches, while FPGAs have maintained single-bit granularity and a single
instruction per array element. This trend has widened the space between the two architectural
extremes, and accentuated the realm where each is efficient. A more effective use of the silicon area
now becoming available for the construction of general-purpose computing components may lie in
the space between these extremes. In this space, we see the emergence of intermediate architectures,
architectures with flexible resource allocation, and architectures which mix components from
multiple points in the space. Both processors and FPGAs stand to learn from each other’s strengths.
In processor design, we will learn that not all instructions need to change on every cycle, allowing
us to increase the computational work done per cycle without correspondingly increasing on-chip
instruction memory area or instruction distribution bandwidth. In reconfigurable device design, we
will learn that a single instruction per datapath is limiting and that a few additional instructions are
inexpensive, allowing the devices to cope with a wider range of computational tasks efﬁciently.
15.4.1 Microprocessors
Over the past two decades, microprocessors have steadily increased their word size and their
cache size. While these trends allow larger tasks to ﬁt in on-chip caches and allow processors
to handle larger word operations in a single cycle, the trends also make processors less and less
efﬁcient in their use of die area. While some large word operations are required, a larger and
larger fraction of the operations executed by modern processors use only a small portion of the
wide datapath. The computationally critical portions of programs occupy only small portions of
the instruction and data cache.
We can continue to improve aggregate processor performance by using more silicon in this
manner, but the performance per unit area will steadily decrease. To the extent that silicon area
is inexpensive, task recompilation is hard or unacceptable, and various forms of parallelism are
difﬁcult to achieve, the current trends have their value.
However, to the extent we wish to engineer better silicon systems which do more with fewer
resources, these trends are now yielding diminishing returns. We can manage more programmable
compute elements than a single, central, word-wide ALU on modern IC dies. Reconfiguration
allows us to do this without paying a prohibitive cost for increased instruction distribution as we
go to more, independently controlled computing units.
15.4.2 Multiprocessors
The conventional view of multiprocessing is that we replicate the entire microprocessor and
place these replicas on the same board or die. At best, this allows aggregate performance to
improve with additional area dedicated to additional processors. However, it entails a large amount
of unnecessary cost, replicating entire processors when many portions of the processor may not
need to be replicated. Further, coupling between processors is poor, at best, entailing 10’s to 100’s
of cycles of latency to move data from one processing element to another and significant overhead
to coordinate the activities of multiple processing units.
Most of the tasks which have generally been “good” multiprocessor applications are very
regular computing tasks for which configured, systolic dataflow can provide more area efficient
implementations. For the sake of intuition, consider an image processing task where we need to
perform 100 operations on each pixel. We can divide this task among conventional processors,
where each processor must have memory to hold the 100 operations and must pay overhead cycles
for communication, as necessary, among the processors. Alternately, we can configure a hardware
pipeline to process the data. If we allocate 100 compute elements, each compute element in the
configured pipeline needs to execute only its one operation. Direct connections between computing
elements transport data, avoiding additional overhead cycles. To get the same throughput as the 100
element systolic design, the multiprocessor implementation would need, at least, 100 processors.
In terms of instruction memory alone, the multiprocessor implementation requires memory area to
hold 9900 more instructions than the systolic implementation, making it signiﬁcantly larger just to
support the same throughput.
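The instruction-memory comparison in this example is simple arithmetic, sketched below. The helper name `instruction_slots` is hypothetical; the counts (100 operations per pixel, 100 processors, 100 pipeline elements with one instruction each) are the ones given in the text.

```python
def instruction_slots(elements, instructions_per_element):
    """Total instruction storage, counted in instructions, across all elements."""
    return elements * instructions_per_element


OPS_PER_PIXEL = 100

# Each of 100 processors stores the full 100-operation program.
multiprocessor = instruction_slots(100, OPS_PER_PIXEL)

# Each of 100 pipeline elements stores only its own single operation.
systolic = instruction_slots(100, 1)

extra_instructions = multiprocessor - systolic
```

The difference comes to 9900 instructions, the figure quoted in the paragraph above.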
The traditional strength of microprocessors has been their ability to pack large computations
into small area by reusing central computing resources. This tight packing of functionality comes
at the cost of a decrease in computational density as we saw in Chapter 4 and Section 15.3.2. When
we are willing to pay area to increase throughput, the traditional microprocessor architecture is
not efﬁcient since it brings with it the baggage of a large investment in instruction distribution,
instruction memory, and control which are unnecessary for highly regular tasks. Further, the
i/o structure of conventional processors is designed around heavy sequentialization, creating an
interconnect bottleneck which makes high throughput usage impractical.
16. Review of Major Concepts
After reading this thesis, you should appreciate the following major concepts:
Our reconfigurable computing space, RP-space, is largely characterized by architectural
choices surrounding the storage, distribution, binding, and control of instructions. [Chapters 8
and 9]
These choices about instruction resources, in turn, are largely responsible for defining the
circumstances under which a given architecture within the RP-space is most efficient. [Chapter 9]
Using a multilevel configuration scheme, the deployment of chip resources, including those
for instructions, can be deferred until run-time. Consequently, resource allocation, instruction
distribution, and control can be tailored to the needs of the application, making such a device
efficient over a broader range of application characteristics than architectures whose resources
are bound at fabrication time. [Chapter 13]
There are three primary consumers of area on reconﬁgurable components: (1) instructions,
(2) interconnect, and (3) intermediate data.
– Task descriptions (instructions) are small compared to their physical realizations.
[Chapter 7, Chapter 9, and Section 10.4]
– Nonetheless, instruction storage space is not trivial. A large number of instructions
(typically 10-100) often take up as much space as the active interconnect and computa-
tional elements required to actually perform the instruction. [Chapter 7, Chapter 9, and
Section 10.4]
– We can compress the area for an implementation by increasing the instruction to active
area ratio, but the beneﬁts diminish past the point where the total area for stored
instruction and data equal the active area on which they are evaluated. [Chapter 9]
– The “optimal” amount of each of these resources arises from different sources. [Section 10.1]
Instructions and intermediates are dictated by the computational task to be performed.
Active interconnect and, to a lesser extent, active compute resources are dictated by
the ratio between desired computational throughput and primitive computational
speed.
Interconnect is the dominant feature determining device area in conventional FPGAs. [Sec-
tions 7.1, 7.6, and 7.7]
Interconnect requirement growth is superlinear in array size. Consequently, either interconnect
area will continue to grow relative to non-interconnect area, or gate utilization will
decrease as array sizes grow. [Sections 7.6 and 7.7]
Since the non-interconnect area is trivial compared to network area for conventional FPGAs,
optimizing for gate utilization is often short sighted and can result in unnecessarily large
implementations. [Section 7.7]
There are two interconnect functions typically required to realize a computation – spatial
transport and temporal transport. To use silicon area most efﬁciently, these should be
separated and handled via different mechanisms. [Chapter 11, especially Section 11.1]
– Data values can be transported forward in time through registers or memories. While
this ties up register area for the period of transport, it is much cheaper than tying up
critical active, routing resources which occupy much more area.
– Active interconnect can easily be the dominant area feature on a general-purpose device.
It is used most efficiently when its resources are pipelined and reused at their capacity
level – i.e. wires and switches should not sit idle holding a value once it has propagated
past them. Rather, they should be redeployed to route new data once they have
performed their spatial transport task.
Memory plays two fundamental roles in reconfigurable computing architectures: (1) storage
for instructions, (2) retiming of intermediate data. Both roles arise from the sharing of expensive,
active hardware resources among multiple logical functions. [Identified in Chapters 9
through 11 and summarized in Section 15.1]
Since interconnect is the major consumer of space on FPGAs, conventional architectures
limit the interconnect by depopulating interconnect switches as much as possible. [Chapter 7,
especially Sections 7.4 and 7.5]
Physical place and route on devices with limited interconnect is computationally difficult
because it is necessary to simultaneously satisfy a large number of constraints in order to find
a valid mapping of the design netlist onto the physical network. [Chapter 12]
We can alleviate the place and route problem in several different ways, each with different
costs:
– Provide rich interconnect (e.g. HP PLASMA). Easier mapping comes at the cost of
greater cell area and lower computational density. [Section 12.8]
– Provide rich, time-switched interconnect (e.g. UCB DHARMA). Rigid evaluation
levels and lack of retiming can make this an expensive solution, as well, especially for
larger arrays. [Section 12.8]
– Provide rich retiming and time-switching (e.g. TSFPGA). Cell area can actually be
lower than conventional FPGAs, but is higher than in DPGAs. This scheme sacriﬁces
the high, peak computational throughput of traditional FPGAs. [Chapter 12]
– Eliminate interconnect (e.g. University of Toronto VEGA). This approach saves some
additional area over DPGAs, but at the cost of signiﬁcantly lower computational
throughput and density than all other options. [Section 12.8]
Our focus and demonstration of these characteristics has been within the limited realm of RP-
space. Nonetheless, most of the features which characterize RP-space show up more generally in
general-purpose computational devices. Consequently, many of the characteristics identiﬁed here
may have broader application to the extent they are not dominated by effects abstracted away in
the RP-space model.
Terminology
$\tau$ see tau.
$\lambda$ see lambda.
active computing resources The portions of a general-purpose architecture which actually com-
pute results or transport data – e.g. ALUs, switches, wires. The term is typically used to
distinguish such resources from overhead resources used to store descriptions or intermediate
data.
active interconnect Switches and wires which actually produce a physical connection between
a source and a destination. The term is used to distinguish resources used to actually
perform switching from descriptions of switching operations or storage for intermediate data.
Chapter 7 is primarily focussed on active interconnect, while Chapters 11 and 12 introduce
forms of switched interconnect where the distinction becomes quite important.
bit processing element A generic term for the primitive computational unit which produces one
bit of result. Conventionally, each FPGA LUT is a bit processing element, as is each bit-slice
in an SIMD ALU datapath. See Chapter 8.
context A generic term used to refer to a slice of instructions and intermediate data used by a
general-purpose device on a single cycle. See conﬁguration context and data context.
control stream An independent thread of execution. When the computation varies with time and
data, the control stream determines which sets of instructions are executed on a given cycle.
A computational device may support a single control stream (e.g. processors, SIMD, pure
VLIW) or multiple control streams (e.g. MSIMD, MIMD). See Section 8.5.
configurable computing Computing by configuring interconnect between programmable function
units to wire up computations spatially. See Sections 1.3 and 2.3.
configurable computing architectures Architectures where there is only one or a few instructions
loaded per active computing element and there is limited bandwidth to reload an entire
configuration context. These architectures are used for configurable computing where the
computation is typically arranged via spatial interconnect of computing elements as opposed
to programmable computing architectures which realize computation by rapid temporal reuse
of a few, central active computing resources. See Section 2.3.
computational density See functional density.
computational throughput Computations performed per unit time, i.e. operations completed
per unit time.
conﬁguration context The collection of bits which describe the behavior of a general-purpose
machine on one operation cycle. Equivalently, the collection of all instructions required to
specify the behavior of a general-purpose device at one point in time. See Section 2.3.
data context The data used by a general-purpose device on one cycle of execution.
distance delay The critical path delay through a placed circuit taking into account the distance
between logically adjacent functional units. See Section 12.6.
datapath granularity Datapath width. The number of bit processing elements or interconnect
switches controlled in SIMD fashion by a single instruction. See Section 8.3.1.
deployable resources Resources whose role can be determined at run-time. e.g. A memory
which can be used as an instruction store or as a data store; Interconnect which can be used
to distribute instructions or to deliver data between functional units. Distinguished from
resources which are dedicated to a single function at fabrication time. See Section 13.1.
dynamic Marked by a continuous usually productive activity or change. In this context usually
used to distinguish quantities, particularly, instructions, which change on a cycle-by-cycle
basis. Contrast with static and quasistatic. See Section 10.3.4.
dynamic instruction distribution Instruction distribution allowing instructions to change on a
cycle-by-cycle basis. See Section 10.3.4.
DPGA Dynamically Programmable Gate Array – Fine-grained programmable array where each
processing element has a small, local configuration memory allowing processing elements to
change instructions, array-wide, on a cycle-by-cycle basis. See Chapters 10 and 11.
FPGA Field Programmable Gate Array – A collection of conﬁgurable processing units embedded
in a conﬁgurable interconnection network. See Sections 2.4 and 4.5.
functional density Computations performed per unit space-time. Usually measured in Ops/$\lambda^2$s.
See Section 2.6.1.
functional diversity The number of different functions which are resident and rapidly accessible
from a unit of computational area. The density of instructions stored on a general-purpose
computing device. See Section 2.6.2.
general-purpose computing Computing using devices which can be conﬁgured to solve any
number of computing tasks. See Section 2.1.
iDPGA Dynamically Programmable Gate Array with input retiming registers – A DPGA including
input retiming registers. See Chapter 11.
input depth The temporal range of input retiming registers in the iDPGA or similar architectures.
See Chapter 11.
input folding A style for reducing the amount of active switching interconnect by sharing crossbar
inputs among multiple sources. See Section 12.2.
instruction The set of bits which describe the behavior of one computational unit and its associated
interconnect. See Section 2.2.2.
instruction context See conﬁguration context.
instruction density See functional diversity.
instruction depth Number of instructions per compute element stored local to the compute ele-
ment.
irregular computing task A task which requires a large sequence of different computations and
where operations are heavily data-dependent. See Section 2.5.
Kolmogorov complexity Of all programs which can be used to calculate a particular set of values,
the length of the smallest such one. Ultimately, this is the least number of bits with which a
piece of data can be described. Kolmogorov complexity is, primarily, a conceptual description
of the lower bound as there is no algorithmic way to find such a bound. See any information
theory text such as [CT91].
lambda ($\lambda$) – half the minimum feature size in a silicon process. Lambda is used to normalize out
the effects of different process sizes when comparing implementations. Area normalized to
$\lambda^2$ units is roughly comparable between processes which differ primarily in feature size. See
Section 2.6.1.
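As a rough illustration of this normalization, the sketch below converts a drawn area into lambda-squared units. The function name, the choice of micrometer units, and the example numbers are assumptions made for illustration; they are not measurements from this report.

```python
def area_in_lambda_squared(area_um2, feature_size_um):
    """Normalize an area in square micrometers to lambda-squared units.

    lambda is taken as half the minimum feature size, as in the glossary entry.
    """
    if feature_size_um <= 0:
        raise ValueError("feature size must be positive")
    lam = feature_size_um / 2.0
    return area_um2 / (lam * lam)


# Hypothetical example: a 100 square-micrometer block measured in two processes.
coarse = area_in_lambda_squared(100.0, 1.0)  # lambda = 0.5 um
fine = area_in_lambda_squared(100.0, 0.5)    # lambda = 0.25 um
```

A layout that shrinks linearly with feature size occupies the same number of lambda-squared units in both processes, which is what makes the normalized figure roughly comparable.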
low instruction entropy Computing tasks which require a limited set of operations with very
regular ﬂow, admitting to heavy compression of instruction distribution requirements. See
Section 8.3.
lookup table A small, typically programmable, memory where the address bits act as inputs and
data read out serves as an output. An n-input, m-output lookup table can implement any
deterministic mapping between n input bits and m output bits. We frequently refer to a
k-input, 1-output lookup table as a k-LUT. See Section 2.4.
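The definition above can be sketched in a few lines: a single-output lookup table is a small memory addressed by its input bits. The class name, the label k for the number of inputs, and the bit ordering (input 0 is the least significant address bit) are illustrative assumptions, not conventions taken from Section 2.4.

```python
class LookupTable:
    """A k-input, 1-output lookup table modeled as a 2**k-entry memory."""

    def __init__(self, k, truth_table):
        if len(truth_table) != 2 ** k:
            raise ValueError("truth table must have 2**k entries")
        self.k = k
        self.table = list(truth_table)

    @classmethod
    def from_function(cls, k, fn):
        # Program the memory by evaluating fn on every input combination.
        table = []
        for address in range(2 ** k):
            bits = [(address >> i) & 1 for i in range(k)]
            table.append(int(bool(fn(*bits))))
        return cls(k, table)

    def evaluate(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("expected %d inputs" % self.k)
        # The input bits form the address; the stored bit is the output.
        address = 0
        for i, bit in enumerate(inputs):
            address |= (bit & 1) << i
        return self.table[address]


# The same 3-input structure implements different functions depending on its contents.
xor3 = LookupTable.from_function(3, lambda a, b, c: a ^ b ^ c)
majority = LookupTable.from_function(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
```

Changing only the stored table changes the function computed, which is the sense in which the lookup table is programmable.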
LUT Look Up Table – see lookup table.
MATRIX Multiple ALU architecture with Reconfigurable Interconnect – A flexible general-
purpose computing architecture which defers binding of instructions and instruction resources
until use. Instruction storage and distribution resources are unified with datapath compute,
memory, and interconnect resources, allowing the basic instruction architecture to be defined
at run-time. See Chapter 13.
metaconfiguration A higher and more primitive level of configuration than traditional instructions
which defines the sources and distribution paths for dynamic control including instructions.
See multi-level configuration. See Sections 13.1 and 10.8.2.
microcycle One primitive machine cycle on architectures which evaluate logical tasks over several
smaller clock cycles. See Section 10.5.1. Microcycle evaluation is a common theme in
Chapters 10 through 13.
multicontext Having more than one configuration for the entire general-purpose device. Usually
used to refer to devices or architectures which hold multiple such configurations on chip. Also
used to describe evaluation schemes which compute a result using more than one device-wide
configuration. See Chapter 10.
multi-level configuration Hierarchical configuration where higher levels of configuration describe
the architecture, behavior, and distribution used by lower levels of configuration. See
metaconfiguration. See Section 13.1.
output folding A style for reducing the amount of active switching interconnect by sharing crossbar
outputs among multiple sinks. See Section 12.2.
partial reconﬁguration The ability for individual or small numbers of processing units to change
instructions without requiring an entire reload of all instructions across a general-purpose
computing device. See Sections 8.3.3 and 10.3.4.
quasistatic Changing, but on a time scale much slower than standard operation. An intermediate
point of activity between dynamic and static.
quasistatic instruction distribution Instructions which change during an application, but do so
slowly compared to the rate of execution. A quasistatic instruction might be in effect for
hundreds of cycles before changing. See Section 10.3.4.
Rent’s Rule An empirical relationship between the number of i/o’s in and out of a cluster of logic
and the number of logical elements inside the logic (N_io = c·N^p). See Section 7.6.
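Rent's Rule is commonly written as N_io = c·N^p, where c is the i/o count of a single element and p is the Rent exponent. The sketch below assumes that standard form; the function names and the example constants are illustrative assumptions, not values taken from Section 7.6.

```python
import math


def rent_io(n_elements, c, p):
    """Estimated i/o count for a cluster of n_elements logic elements."""
    return c * n_elements ** p


def rent_exponent(n1, io1, n2, io2):
    """Estimate the Rent exponent p from two (cluster size, i/o count) samples."""
    return math.log(io2 / io1) / math.log(n2 / n1)


# Hypothetical parameters: c = 4 i/o's per element, p = 0.5.
io_64 = rent_io(64, c=4.0, p=0.5)
io_256 = rent_io(256, c=4.0, p=0.5)
p_est = rent_exponent(64, io_64, 256, io_256)
```

With p below one, i/o grows more slowly than the number of elements enclosed, so quadrupling the cluster size here only doubles the i/o estimate.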
regular computing task Tasks which need to repeatedly perform the same collection of operations
to a large amount of data with little data-dependent flow control. See Section 2.5.
retiming Changing the time at which particular events occur. In this work, used largely to describe
the transportation of signals forward in time from the point in time when they are generated
to the point in time when they are consumed. See Section 10.1. Retiming is a major theme
in Chapters 10 through 12.
robust architectural points Design points where we can bound the inefﬁciency to some constant
percentage when the task has different characteristics from the architecture. See Chapter 9
starting in Section 9.3.
RP-space A high-level abstraction of the reconfigurable computing design space parameterized by
key instruction and interconnect features. See Chapter 9.
run-time reconfiguration The ability to change device configuration during a computational task.
segmentable datapath A SIMD controlled -bit datapath which can be dynamically or quasistat-
ically reconﬁgured to treat the datapath as , -bit words, for certain, restricted, values of .
See Section 13.4.
subarray An organizational unit in array architectures composed of multiple processing elements
but not the entire device. In the DPGA and TSFPGA, the subarray deﬁnes the extent of
local interconnect and the set of processing elements which share common resources such as
decoders and instruction distribution. See Section 10.4.1.
spatial transport Movement of intermediate data in space from the point of production to the
point of consumption. See Section 11.1.
static Showing little change; characterized by a lack of movement, animation or progression. In
this context used primarily to distinguish values and instructions which do not change during
an operational epoch. Contrast with dynamic and quasistatic. See Section 10.3.4.
static instruction distribution Instruction distribution where instructions are set at the beginning
of a computational task and do not change during execution. See Section 10.3.4.
programmable computing architectures General-purpose computing architectures which heav-
ily and rapidly reuse a single or small number of active computing resources for many
different functions (e.g. conventional microprocessors). See Section 2.3.
tau (τ) The delay parameter for a process. One τ is the delay required for one inverter to drive a
single, equally large inverter.
temporal pipelining Reusing general-purpose resources in time to evaluate different components
of a single logical task. Like spatial pipelining, the result is produced after traversing a
number of pipelining stages. Unlike spatial pipelining, the same physical resources are used
to evaluate each stage of the pipeline. Temporal pipelining reduces spatial requirements,
whereas spatial pipelining increases throughput. See Sections 10.1 and 10.5.1.
temporal transport Movement of intermediate data in time from the microcycle on which the
value is produced to the one where it is consumed. See Section 11.1.
timestep A particular microcycle in the evaluation of a computing task. See Section 12.1.
time-switched input register An input register supporting data retiming on architectures which
time-switch their interconnect. The input register loads the value from its associated network
output only when the current timestep matches a programmed value. See Section 12.1.
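The load rule in this entry can be sketched as a small behavioral model. The class and method names are hypothetical, and the model ignores clocking and electrical detail; it shows only the compare-and-load behavior described above.

```python
class TimeSwitchedInputRegister:
    """Behavioral model: capture the network output on one programmed timestep."""

    def __init__(self, load_timestep):
        self.load_timestep = load_timestep  # programmed value to match against
        self.value = None                   # nothing captured yet

    def clock(self, current_timestep, network_output):
        # Load only when the current timestep matches the programmed value;
        # otherwise hold the previously captured value.
        if current_timestep == self.load_timestep:
            self.value = network_output
        return self.value


# Hypothetical trace: the shared network carries a different signal each timestep.
register = TimeSwitchedInputRegister(load_timestep=2)
network_trace = [5, 7, 9, 11]
held = [register.clock(t, v) for t, v in enumerate(network_trace)]
```

The register holds the value seen on timestep 2 for the later timesteps, which is how the value is retimed to the point where it is consumed.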
TSFPGA Time-Switched Field Programmable Gate Array – Fine-grained programmable array
where the physical interconnect is shared and switched in time. See Chapter 12.
yielded computational density The effective computational density which an application or task
extracts from a computational device. Mismatches in datapath granularity, interconnect
richness, or control may cause a device to provide computational capacity below its peak.
See Section 2.6.1 and examples given in Chapter 4.
Bibliography
[ABI 95] K. Asanovic, J. Beck, B. Irissou, D. Kingsbury, N. Morgan, and J. Wawrzynek.
The T0 Vector Microprocessor. In Hot Chips VII Proceedings, August 1995.
[ACC 96] Rick Amerson, Richard Carter, W. Bruce Culbertson, Phil Kuekes, and Greg Snider.
Plasma: An FPGA for Million Gate Systems. In Proceedings of the International
Symposium on Field Programmable Gate Arrays, pages 10–16, February 1996.
[ADD90] Creigton Asato, Christoph Ditzen, and Suresh Dholakia. A Data-Path Multiplier
with Automatic Insertion of Pipeline Stages. IEEE Journal of Solid-State Circuits,
25(2):383–387, August 1990.
[AFM 89] Kazutami Arimoto, Kazuyasu Fujishima, Yoshio Matsuda, Masaki Tsukude,
Tukasa Oishi, Wataru Wakamiya, Shin-Ichi Satoh, Michihiro Yamada, and Takao
Nakano. A 60-ns 3.3-V-Only 16-Mbit DRAM with a Multipurpose Register. IEEE
Journal of Solid-State Circuits, 24(5):1176–1183, October 1989.
[AKY 96] Yoshiharu Aimoto, Tohru Kimura, Yoshikazu Yabe, Hideki Heiuchi, Youetsu
Nakazawa, Masato Motomura, Takuya Koga, Yoshihiro Fujita, Masayuki Hamada,
Takaho Tanigawa, Hajime Nobusawa, and Kuniaki Koyama. A 7.68 GIPS 3.84 GB/s
1W Parallel Image-Processing RAM Integrating a 16Mb DRAM and 128 Pro-
cessors. In 1996 IEEE International Solid-State Circuits Conference, Digest of
Technical Papers, pages 372–373. IEEE, February 1996.
[AL94] Aditya A. Agarwal and David Lewis. Routing Architectures for Hierarchical Field
Programmable Gate Arrays. In Proceedings 1994 IEEE International Conference
on Computer Design, pages 475–478. IEEE, October 1994.
[Alg90] Algotronix Ltd., Edinburgh, UK. The Conﬁgurable Logic Data Book, 1990.
[Alt94] Altera Corporation, 2610 Orchard Parkway, San Jose, CA 95134-2020. FLEX
8000 Handbook, May 1994.
[Alt95] Altera Corporation, 2610 Orchard Parkway, San Jose, CA 95134-2020. Data Book,
March 1995.
[Alt96] Altera Corporation, 2610 Orchard Parkway, San Jose, CA 95134-2020. Digital
Signal Processing in FLEX Devices, January 1996.
[ANAB 92] Fuad Abu-Nofal, Rick Avra, Kanti Bhabuthmal, Rob Bhamidipaty, Greg Blanck,
Andy Charnas, Peter DelVecchio, Joe Grass, Joel Grinberg, Norm Hayes, George
Haber, Jim Hunt, Govind Kizhepat, Adam Malamy, Al Marston, Kaushal Mehta,
Sunil Nanda, Hoa Van Nguyen, Rajiv Patel, Andy Ray, Jim Reaves, Alan Rogers,
Stefan Rusu, Tom Shay, Irwan Sidharta, Terry Tham, Peter Tong, Richard Trauben,
Anthony Wong, David Yee, Naeem Maan, Don Steiss, and Lynn Youngs. A
Three-Million-Transistor Microprocessor. In 1992 IEEE International Solid-State
Circuits Conference, Digest of Technical Papers, pages 108–109. IEEE, February
1992.
[ANH 88] Masakazu Aoki, Yoshinobu Nakagome, Masashi Horiguchi, Hitoshi Tanaka,
Shin’ichi Ikenaga, Jun Etoh, Yoshifumi Kawamoto, Shin’ichiro Kimura, Eiji
Takeda, Hideo Sunami, and Kiyoo Itoh. A 60-ns 16-Mbit CMOS DRAM with a
Transposed Data-Line Structure. IEEE Journal of Solid-State Circuits, 23(5):1113–
1119, October 1988.
[AOT 94] Mikio Asakura, Tsukasa Ooishi, Masaki Tsukude, Shigeki Tomishima, Takahisa
Eimori, Hideto Hidaka, Yoshikazu Ohno, Kazutani Arimoto, Kazuyasu Fujishima,
Tadashi Nishimura, and Tsutomu Yoshihara. An Experimental 256-Mb DRAM
with Boosted Sense-Ground Scheme. IEEE Journal of Solid-State Circuits,
29(11):1303–1308, November 1994.
[AS93] Peter Athanas and Harvey F. Silverman. Processor Reconﬁguration Through
Instruction-Set Metamorphosis. IEEE Computer, 26(3):11–18, March 1993.
[ASO 90] Shingo Aizaki, Toshiyuki Shimizu, Masayoshi Ohkawa, Kazuhiko Abe, Akane
Aizaki, Manabu Ando, Osamu Kudoh, and Isao Sasaki. A 15-ns 4-Mb CMOS
SRAM. IEEE Journal of Solid-State Circuits, 25(5):1063–1067, October 1990.
[Atm94] Atmel Corporation, 2125 O’Nel Drive, San Jose, CA 95131. Conﬁgurable Logic
Design and Application Book, 1994.
[ATT94] AT&T Microelectronics, 555 Union Boulevard, Room 21Q-133BA, Allentown, PA
18103. Implementing and Optimizing Multipliers in ORCA FPGAs, November
1994.
[ATT95] AT&T Microelectronics, 555 Union Boulevard, Room 21Q-133BA, Allentown, PA
18103. AT&T Field-Programmable Gate Arrays Data Book, April 1995.
[AWG94] Lalit Agarwal, Mike Wazlowski, and Sumit Ghosh. An Asynchronous Approach
to Efﬁcient Execution of Programs on Adaptive Architectures Utilizing FPGAs.
In Duncan Buell and Ken Pocek, editors, Proceedings of the IEEE Workshop
on FPGAs for Custom Computing Machines, pages 101–110, Los Alamitos,
California, April 1994. IEEE Computer Society, IEEE Computer Society Press.
[BAB 95] William Bowhill, Randy Allmon, Shane Bell, Elizabeth Cooper, Dale Donchin,
John Edmondson, Timothy Fischer, Paul Gronowski, Anil Jain, Patricia Kroe-
sen, Bruce Loughlin, Ronald Preston, Paul Rubinfeld, Michael Smith, Stephen
Thierauf, and Gilbert Wolrich. A 300MHz 64b Quad-Issue CMOS RISC Micro-
processor. In 1995 IEEE International Solid-State Circuits Conference, Digest of
Technical Papers, pages 182–183. IEEE, February 1995.
[BBB 95] David Bearden, Roger Bailey, Brad Beavers, Carlos Gutierrez, Chin-Cheng Kau,
Kurt Lewchuk, Paul Rossback, and Mike Tabom. A 133MHz 64b Four-Issue
CMOS Microprocessor. In 1995 IEEE International Solid-State Circuits Confer-
ence, Digest of Technical Papers, pages 174–175. IEEE, February 1995.
[BCE 94] Jeremy Brown, Derrick Chen, Ian Eslick, Edward Tau, and André DeHon. A 1μ
CMOS Dynamically Programmable Gate Array. Transit Note 112, MIT Artificial
Intelligence Laboratory, November 1994. Anonymous FTP transit.ai.mit.edu:transit-notes/tn112.ps.Z.
[BCH 84] Erich K. Baier, Rainer Clemen, Werner Haug, Walter Fischer, Rolf Mueller,
Wolf Dieter Loehlein, and Horst Barsuhn. A Fast 256K DRAM Designed for
a Wide Range of Applications. IEEE Journal of Solid-State Circuits, 19(5), Octo-
ber 1984.
[BCK93] Narasimha B. Bhat, Kamal Chaudhary, and Ernest S. Kuh. Performance-Oriented
Fully Routable Dynamic Architecture for a Field Programmable Logic Device.
UCB/ERL M93/42, University of California, Berkeley, June 1993.
[BDK94] Michael Bolotski, André DeHon, and Thomas F. Knight, Jr. Unifying FPGAs and
SIMD Arrays. In FPGA Workshop, 1994. Proceedings not available outside of the
workshop; paper available as Transit Note #95. Anonymous FTP transit.ai.mit.edu:transit-notes/tn95.ps.Z.
Anonymous FTP transit.ai.mit.edu:papers/dpga-fpga94.ps.Z.
[BDN84] John J. Barnes, Armando L. DeJesus, and David Novosel. Circuit Techniques
for a 25 ns 16K×1 SRAM Using Address-Transition Detection. IEEE Journal of
Solid-State Circuits, 19(4):455–460, August 1984.
[BFRV92] Stephen D. Brown, Robert J. Francis, Jonathan Rose, and Zvonko G. Vranesic.
Field-Programmable Gate Arrays. Kluwer Academic Publishers, 101 Philip Drive,
Assinippi Park, Norwell, Massachusetts, 02061 USA, 1992.
[Bha93] Narasimha B. Bhat. Novel Techniques for High Performance Field Programmable
Logic Devices. UCB/ERL M93/80, University of California, Berkeley, November
1993.
[BLMR83] Ted Burggraff, Al Love, Richard Malm, and Ann Rudy. The IBM Los Gatos Logic
Simulation Machine Hardware. In Proceedings of the International Conference on
Computer Design, pages 584–587, October 1983.
[BMNW87] Gerald Boudun, Pierre Mollier, Jean Nuez, and Franck Wallart. A 30ns-32b Pro-
grammable Arithmetic Operator. In 1987 IEEE International Solid-State Circuits
Conference, Digest of Technical Papers, pages 54–55. IEEE, February 1987.
[Bri90] Timothy Bridges. The GPA Machine: A Generally Partitionable MSIMD Archi-
tecture. In Proceedings of the Third Symposium on The Frontiers for Massively
Parallel Computations, pages 196–202. IEEE, 1990.
[Bro92] Stephen Brown. Routing Algorithms and Architectures for Field-Programmable
Gate Arrays. PhD thesis, University of Toronto, January 1992.
[BRV89] Patrice Bertin, Didier Roncin, and Jean Vuillemin. Introduction to Programmable
Active Memories. PRL Report 3, DEC Paris Research Laboratory, 85, Av. Victor
Hugo, 92563 Rueil-Malmaison Cedex, France, June 1989.
[BRV92] Patrice Bertin, Didier Roncin, and Jean Vuillemin. Programmable Active Memo-
ries: A Performance Assessment. PRL Report, DEC Paris Research Laboratory, 85,
Av. Victor Hugo, 92563 Rueil-Malmaison Cedex, France, June 1992.
[BSV 95] Michael Bolotski, Thomas Simon, Carlin Vieri, Rajeevan Amirtharajah, and
Thomas F. Knight Jr. Abacus: A 1024 Processor 8ns SIMD Array. In Ad-
vanced Research in VLSI 1995, 1995. Anonymous FTP ftp.ai.mit.edu:pub/users/misha/arvlsi95.ps.gz.
[BTA93] Jonathan Babb, Russell Tessier, and Anant Agarwal. Virtual Wires: Overcoming
Pin Limitations in FPGA-based Logic Emulators. In Duncan A. Buell and Ken-
neth L. Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom
Computing Machines, pages 142–151, Los Alamitos, California, April 1993. IEEE
Computer Society, IEEE Computer Society Press.
[CBBF87] Craig Caren, Bruce Benjamin, James Boddie, and Michael Fuccio. A 60ns CMOS
DSP with On-Chip Instruction Cache. In 1987 IEEE International Solid-State
Circuits Conference, Digest of Technical Papers, pages 156–157. IEEE, February
1987.
[CC86] Remi Cissou and Remy Chapelle. A High-Speed 640kbit CMOS RAM. IEEE
Journal of Solid-State Circuits, 21(3):390–396, June 1986.
[CCS 91] Terry Chappell, Barbara Chappell, Stanley Schuster, James Allan, Stephen Klep-
ner, Rajiv Joshi, and Robert Franch. A 2-ns Cycle, 3.8ns Access 512-kb CMOS
ECL SRAM with a Fully Pipelined Architecture. IEEE Journal of Solid-State
Circuits, 26(11):1577 ff., November 1991.
[CD96] Derrick Chen and André DeHon. TSFPGA: A Fine-Grain Reconfigurable Archi-
tecture with Time-Switched Interconnect. Transit Note 134, MIT Artificial Intel-
ligence Laboratory, January 1996. Anonymous FTP transit.ai.mit.edu:transit-notes/tn134.ps.Z.
[CDd 95] A. Charmas, A. Dalal, P. deDood, P. Ferolito, B. Frederick, O. Geva, D. Greenhill,
H. Hingarh, J. Kaku, L. Kohn, L. Lev, M. Levitt, R. Melanson, S. Mitra, R. Sundar,
M. Tamjidi, P. Wang, D. Wendell, R. Yu, and G. Zyner. A 64b Microprocessor with
Multimedia Support. In 1995 IEEE International Solid-State Circuits Conference,
Digest of Technical Papers, pages 178–179. IEEE, February 1995.
[CDF 86] William S. Carter, Khue Duong, Ross H. Freeman, Hung-Cheng Hsieh, Jason Y.
Ja, John E. Mahoney, Luan T. Ngo, and Shelly L. Sze. A User Programmable
Reconfigurable Logic Array. In IEEE 1986 Custom Integrated Circuits Conference,
pages 233–235. IEEE, May 1986.
[CDF 95] Jonathan Change, Anand Dharmaraj, Michael Filardo, Astushi Ike, Bala Joshi,
Takeshi Kitahara, Anand Krishnamoorthy, Simon Li, Sanjay Mansingh, Osamu
Moriyama, Arvind Narayan, Kesiraju Rao, Murugappan Ramaswami, Farnad Saj-
jadian, Mike Simone, Gene Shen, Ravi Swami, John Szeto, Viji Thirumalaiswamy,
Shalesh Thusoo, and DeFrost Tovey. SPARC64+: HaL’s Second Generation 64-
bit SPARC Processor. In Proceedings of Hot Chips VII, page 3.2, August 1995.
http://www.hal.com/docs/PS/sparc64_plus.ps.
[CDH 88] Sow Chu, Jan Dikken, Cornelis Hartgring, Frans List, John Raemaekers, Simon
Bell, Brendan Walsh, and Roelof Salters. A 25-ns Low-Power Full-CMOS 1-
Mbit (128K×8) SRAM. IEEE Journal of Solid-State Circuits, 23(5):1078–1084,
October 1988.
[CH84] Larry F. Childs and Ryan T. Hirose. An 18 ns 4K×4 CMOS SRAM. IEEE Journal
of Solid-State Circuits, 19(5):545–551, October 1984.
[Cha93] Kenneth David Chapman. Fast Integer Multipliers ﬁt in FPGAs. EDN,
39(10):80, May 12 1993. Anonymous FTP www.ednmag.com:EDN/di_sig/DI1223Z.ZIP.
[Cho89] Paul Chow, editor. The MIPS-X RISC Microprocessor. Kluwer Academic Pub-
lishers, 1989.
[CKC 89] Daeje Chin, Changhyun Kim, Yunho Choi, Dong-Sun Min, Hong Sun Hwang,
Hoon Choi, Sooin Cho, Tae Young Chung, Chan J. Park, Yunseung Shin, Kwang-
pyuk Suh, and Yong Park. An Experimental 16-Mbit DRAM with Reduced Peak-
Current Noise. IEEE Journal of Solid-State Circuits, 24(5):1191–1198, October
1989.
[Cla95] Peter Clarke. Pilkington Preps Reconfigurable Video DSP. Electronic Engineering
Times, page 16, August 7 1995. Online briefing http://www.pmel.com/dsp.html.
[cLCWMS96] Chihchang Lin, Douglas Chang, Yu-Liang Wu, and Malgorzata Marek-Sadowska.
Time-Multiplexed Routing Resources for FPGA Design. In Proceedings of the
Custom Integrated Circuits Conference, May 1996.
[CLRA90] Mike Cai, Daniel Luthi, Peter Ruetz, and Peng Ang. A 40 MHz Programmable and
Reconﬁgurable Filter Processor. In Proceedings of the 1990 Custom Integrated
Circuits Conference, pages 13.2.1–13.2.4. IEEE, May 1990.
[CME93] Chi-Jui Chou, Satish Mohanakrishnan, and Joseph B. Evans. FPGA Implemen-
tation of Digital Filters. In International Conference on Signal Processing Ap-
plications and Technology, 1993. Anonymous FTP ftp.tisl.ukans.edu:pub/projects/DSP/FPGA/Digital_Filters.ps.
[CR92] Dev C. Chen and Jan M. Rabaey. A Reconﬁgurable Multiprocessor IC for Rapid
Prototyping of Algorithmic-Speciﬁc High-Speed DSP Data Paths. IEEE Journal
of Solid-State Circuits, 27(12):1895–1904, December 1992.
[CSA 91] Paul Chow, Soon Ong Seo, Dennis Au, Terrence Choy, Bahram Fallah, David
Lewis, Cherry Li, and Jonathan Rose. A 1.2μm CMOS FPGA using Cascaded
Logic Blocks and Segmented Routing. In Will Moore and Wayne Luk, editors,
FPGAs, pages 91–102. Abingdon EE&CS Books, 15 Harcourt Way, Abingdon,
OX14 1NV, UK, 1991.
[CT91] Thomas Cover and Joy Thomas. Elements of Information Theory. John Wiley and
Sons, Inc., New York, 1991.
[CTK 89] Shizuo Chou, Tsuneo Takano, Akio Kita, Fumio Ichikawa, and Masaru Uesugi. A
60-ns 16-Mbit DRAM with a Minimized Sensing Delay Caused by Bit-Line Stray
Capacitance. IEEE Journal of Solid-State Circuits, 24(5):1176–1183, October
1989.
[D 92] William J. Dally et al. The Message-Driven Processor: A Multicomputer Process-
ing Node with Efficient Mechanisms. IEEE Micro, pages 23–39, April 1992.
[DeH94] André DeHon. DPGA-Coupled Microprocessors: Commodity ICs for the Early
21st Century. In Proceedings of the IEEE Workshop on FPGAs for Custom
Computing Machines, April 1994. Anonymous FTP transit.ai.mit.edu:papers/dpga-proc-fccm94.ps.Z.
[Den82] Monty Denneau. The Yorktown Simulation Engine. In 19th Design Automation
Conference, pages 55–59. IEEE, 1982.
[DMNSV88] Srinivas Devadas, Hi-Keung Ma, A.R. Newton, and Alberto Sangiovanni-
Vincentelli. MUSTANG: State Assignment of Finite State Machines Targeting
Multilevel Logic Implementations. IEEE Transactions on Computer-Aided De-
sign of Integrated Circuits and Systems, 7(12):1290–1300, December 1988.
[Don74] Wilm E. Donath. Equivalence of Memory to “Random Logic”. IBM Journal of
Research and Development, 18(5):401–407, September 1974.
[Don79] Wilm E. Donath. Placement and Average Interconnection Lengths of Computer
Logic. IEEE Transactions on Circuits and Systems, 26(4):272–277, April 1979.
[Dur94] Serge Durand. FPGA DLX processor. August 22, 1994 posting to comp.arch.fpga.
Author may be reached at durand@lslsun4.epfl.ch, December
1994.
[DWA 92] Daniel Dobberpuhl, Richard Witek, Randy Allmon, Robert Anglin, Sharon Brit-
ton, Linda Chao, Robert Conrad, Daniel Dever, Bruce Gieseke, Gregory Hoeppner,
John Kowaleski, Kathryn Kuchler, Maureen Ladd, Michael Leary, Liam Madden,
Edward McLellan, Derrick Meyer, James Montanaro, Donald Priore, Vidya Ra-
jagopalan, Sridhar Samudrala, and Sribalan Santhanam. A 200MHz 64b Dual-
Issue CMOS Microprocessor. In 1992 IEEE International Solid-State Circuits
Conference, Digest of Technical Papers, pages 106–107. IEEE, February 1992.
[EG95] Andrew Essen and Stephen Goldstein. Performance Evaluation of the Superscalar
Speculative Execution HaL SPARC64 Processor. In Proceedings of Hot Chips
VII, page 3.1, August 1995. http://www.hal.com/docs/PS/sparc64_perf.ps.
[EH94] James G. Eldredge and Brad L. Hutchings. Density Enhancement of a Neural
Network Using FPGAs and Run-Time Reconﬁguration. In Duncan A. Buell and
Kenneth L. Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for
Custom Computing Machines, pages 180–188, Los Alamitos, California, April
1994. IEEE Computer Society, IEEE Computer Society Press.
[Eps95] Dave Epstein. Chromatic Raises the Multimedia Bar. Microprocessor Report,
9(14):23ff., October 23 1995. http://www.chipanalyst.com/report/report9_14/page23.html.
[FA93] Jahil Fadavi-Ardekani. Booth Encoded Multiplier Generator Using Opti-
mized Wallace Trees. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 1(2):120–125, June 1993.
[FHR94] Allan Fisher, Peter Highnam, and Todd Rockoff. A Four-Processor Building Block
for SIMD Processor Arrays. IEEE Journal of Solid-State Circuits, 25(2):369–375,
April 1994.
[FHT 92] Hiroshige Fujii, Chikahiro Hori, Tomoji Takada, Naoyuki Hatanaka, Tatsuhiko
Demura, and Goichi Ootomo. A Floating-Point Cell Library and a 100-MFLOPS
Image Signal Processor. IEEE Journal of Solid-State Circuits, 27(7):1080–1088,
July 1992.
[FKM83] Allan L. Fisher, H. T. Kung, and Louis M. Monier. Architecture of the PSC: A
Programmable Systolic Chip. In Proceedings of the 10th Annual International
Symposium on Computer Architecture, pages 48–53, June 1983.
[FKS91] Richard Forsyth, Bob Krysiak, and Roger Shepherd. T9000 – Superscalar Trans-
puter. In Proceedings of Hot Chips III, pages 8.15–8.25, August 1991.
[Fly66] Michael J. Flynn. Very High Speed Computing Systems. Proceedings of the IEEE,
54:1901–1909, 1966.
[Fly72] Michael J. Flynn. Some Computer Organizations and Their Effectiveness. IEEE
Transactions on Computers, C-21(9):948–960, September 1972.
[FM82] C. M. Fiduccia and R. M. Mattheyses. A Linear Time Heuristic for Improving
Network Partitions. In Proceedings of the 19th Design Automation Conference,
pages 175–181, 1982.
[FOS 89] Syuso Fujii, Masaki Ogihara, Mitsuru Shimizu, Munehiro Yoshida, Kenji Numata,
Takahiko Hara, Shigeyoshi Watanabe, Shizuo Sawada, Tomohisa Mizuno, Jun-
pei Kumagai, Susumu Yoshikawa, Seiji Kaki, Yoshikazu Saito, Hideaki Aochi,
Takeshi Hamamoto, and Koichi Toita. A 45-ns 16-Mbit DRAM with Triple-well
Structure. IEEE Journal of Solid-State Circuits, 24(5):1170–1175, October 1989.
[Fos96] Richard Foss. Implementing Application Speciﬁc Memory. In 1996 IEEE Inter-
national Solid-State Circuits Conference, pages 260–261. IEEE, February 1996.
[FOW 86] Tohru Furuyama, Takashi Ohshawa, Yohji Watanabe, Hidemi Ishiuchi, Toshiharu
Watanabe, Takeshi Tanaka, Kenji Natori, and Osamu Ozawa. An Experimental 4-
Mbit CMOS DRAM. IEEE Journal of Solid-State Circuits, 21(5):605 ff., October
1986.
[FPH 90] Stephen Flannagan, Perry Pelley, Norman Herr, Bruce Engles, Taisheng Feng,
Scott Nogle, John Eagan, Robert Dunnigan, Lawrence Day, and Roger Kung. 8-
ns CMOS 64K×4 and 256K×1 SRAM’s. IEEE Journal of Solid-State Circuits,
25(5):1049–1054, October 1990.
[Fra92] Robert Francis. Technology Mapping for Lookup-Table Based Field-
Programmable Gate Arrays. PhD thesis, University of Toronto, 1992.
[Fre94] Philip Freidin. R16: A 20MHz 16-bit RISC Processor in a XC4005. Informal pre-
sentation at FCCM’94 and comp.arch.fpga posting. Author may be reached
at fliptron@netcom.com, April 1994.
[FRV 86] Stephen Flannagan, Paul Reed, Peter Voss, Scott Nogle, Lawrence Day, David
Sheng, John Barnes, and Roger Kung. Two 13-ns 64K CMOS SRAM’s with Very
Low Active Power and Improved Asynchronous Circuit Techniques. IEEE Journal
of Solid-State Circuits, 21(5):692–703, October 1986.
[FSO 86] Syuso Fujii, Shozo Saito, Yoshio Okada, Masayuki Sato, Shizuo Sawada, Satoshi
Shinozaki, Kenji Natori, and Osamu Ozawa. A 50-μA Standby 1M×1/256K×4
CMOS DRAM with High-Speed Sense Ampliﬁer. IEEE Journal of Solid-State
Circuits, 21(5):643–647, October 1986.
[Gam81] Abbas El Gamal. Two-Dimensional Stochastic Model for Interconnections in
Master Slice Integrated Circuits. IEEE Transactions on Circuits and Systems,
28(2):127–138, February 1981.
[GBB 96] Paul Gronowski, Peter Bannon, Michael Bertone, Randel Blake-Campos, Gre-
gory Bouchard, William Bowhill, David Carlson, Ruben Castelino, Dale Donchin,
Richard Fromm, Mary Gowan, Anil Jain, Bruce Loughlin, Shekhar Mehta, Jeanne
Meyer, Robert Mueller, Andy Olesin, Tung Pham, Ronald Preston, and Paul
Robinfeld. A 433MHz 64b Quad-Issue RISC Microprocessor. In 1996 IEEE
International Solid-State Circuits Conference, Digest of Technical Papers, pages
222–223. IEEE, February 1996.
[GGA 85] Abbas El Gamal, David Gluss, Peng-Huat Ang, Jonathan Greene, and Justin
Reyneri. A CMOS 32b Wallace Tree Multiplier-Accumulator. In 1985 IEEE
International Solid-State Circuits Conference, Digest of Technical Papers, pages
194–195. IEEE, February 1985.
[GHH 96] Henry Green, Scott Harper, Rhett Hudson, Wencheng Li, Daniel Lough, Qiang
Lu, Shah Musa, Brenda O’Connor, Kevin Paar, and Peter Athanas. The Hokie
Instant RISC Microprocessor. WWW http://www.ee.vt.edu/courses/ee6504_athanas/rapid.html, 1996.
[GHK 91] Maya Gokhale, William Holmes, Andrew Kopser, Sara Lucas, Ronald Minnich,
Douglas Sweely, and Daniel Lopresti. Building and Using a Highly Programmable
Logic Array. IEEE Computer, 24(1):81–89, January 1991.
[GHS 87] Will Gubbels, Cornelis Hartgring, Roelof Salters, Jos Lammerts, Michael Tooher,
Patrick Hens, Joseph Bastiaens, Jan Dijk, and Marc Sprokel. A 40-ns/100-pF Low-
Power Full-CMOS 256K (32K×8) SRAM. IEEE Journal of Solid-State Circuits,
22(5):741 ff., October 1987.
[GK89] John Gray and Tom Kean. Configurable Hardware: A New Paradigm for Com-
putation. In Charles Seitz, editor, Advanced Research in VLSI: Proceedings of the
Decennial Caltech Conference on VLSI, pages 279–295, March 1989.
[GM93] Maya Gokhale and Ron Minnich. FPGA Computing in a Data Parallel C. In
Duncan A. Buell and Kenneth L. Pocek, editors, Proceedings of the IEEE Work-
shop on FPGAs for Custom Computing Machines, pages 94–101, Los Alamitos,
California, April 1993. IEEE Computer Society, IEEE Computer Society Press.
[GN94] Greg Goslin and Bruce Newgard. 16-TAP, 8-Bit FIR Filter Applications Guide.
Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124, November 1994.
http://www.xilinx.com/appnotes/fir_filt.pdf.
[GNAB93] Jeffrey Gray, Andrew Naylor, Arthur Abnous, and Nader Bagherzadeh. VIPER: A
VLIW Integer Microprocessor. IEEE Journal of Solid-State Circuits, 28(12):1377–
1382, December 1993.
[GNC 90] Carla Golla, Fulvio Nava, Franco Cavallotti, Alessandro Cremonesi, and Giulio
Casagrande. 30-MSamples/s Programmable Filter Processor. IEEE Journal of
Solid-State Circuits, 25(6):1502–1509, December 1990.
[GOI95] Eric Gayles, Robert Owens, and Mary Jane Irwin. The MGAP-2: A Micro-
Grained Massively Parallel Array Processor. In Eighth Annual IEEE International
ASIC Conference and Exhibit, pages 333–337, April 1995.
[GOK 92] Hiroyuki Goto, Hiroaki Ohkubo, Kenji Kondou, Masayoshi Ohkawa, Hitoshi
Mitani, Shinichi Horiba, Masakazu Soeda, Fumihiko Hayashi, Yutaro Hachiya,
Toshiyuki Shimizu, Manabu Ando, and Zensuke Matsuda. A 3.3-V 12-ns 16-Mb
SRAM. IEEE Journal of Solid-State Circuits, 27(11):1490–1496, November 1992.
[Gol87] Alex Goldberger. A High Performance, Easy to Program DSP for General Purpose
Applications. In Mini/Micro Northeast Conference Record, pages 27/31–10, April
1987.
[Gra94] Jan Gray. homebuilt processors using FPGAs (long). December 11, 1994 posting
to comp.arch.fpga. Author may be reached at jsgray@ix.netcom.com,
December 1994.
[Gra96] Jan Gray. j32 FPGA Processor. Personal communications jsgray@ix.netcom.com,
February 1996.
[Gro87] Robert Grondalski. A VLSI Chip Set for Massively Parallel Architecture. In IEEE
International Solid-State Circuits Conference, pages 198–199, 1987.
[GSNS92] Gensuke Goto, Tomio Sato, Masao Nakajima, and Takao Sukemura. A 54×54-
b Regularly Structured Tree Multiplier. IEEE Journal of Solid-State Circuits,
27(9):1229–1236, July 1992.
[HAH 92] Hideto Hidaka, Kazutami Arimoto, Kazutoshi Hirayama, Masanori Hayashikoshi,
Mikio Asakura, Masaki Tsukude, Tsukasa Oishi, Shinji Kawai, Katsuhiro
Suma, Yasuhiro Konishi, Koji Tanaka, Wataru Wakamiya, Yoshikazu Ohno, and
Kazuyasu Fujishima. A 34-ns 16-Mb DRAM with Controllable Voltage Down-
Converter. IEEE Journal of Solid-State Circuits, 27(7):1020 ff., July 1992.
[Has87] Chuck Hastings. When is a Memory Not a Memory. In Proceedings of the
Electro/87 Mini/Micro Northeast, pages 1132, 4/5/1–18, 1987.
[Haw91] David Hawley. Advanced PLD Architectures. In Will Moore and Wayne Luk, edi-
tors, FPGAs, pages 11–23. Abingdon EE&CS Books, 15 Harcourt Way, Abingdon,
OX14 1NV, UK, 1991.
[HBD94] Robert Heaton, Donald Blevins, and Edward Davis. A Bit-Serial VLSI Array Pro-
cessing Chip for Image Processing. IEEE Journal of Solid-State Circuits, 25(2):364–
368, April 1994.
[HDJ 88] Hung-Cheng Hsieh, Khue Duong, Jason Y. Ja, Roy Kanazawa, Luan T. Ngo,
Liane G. Tinkey, Ross H. Freeman, and William S. Carter. A 9000-Gate User-
Programmable Gate Array. In IEEE 1988 Custom Integrated Circuits Conference,
pages 15.3.1–7. IEEE, May 1988.
[HFML85] Dennis A. Henlin, Michael T. Fertsch, Moshe Mazin, and Edard T. Lewis. A
16×16 Bit Pipelined Multiplier Macrocell. IEEE Journal of Solid-State Circuits,
20(2):542–547, April 1985.
[HHC 87] Mark Horowitz, John Hennessy, Paul Chow, Glenn Gulak, John Acken, Anant
Agarwal, Chorng-Yeung Chu, Scott McFarling, Steven Przybylski, Steven
Richardson, Arturo Salz, Richard Simoni, Don Stark, Peter Steenkiste, Steven
Tjiang, and Malcom Wing. A 32b Microprocessor with On-Chip 2K byte Instruc-
tion Cache. In 1987 IEEE International Solid-State Circuits Conference, Digest of
Technical Papers, pages 30–31. IEEE, February 1987.
[HKKM96] Makoto Hanawa, Kenji Kaneko, Tatsuya Kawashimo, and Hiroshi Maruyama. A
4.3 ns 0.3μm CMOS 54×54 Multiplier Using Precharged Pass-Transistor Logic.
In 1996 IEEE International Solid-State Circuits Conference, Digest of Technical
Papers, pages 364–365. IEEE, February 1996.
[HKM 90] Toshihiko Hirose, Hirotada Kuriyama, Shuji Murakami, Kojiro Yuzuriha, Takao
Mukai, Kazuhito Tsutsumi, Yasumasa Nishimura, Yoshio Kohno, and Kenji
Anami. A 20-ns 4-Mb CMOS SRAM with Hierarchical Word Decoding Ar-
chitecture. IEEE Journal of Solid-State Circuits, 25(5):1068–1074, October 1990.
[HOW 86] Fumio Horiguchi, Mitsugi Ogura, Shigeyoshi Watanabe, Koji Sakui, Naokazu
Miyawaki, Yasuo Itoh, Kei Kurosawa, Fujio Masuoka, and Hisakazu Iizuka. A
High-Performance 1-Mbit Dynamic RAM with a Folded Capacitor Cell. IEEE
Journal of Solid-State Circuits, 21(6):1076–1082, December 1986.
[HP90] John Hennessy and David Patterson. Computer Architecture: A Quantitative
Approach. Morgan Kaufmann Publishers, Inc., 1990.
[HS84] Kye S. Hedlund and Lawrence Snyder. Systolic Architectures – A Wafer Scale
Approach. In Proceedings of the IEEE International Conference on Computer
Design: VLSI in Computers, pages 604–610. IEEE, IEEE Computer Society Press,
October 1984.
[HT95] Hannes Hassler and Naofumi Takagi. Function Evaluation by Table Look-up and
Addition. In Proceedings of the 12th Symposium on Computer Arithmetic, pages
10–16, July 1995.
[ID95] Tsuyoshi Isshiki and Wayne Wei-Ming Dai. High-Level Bit-Serial Datapath Syn-
thesis for Multi-FPGA Systems. In Proceedings of the ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, pages 167–173. ACM, February
1995.
[IIF 95] Hiroyuki Igura, Masanori Izumikawa, Koichiro Furuta, Tohru Mogami, Tadahiko
Horiuchi, and Masakazu Yamashina. 100MHz, 0.55mm², 2mW, 16-b Stacked-CMOS
Multiplier-Accumulator. In Proceedings of the IEEE 1995 Custom Integrated
Circuits Conference, pages 597–600. IEEE, May 1995.
[IKM 94] Koichiro Ishibashi, Kunihiro Komiyaji, Sadayuki Morita, Toshiro Aoto, Shuji
Ikeda, Kyoichiro Asayama, Atsuyosi Koike, Toshiaki Yamanaka, Naotaka
Hashimoto, Haruhito Iida, Fumio Kojima, Koichi Motohashi, and Katsuro Sasaki.
A 12.5-ns 16-Mb CMOS SRAM with Common-Centroid-Geometry-Layout Sense
Ampliﬁers. IEEE Journal of Solid-State Circuits, 29(4):411 ff., April 1994.
[IYK 88] Michihiro Inoue, Toshio Yamada, Hisakazu Kotani, Hiroyuki Yamauchi, Atsushi
Fujiwara, Junko Matsushima, Hironori Akamatsu, Masanori Fukumoto, Masa-
fumi Kubota, Ichiro Nakao, Nobuo Aoi, Genshu Fuse, Shin-Ichi Ogawa, Shinji
Odanaka, Atsushi Ueno, and Hiroshi Yamamoto. A 16-Mbit DRAM with a Relaxed
Sense-Ampliﬁer-Pitch Open-Bit Line Architecture. IEEE Journal of Solid-State
Circuits, 23(5):1104–1112, October 1988.
[JF72] J. Robert Jump and Dennis R. Fritsche. Microprogrammed Arrays. IEEE Trans-
actions on Computers, 21(9):974–984, September 1972.
[JL95] David Jones and David Lewis. A Time-Multiplexed FPGA Architecture for Logic
Emulation. In Proceedings of the IEEE 1995 Custom Integrated Circuits Confer-
ence, pages 495–498. IEEE, May 1995.
[JOSV95] Chris Jones, John Oswald, Brian Schoner, and John Villasenor. Issues in Wireless
Video Coding using Run-time-reconﬁgurable FPGAs. In Peter Athanas and Ken
Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Com-
puting Machines, Los Alamitos, California, April 1995. IEEE Computer Society,
IEEE Computer Society Press.
[KAI 86] Yoshifumi Kobayashi, Kazutami Arimoto, Yuto Ikeda, Masahiro Hatanaka,
Koichiro Mashiko, Michihiro Yamada, and Takao Nakano. A High-Speed 64K×4
CMOS DRAM Using On-Chip Self-Timing Techniques. IEEE Journal of
Solid-State Circuits, 21(5):655–661, October 1986.
[KCE 85] Howard L. Kalter, Pierre D. Coppens, Wayne F. Ellis, John A. Fiﬁeld, Daryl J.
Kokoszka, Terry L. Leasure, Christopher P. Miller, Quan Nguyen, Ronald E.
Papritz, Charles S. Patton, J. Michael Poplawski, Jr., Steven W. Tomashot, and
Willem B. Van Der Hoeven. An Experimental 80-ns 1-Mbit DRAM with Fast
Page Operation. IEEE Journal of Solid-State Circuits, 20(5), October 1985.
[KDK 90] Yasuhiro Konishi, Katsumi Dosaka, Takahiro Komatsu, Yoshinori Inoue, Masaki
Kumanoya, Youichi Tobita, Hideki Genjyo, Masao Nagatomo, and Tsutomu Yoshihara.
A 38-ns 4-Mb DRAM with A Battery-Backup (BBU) Mode. IEEE Journal
of Solid-State Circuits, 25(5):1112–1117, October 1990.
[KDK 92] Toshiaki Kirihata, Sang Dhong, Koji Kitamura, Toshio Sunaga, Yasunao
Katayama, Roy Scheuerlein, Akashi Satoh, Yoshinori Sakaue, Kentaroh Tobi-
matsu, Koji Hosokawa, Takaki Saitoh, Takefumi Yoshikawa, Hideki Hashimoto,
and Michiya Kazusawa. A 14-ns 4-Mb DRAM with 300-mW Active Power. IEEE
Journal of Solid-State Circuits, 27(9):1222 ff., September 1992.
[KDS 96] Shinichi Kozu, Masayuki Daito, Yukinori Sugiyama, Hiroaki Suzuki, Hiroshi
Morita, Masahiro Nomura, Kouhei Nadehara, Souichiro Ishibuchi, Masako
Tokuda, Yoshihisa Inoue, Takashi Nakayama, Hisao Harigai, and Yoichi Yano.
A 100MHz, 0.4W Processor with 200MHz Multiply-Adder, using Pulse-Register
Technique. In 1996 IEEE International Solid-State Circuits Conference, Digest of
Technical Papers, pages 140–141. IEEE, February 1996.
[Kea89] Tom Kean. Conﬁgurable Logic: A Dynamically Programmable Cellular Architecture
and its VLSI Implementation. PhD thesis, University of Edinburgh, January
1989.
[KEK 85] Yasuo Kobayashi, Hirotsugo Eguchi, Osamu Kudoh, Toshio Hara, Hideyuki Ooka,
Isao Sasaki, Manabu Andoh, and Masato Tameda. A 10-µW Standby Power 256K
CMOS SRAM. IEEE Journal of Solid-State Circuits, 20(5):935–940, October
1985.
[KFM 85] Masaki Kumanoya, Kazuyasu Fujishima, Hideshi Miyatake, Yasumasa
Nishimura, Kazunori Saito, Takayuki Matsukawa, Tsutomu Yoshihara, and Takao
Nakano. A Reliable 1-Mbit DRAM with Multi-Bit-Test Mode. IEEE Journal of
Solid-State Circuits, 20(5), October 1985.
[KFO84] Robert A. Kertis, Kerlly J. Fitzpatrick, and Kul B. Ohri. A 60 ns 256K×1 Bit
DRAM Using 3 Technology and Double-Level Metal Interconnection. IEEE
Journal of Solid-State Circuits, 19(5):585–590, October 1984.
[KHANW94] Alan Y. Kwentus, Hing-Tsun Hung, and Alan N. Wilson, Jr. An Architecture for
High-Performance/Small-Area Multipliers for Use in Digital Filtering Applica-
tions. IEEE Journal of Solid-State Circuits, 29(2):117–121, February 1994.
[KHK 93] Goro Kitsukawa, Masashi Horiguchi, Yoshiki Kawajiri, Takayuki Kawahara,
Takesada Akiba, Yasushi Kawase, Toshikazu Tachibana, Takeshi Sakai, Masakazu
Aoki, Syoji Shukuri, Kazuhiko Sagara, Ryo Nagai, Yuzuru Ohji, Norio Hasegawa,
Natsuki Yokoyama, Teruaki Kisu, Hisaomi Yamashita, Tokuo Kure, and Takashi
Nishida. 256-Mb DRAM Circuit Technologies for File Applications. IEEE Journal
of Solid-State Circuits, 28(11):1105–1112, November 1993.
[KHN 96] Masuyoshi Kurokawa, Akihiko Hashiguchi, Ken’ichiro Nakamura, Hiroshi Okuda,
Koji Aoyama, Takao Yamazaki, Mitsuharu Ohki, Mitsuo Soneda, Katsunori
Seno, Ichiro Kumata, Masatoshi Aikawa, Hirokazu Hanaki, and Seiichiro Iwase.
5.4GOPS Linear Array Architecture DSP for Video-Format Conversion. In 1996
IEEE International Solid-State Circuits Conference, Digest of Technical Papers,
pages 254–255. IEEE, February 1996.
[KIK 86] Shinpei Kayano, Katsuki Ichinose, Yoshio Kohno, Hirofumi Shinohara, Kenji
Anami, Shuji Murakami, Tomohisa Wada, Yuji Kawai, and Yoichi Akasaka.
25-ns 256K×1/64K×4 CMOS SRAM’s. IEEE Journal of Solid-State Circuits,
21(5):686–691, October 1986.
[KK79] Steven I. Kartashev and Svetlana P. Kartashev. A Multicomputer System with Dynamic
Architecture. IEEE Transactions on Computers, 28(10):704–720, October
1979.
[KKHY88] Shoji Kawahito, Michitaka Kameyama, Tatsuo Higuchi, and Haruyaso Yamada. A
32×32-bit Multiplier Using Multiple-Valued MOS Current-Mode Circuits. IEEE
Journal of Solid-State Circuits, 23(1):124–132, February 1988.
[KNK 87] Kenji Kaneko, Tetsuya Nakagawa, Atsushi Kiuchi, Yoshimune Hagiwara, Hirotada
Ueda, and Hitoshi Matsushima. A 50ns DSP with Parallel Processing Architecture.
In 1987 IEEE International Solid-State Circuits Conference, Digest of Technical
Papers, pages 158–159. IEEE, February 1987.
[Knu71] Donald E. Knuth. Empirical Study of FORTRAN Programs. Software Practice and
Experience, 1(1):105–133, 1971.
[Knu81] Donald E. Knuth. The Art of Computer Programming, volume 2. Addison Wesley,
Reading, Massachusetts, 2nd edition, 1981.
[KOT 96] Hideyuki Kabuo, Minoru Okamoto, Isao Tanaka, Hiroyuki Yasoshima, Shinichi
Marui, Masayuki Yamasaki, Toshio Sugimura, Katsuhiko Ueda, Toshihiro
Ishikawa, Hidetoshi Suzuki, and Ryuichi Asahi. An 80-MOPS-Peak High-Speed
Low-Power Consumption 16-b Digital Signal Processor. IEEE Journal of Solid-
State Circuits, 31(4):494–503, April 1996.
[KSB 90] Howard Kalter, Charles Stapper, John Barth, Jr., John DiLorenzo, Charles Drake,
John Fiﬁeld, Gordon Kelly, Jr., Scott Lewis, Willem Van Der Hoeven, and James
Yankosky. A 50-ns 16-Mb DRAM with a 10-ns Data Rate and On-Chip ECC.
IEEE Journal of Solid-State Circuits, 25(5):1118 ff., October 1990.
[KSE 87] Katsutaka Kimura, Katsuhiro Shimohigashi, Jun Etoh, Masamichi Ishihara,
Kazuyuki Miyazawa, Shinji Shimizu, Yoshio Sakai, and Kunihiro Yagi. A 65-ns
4-Mbit CMOS DRAM with a Twisted Driveline Sense Ampliﬁer. IEEE Journal
of Solid-State Circuits, 22(5):651–656, October 1987.
[KSY 84] Hiroshi Kawamoto, Takashi Shinoda, Yasunori Yamaguchi, Shinji Shimizu, Kanji
Ohishi, Nobuyoshi Tanimura, and Tokumasa Yasui. A 288K CMOS Pseudostatic
RAM. IEEE Journal of Solid-State Circuits, 19(5):619–623, October 1984.
[KT93] Won Kim and Russ Tuck. MasPar MP-2 PE Chip: A Totally Cool Hot Chip.
In Proceedings of Hot Chips V, MasPar Computer Corporation, 749 North Mary
Avenue, Sunnyvale, CA 94086, August 1993.
[KTO 87] Takaaki Komatsu, Hitoshi Taniguchi, Nobumichi Okazaki, Toshiyuki Nishihara,
Shigeki Kayama, Naoya Hoshi, Jun-Ichi Aoyama, and Takashi Shimada. A 35-ns
128K×8 CMOS SRAM. IEEE Journal of Solid-State Circuits, 22(5):721–726,
October 1987.
[Kun82] H. T. Kung. Why Systolic Architectures? IEEE Computer, 15(1):37–46, January
1982.
[KWA 88] Yoshio Kohno, Tomohisa Wada, Kenji Anami, Yuji Kawai, Kojiro Yuzuriha,
Takayuki Matsukawa, and Shimpei Kayano. A 14-ns 1-Mbit CMOS SRAM with
Variable Bit Organization. IEEE Journal of Solid-State Circuits, 23(5):1060–1066,
October 1988.
[LBK 89] Nicky Lu, Gary Bronner, Koji Kitamura, Roy Scheuerlein, Walter Henkels, Sang
Dhong, Yasunao Katayama, Toshiaki Kirihata, Hideto Niijima, Robert Franch,
Wei Hwang, Motoo Nishiwaki, Frank Pesavento, T. V. Rajeevakumar, Yoshinori
Sakaue, Yasusuke Suzuki, Yasunori Iguchi, and Eiji Yano. A 22-ns 1-Mbit CMOS
High-Speed DRAM with Address Multiplexing. IEEE Journal of Solid-State
Circuits, 24(5):1198 ff., October 1989.
[LC95] Jianmin Li and Chung-Kuan Cheng. Routability Improvement Using Dynamic
Interconnect Architecture. In Peter Athanas and Ken Pocek, editors, Proceedings
of the IEEE Workshop on FPGAs for Custom Computing Machines, pages 61–67,
Los Alamitos, California, April 1995. IEEE Computer Society, IEEE Computer
Society Press.
[LCwH 88] Nicky Lu, Hu Chao, Wei Hwang, Walter Henkels, T. V. Rajeevakumar, Hussein
Hanaﬁ, Lewis Terman, and Robert Franch. A 20-ns 128-kbit×4 High-Speed
DRAM with 330-Mbit/s Data Rate. IEEE Journal of Solid-State Circuits,
23(5):1140 ff., October 1988.
[LE94] Marianne E. Louie and Milos D. Ercegovac. A Variable Precision Multiplier for
Field Programmable Gate Arrays. In Second International ACM/SIGDA Workshop
on Field-Programmable Gate Arrays. ACM, February 1994. proceedings not
available outside of the workshop.
[LE96] Per Larsson-Edefors. A 965-Mb/s 1.0-µm Standard CMOS Twin-Pipe
Serial/Parallel Multiplier. IEEE Journal of Solid-State Circuits, 31(2):230–239,
February 1996.
[Lei79] Charles Leiserson. Systolic Priority Queues. CMU-CS-TR 115, Carnegie-Mellon
University, Pittsburgh, Pennsylvania 15213, April 1979.
[Lev77] Lance Leventhal. Cut Your Processor’s Computation Time. Electronic Design,
25(17):82–88, August 16 1977.
[LGC84] Claude P. Lerouge, Pierre Girard, and Joël S. Colardelle. A Fast 16 Bit Parallel
Multiplier. IEEE Journal of Solid-State Circuits, 19(3):338–342, June 1984.
[LGS87] Joseph Y. Lee, Hugh L. Garvin, and Charles W. Slayman. A High-Speed
High-Density Silicon 8×8-bit Parallel Multiplier. IEEE Journal of Solid-State Circuits,
22(1):35–40, February 1987.
[LLNK96] Jon Lotz, Gregg Lesartre, Samuel Naffziger, and Don Kipp. A Quad-Issue
Out-of-Order RISC CPU. In 1996 IEEE International Solid-State Circuits Conference,
Digest of Technical Papers, pages 210–211. IEEE, February 1996.
[LR71] B. S. Landman and R. L. Russo. On Pin Versus Block Relationship for Partitions
of Logic Circuits. IEEE Transactions on Computers, 20:1469–1479, 1971.
[LRSS84] Chris Lutz, Steve Rabin, Chuck Seitz, and Don Speck. Design of the MOSAIC
Element. In Paul Penﬁeld, Jr., editor, Proceedings, Conference on Advanced
Research in VLSI, pages 1–10, Cambridge, MA, January 1984.
[LS90] Junien Labrousse and Gerrit Slavenburg. A 50MHz Microprocessor with a Very
Long Instruction Word Architecture. In 1990 IEEE International Solid-State Cir-
cuits Conference, Digest of Technical Papers, pages 44–45. IEEE, February 1990.
[LS92] Joe Laskowski and Henry Samueli. A 150-MHz 43-Tap Half-Band FIR Digital
Filter in 1.2-µm CMOS Generated by Silicon Compiler. In Proceedings of the
IEEE 1992 Custom Integrated Circuits Conference, pages 11.4.1–11.4.4. IEEE,
May 1992.
[LS93] Fang Lu and Henry Samueli. A 200-MHz CMOS Pipelined Multiplier-
Accumulator Using a Quasi-Domino Dynamic Full-Adder Cell Design. IEEE
Journal of Solid-State Circuits, 28(2):123–132, February 1993.
[Mal94] Lisa Maliniak. Hardware Emulation Draws Speed From Innovative 3D Parallel
Processing Based on Custom ICs. Electronic Design, pages 38–41, May 30 1994.
[MD96] Ethan Mirsky and André DeHon. MATRIX: A Reconﬁgurable Computing
Architecture with Conﬁgurable Instruction Distribution and Deployable Resources.
In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Ma-
chines, April 1996. Anonymous FTP transit.ai.mit.edu:papers/
matrix-fccm96.ps.Z.
[Min67] Robert C. Minnick. A Survey of Microcellular Research. Journal of the ACM,
14(2):203–241, April 1967.
[Min71] Robert Minnick. A Programmable Cellular Array. In Fifth Annual IEEE Interna-
tional Computer Society Conference: Hardware Software Firmware Trade-Offs,
pages 25–26. IEEE, September 1971.
[Mir96] Ethan Mirsky. Coarse-Grain Reconﬁgurable Computer. Master’s thesis,
Massachusetts Institute of Technology, 545 Technology Sq., Cambridge, MA
02139, June 1996. Anonymous FTP transit.ai.mit.edu:papers/
eamirsky-matrix-meng.ps.Z.
[MKM 84] Koichiro Mashiko, Toshifumi Kobayashi, Hiroshi Miyamoto, Kazutami Arimoto,
Yoshikazu Morooka, Masahiro Hatanaka, Michihiro Yamada, and Takao Nakano.
A 70 ns 256K DRAM with Bit-Line Shield. IEEE Journal of Solid-State Circuits,
19(5), October 1984.
[MKS 84] Amr Mohsen, Roger I. Kung, Carl J. Simonsen, Joseph Schutz, Paul D. Madland,
Esmatz Z. Hamdy, and Mark T. Bohr. The Design and Performance of CMOS
256K Bit DRAM Devices. IEEE Journal of Solid-State Circuits, 19(5):610–620,
October 1984.
[MKS 92] Masato Matsumiya, Shoichiro Kawashima, Makoto Sakata, Masahiko Ookura,
Toru Miyabo, Toru Koga, Kazuo Itabashi, Kazuhiro Mizutani, Hiroshi Shimada,
and Noriyuki Suzuki. A 3.3-V 12-ns 16-Mb SRAM. IEEE Journal of Solid-State
Circuits, 27(11):1497–1503, November 1992.
[MMK 89] Fumio Miyaji, Yasushi Matsuyama, Yoshikazu Kanaishi, Katsunori Senoh, Takashi
Emori, and Yoshiaki Hagiwara. A 25-ns 4-Mbit CMOS SRAM with Dynamic
Bit-Line Loads. IEEE Journal of Solid-State Circuits, 24(5):1213–1218, October
1989.
[MMM 91] Shigeru Mori, Hiroshi Miyamoto, Yoshikazu Morooka, Shigeru Kikuda, Makoto
Suwa, Mitsuya Kinoshita, Atsushi Hachisuka, Hideaki Arima, Michihiro Yamada,
Tsutomu Yoshihara, and Shimpei Kayano. A 45-ns 64-Mb DRAM with a Merged
Match-Line Test Architecture. IEEE Journal of Solid-State Circuits, 26(11):1486–1492,
November 1991.
[MMN 90] Jiro Miyake, Toshinori Maeda, Yoshito Nishimichi, Joji Katsura, Takashi
Taniguchi, Seiji Yamaguchi, Hisakazu Edamatsu, Shigeru Watari, Yoshiyuki Takagi,
Kazuhiko Tsuji, Shigeo Kuninobu, Steve Cox, Douglas Duschatko, and Douglas
MacGregor. A 40 MIPS (Peak) 64-bit Microprocessor with One-Clock Physical
Cache Load/Store. In 1990 IEEE International Solid-State Circuits Conference,
Digest of Technical Papers, pages 42–43. IEEE, February 1990.
[MMS 84] Osamu Minato, Toshiaki Masuhara, Toshio Sasaki, Keizo Matsumoto, Yoshio
Sakai, and Tetsuya Hayashida. A 20 ns 64K CMOS Static RAM. IEEE Journal
of Solid-State Circuits, 19(6), October 1984.
[MNA 87] Koichiro Mashiko, Masao Nagatomo, Kazutami Arimoto, Yoshio Matsuda, Kiyohiro
Furutani, Takayuki Matsukawa, Michihiro Yamada, Tsutomu Yoshihara, and
Takao Nakano. A 4-Mbit DRAM with Folded-Bit-Line Adaptive Sidewall-Isolated
Capacitor (FASIC) Cell. IEEE Journal of Solid-State Circuits, 22(5):643–650,
October 1987.
[MNH 91] Junji Mori, Masato Nagamatsu, Masashi Hirano, Shigeru Tanaka, Makoto Noda,
Yoshiaki Yoyoshima, Kazuhiro Hashimoto, Hiroyuki Hayashida, and Kenji
Maeguchi. A 10-ns 54×54-b Parallel Structured Full Array Multiplier with 0.5 µm
CMOS Technology. IEEE Journal of Solid-State Circuits, 26(4):600–606, April
1991.
[MNS 96] Hiroshi Makino, Yasunobu Nakase, Hiroaki Suzuki, Hiroyuki Morinaka, Hirofumi
Shinohara, and Koichiro Mashiko. An 8.8-ns 54×54-Bit Multiplier with
High Speed Redundant Binary Architecture. IEEE Journal of Solid-State Circuits,
31(6):773–783, June 1996.
[MOT 87] Masataka Matsui, Takayuki Ohtani, Jun-Ichi Tsujimoto, Hiroshi Iwai, Azuma
Suzuki, Katsuhiko Sato, Mitsuo Isobe, Kazuhiko Hashimoto, Mitsuchika Saitoh,
Hideki Shibata, Hisayo Sasaki, Tadashi Matsuno, Jun-Ichi Matsunaga, and Tetsuya
Iizuka. A 25-ns 1-Mbit CMOS SRAM with Loading-Free Bit Lines. IEEE Journal
of Solid-State Circuits, 22(5):733–740, October 1987.
[MSM 84] Jun-Ichi Miyamoto, Shinji Saito, Hiroshi Momose, Hideki Shibata, Koichi Kan-
zaki, and Tetsuya Iizuka. A High-Speed 64K CMOS RAM with Bipolar Sense
Ampliﬁers. IEEE Journal of Solid-State Circuits, 19(5):557–564, October 1984.
[MWA 96] James Montanaro, Richard Witek, Krishna Anne, Andrew Black, Elizabeth Cooper,
Dan Dobberpuhl, Paul Donahure, Jim Eno, Alejandro Farell, Gregory Hoeppner,
David Kruckemyer, Thomas Lee, Peter Lin, Liam Madden, Daniel Murray, Mark
Pearce, Sribalan Santhanam, Kathryn Snyder, Ray Stephany, and Stephen Thierauf.
A 160MHz 32b 0.5W CMOS RISC Microprocessor. In 1996 IEEE International
Solid-State Circuits Conference, Digest of Technical Papers, pages 210–211. IEEE,
February 1996.
[MYM 87] Hiroshi Miyamoto, Tadato Yamagata, Shigeru Mori, Toshifumi Kobayashi, Shin-Ichi
Satoh, and Michihiro Yamada. A Fast 256K×4 CMOS DRAM with Distributed
Sense and Unique Restore Circuit. IEEE Journal of Solid-State Circuits,
22(5):861–867, October 1987.
[MYO 96] Hiroaki Murakami, Naoka Yano, Yukio Ootaguro, Yukio Sugeno, Maki Ueno,
Yukinori Muroya, and Tsuneo Aramaki. A Multiplier-Accumulator Macro for
a 45 MIPS Embedded RISC Processor. IEEE Journal of Solid-State Circuits,
31(7):1067–1071, July 1996.
[NHK95] Kouhei Nadehara, Miwako Hayashida, and Ichiro Kuroda. A Low-Power, 32-bit
RISC Processor with Signal Processing Capability and its Multiply-Adder, volume
VIII of VLSI Signal Processing, pages 51–60. IEEE, 1995.
[Nic90] John Nickolls. The Design of the MasPar MP-1: A Cost Effective Massively
Parallel Computer. In Compcon Spring 90, pages 25–28. IEEE, 1990.
[NNO 91] Takeshi Nagai, Kenji Numata, Masaki Ogihara, Mitsuru Shimizu, Kimimasa
Imai, Takahiko Hara, Munehiro Yoshida, Yoshikazu Saito, Yoshiaki Asao, Shizuo
Sawada, and Syuso Fujii. A 17-ns 4-Mb DRAM. IEEE Journal of Solid-State
Circuits, 26(11):1538 ff., November 1991.
[NSLKE86] Tobias G. Noll, Doris Schmitt-Landsiedel, Heinrich Klar, and Gerhard Enders. A
Pipelined 300-MHz Multiplier. IEEE Journal of Solid-State Circuits, 21(3):411–
416, June 1986.
[NSS 86] Kazutaka Nogami, Takayasu Sakurai, Kazuhiro Sawada, Tetsunori Wada, Katsuhiko
Sato, Mitsuo Isobe, Masakazu Kakumu, Shigeru Morita, Shunji Yokogawa,
Masaaki Kinugawa, Tetsuya Asami, Kazuhiko Hashimoto, Jun-Ichi Matsunaga,
Hiroshi Nozawa, and Tetsuya Iizuka. 1-Mbit Virtually Static RAM. IEEE Journal
of Solid-State Circuits, 21(5):662–668, October 1986.
[NTT 91] Yoshinobu Nakagome, Hitoshi Tanaka, Kan Takeuchi, Eiji Kume, Yasushi Watanabe,
Toru Kaga, Yoshifumi Kawamoto, Fumio Murai, Ryuichi Izawa, Digh
Hisamoto, Teruaki Kisu, Takashi Nishida, Eiji Takeda, and Kiyoo Itoh. An Experimental
1.5-V 64-Mb DRAM. IEEE Journal of Solid-State Circuits, 26(4):465 ff.,
April 1991.
[Nut77] Gary J. Nutt. Microprocessor Implementation of a Parallel Processor. In Proceed-
ings of the Fourth Annual International Symposium on Computer Architecture,
pages 147–152. ACM, 1977.
[OFW 87] Takashi Ohsawa, Tohru Furuyama, Yohji Watanabe, Hiroto Tanaka, Natsuki
Kushiyama, Kenji Tsuchida, Yohsei Nagahama, Satoshi Yamano, Takeshi Tanaka,
Satoshi Shinozaki, and Kenji Natori. A 60-ns 4-Mbit CMOS DRAM with Built-In
Self-Test Function. IEEE Journal of Solid-State Circuits, 22(5):663–668, October
1987.
[OHK 90] Takayuki Ootani, Shigeyuki Hayakawa, Masakazu Kakumu, Akira Aono, Masaaki
Kinugawa, Hideki Takeuchi, Kazuhiro Noguchi, Tomoaki Yabe, Katsuhiko Sato,
Kenji Maeguchi, and Kiyofumi Ochii. A 4-Mb CMOS SRAM with a PMOS
Thin-Film-Transistor Load Cell. IEEE Journal of Solid-State Circuits, 25(5):1082–1091,
October 1990.
[OKH 84] Nobumichi Okazaki, Takaaki Komatsu, Naoya Hoshi, Kunihiko Tsuboi, and
Takashi Shimada. A 16 ns 2K×8 Full CMOS SRAM. IEEE Journal of
Solid-State Circuits, 19(5):552–556, October 1984.
[ONN 88] Hiroaki Okuyama, Takeshi Nakano, Shuichi Nishida, Etsuro Aono, Hisahiro Satoh,
and Shigeru Arita. A 7.5-ns 32K×8 CMOS SRAM. IEEE Journal of Solid-State
Circuits, 23(5):1054–1059, October 1988.
[OSS 95] Norio Ohkubo, Makoto Suzuki, Toshinobu Shinbo, Toshiaki Yamanaka, Akihiro
Shimizu, Katsuro Sasaki, and Yoshinobu Nakagome. A 4.4 ns CMOS 54×54-b
Multiplier Using Pass-Transistor Multiplexer. IEEE Journal of Solid-State Circuits,
30(3):251–257, February 1995.
[OTW 91] Yukihito Oowaki, Kenji Tsuchida, Yohji Watanabe, Daisaburo Takashima,
Masako Ohita, Hiroaki Nakano, Shigeyoshi Watanabe, Akihiro Nitayama, Fu-
mio Horiguchi, Kazunori Ohuchi, and Fujio Masuoka. A 33-ns 64-Mb DRAM.
IEEE Journal of Solid-State Circuits, 26(11):1498–1505, November 1991.
[Ple90] Plessey Semiconductors, Cheney Manor, Swindon, Wiltshire SN2 2QW, UK.
ERA60100 Datasheet – Electrically Reconﬁgurable Array, May 1990.
[PML 89] A. Picco, J. C. Michalina, B. Laurier, D. Fuin, P. Menut, and JL. Laborie. The
ST18940/41: An Advanced Single-chip Digital Signal Processors. In Proceedings
of the 1989 IEEE International Symposium on Circuits and Systems, pages 1559–
1562. IEEE, May 1989.
[QC88] Le Quach and Richard Chueh. CMOS Gate Array Implementation of SPARC. In
Digest of Papers COMPCON’88, pages 14–17. IEEE, February 1988.
[Ram93] Rambus Inc. Architectural Overview. Product Literature, 1993. Rambus Inc.,
2465 Latham Street, Mountain View, CA 94040.
[Raz94] Rahul Razdan. PRISC: Programmable Reduced Instruction Set Computers. PhD
thesis, Harvard University, May 1994. Anonymous FTP
ftp.eecs.harvard.edu:users/smith/theses/razdan-thesis.tar.gz.
[RB91] Jonathan Rose and Stephen Brown. Flexibility of Interconnection Structures
for Field-Programmable Gate Arrays. IEEE Journal of Solid-State Circuits,
26(3):277–282, March 1991.
[RDB 94] Ehsan Rashid, Eric Delano, Michael Buckley, Jason Zheng, Francis Schumacher,
Gordon Kurpanek, John Shelton, Tom Alexander, Nazeem Noordeen, Mark Lud-
wig, Alisa Scherer, Chaim Amir, Dan Cheung, Prasad Sabada, Ram Rajamani,
Nick Fiduccia, Bill Ches, Kamyar Eshghi, Fred Eatock, Denny Renfrow, John
Keller, Paul Ilgenfrizt, Ilan Krashinsky, Darryl Weatherspoon, Shrikant Ranade,
Dave Goldberg, and William Byrg. A CMOS RISC CPU with On-Chip Parallel
Cache. In 1994 IEEE International Solid-State Circuits Conference, Digest of
Technical Papers, pages 210–211. IEEE, February 1994.
[RFLC90] Jonathan Rose, Robert Francis, David Lewis, and Paul Chow. Architecture of
Field-Programmable Gate Arrays: The Effect of Logic Block Functionality on
Area Efﬁciency. IEEE Journal of Solid-State Circuits, 25(5):1217–1225, October
1990.
[RK92] Dirk Reuver and Heinrich Klar. A Conﬁgurable Convolution Chip with Pro-
grammable Coefﬁcients. IEEE Journal of Solid-State Circuits, 27(7):1121–1123,
July 1992.
[RPJ 84] Christopher Rowen, Steven Przybylski, Norman Jouppi, Thomas Gross, John Shott,
and John Hennessy. A Pipelined 32b NMOS Microprocessor. In 1984 IEEE
International Solid-State Circuits Conference, Digest of Technical Papers, pages
180–181. IEEE, February 1984.
[RS92] Poornachandra B. Rao and Alexander Skavantzos. New Multiplier Designs Based
on Squared Law Algorithms and Table Look-ups. In Conference Record of the
Twenty-Sixth Asilomar Conference on Signals, Systems and Computers (volume
2), pages 686–690, October 1992.
[RS94] Rahul Razdan and Michael D. Smith. A High-Performance Microarchitecture
with Hardware-Programmable Functional Units. In Proceedings of the 27th An-
nual International Symposium on Microarchitecture, pages 172–180. IEEE Com-
puter Society, November 1994. Anonymous FTP ftp.eecs.harvard.edu:
users/smith/papers/micro94.ps.gz.
[RSV87] R. Rudell and A. Sangiovanni-Vincentelli. Multiple-Valued Minimization for
PLA Optimization. IEEE Transactions on Computer-Aided Design of Integrated
Circuits, 6(5):727–751, September 1987.
[Rue89] Peter Ruetz. The Architectures and Design of a 20-MHz Real-Time DSP Chip Set.
IEEE Journal of Solid-State Circuits, 24(2):338–348, April 1989.
[SA90] Chip Sterns and Peng Ang. Yet Another Multiplier Architecture. In Proceedings
of the IEEE 1990 Custom Integrated Circuits Conference, pages 24.6.1–4. IEEE,
May 1990.
[SAI 85] Hirofumi Shinohara, Kenji Anami, Katsuki Ichinose, Tomohisa Wada, Yoshio
Kohno, Yuji Kawai, Yoichi Akasaka, and Shinpei Kayano. A 45-ns 256K CMOS
Static RAM with Tri-Level Word Line. IEEE Journal of Solid-State Circuits, 20(5),
October 1985.
[Sch71] Mario R. Schaffner. A System with Programmable Hardware. In Fifth Annual
IEEE International Computer Society Conference: Hardware Software Firmware
Trade-Offs, pages 17–18. IEEE, September 1971.
[Sch78] Mario R. Schaffner. Processing by Data and Program Blocks. IEEE Transactions
on Computers, 27(11):1015–1028, November 1978.
[SCLB84] Stanley E. Schuster, Barbara Chappell, Victor Di Lonardo, and Peter E. Britton. A
20 ns 64K (4K×16) NMOS RAM. IEEE Journal of Solid-State Circuits, 19(5),
October 1984.
[Sei92] Charles L. Seitz. Mosaic C: An Experimental Fine-Grain Multicomputer. In
A. Bensoussan and J.-P. Verjus, editors, Future Tendencies in Computer Science,
Control and Applied Mathematics: International Conference on the Occasion of
the 25th Anniversary of INRIA, pages 69–85. Springer-Verlag, December 1992.
[Seo94] Soon Ong Seo. A High Speed Field-Programmable Gate Array Using Pro-
grammable Minitiles. Master’s thesis, University of Toronto, Ontario, Canada,
1994.
[SFO 85] Shozo Saito, Syuso Fujii, Yoshio Okada, Shizuo Sawada, Satoshi Shinozaki, Kenji
Natori, and Osamo Ozawa. A 1-Mbit CMOS DRAM with Fast Page Mode and
Static Column Mode. IEEE Journal of Solid-State Circuits, 20(5), October 1985.
[SGS 85] Lal C. Sood, James S. Golab, John Salter, John E. Leiss, and John J. Barnes. A
Fast 8K×8 CMOS SRAM With Internal Power Down Design Techniques. IEEE
Journal of Solid-State Circuits, 20(5):941–950, October 1985.
[SH89] Mark R. Santoro and Mark A. Horowitz. SPIM: A Pipelined 64×64-bit Iterative
Multiplier. IEEE Journal of Solid-State Circuits, 24(2):487–493, April 1989.
[SHU 88] Katsuro Sasaki, Shoji Hanamura, Kiyotsugo Ueda, Takao Oono, Osamu Mi-
nato, Yoshio Sakai, Satoshi Meguro, Masayoshi Tsunematsu, Toshiaki Masuhara,
Masaaki Kubotera, and Hiroshi Toyoshima. A 15-ns 1-Mbit CMOS SRAM. IEEE
Journal of Solid-State Circuits, 23(5):1067–1073, October 1988.
[SIS 90] Katsuro Sasaki, Koichiro Ishibashi, Katsuhiro Shimohigashi, Toshiaki Yamanaka,
Nobuyuki Moriwaki, Shigeru Honjo, Shuji Ikeda, Atsuyoshi Koike, Satoshi Meguro,
and Osamu Minato. A 23-ns 4-Mb CMOS SRAM with 0.2-µA Standby
Current. IEEE Journal of Solid-State Circuits, 25(5):1075–1081, October 1990.
[SIU 92] Katsuro Sasaki, Koichiro Ishibashi, Kiyotsugo Ueda, Kunihiro Komiyaji, Toshiaki
Yamanaka, Naotaka Hashimoto, Hiroshi Toyoshima, Fumio Kojima, and Akihiro
Shimizu. A 7-ns 140-mW 1-Mb CMOS SRAM with Current Sense Ampliﬁer.
IEEE Journal of Solid-State Circuits, 27(11):1511–1518, November 1992.
[SIY 89] Katsuro Sasaki, Koichiro Ishibashi, Toshiaki Yamanaka, Naotaka Hashimoto,
Takashi Nishida, Katsuhiro Shimohigashi, Shoji Hanamura, and Shigeru Honjo.
A 9-ns 1-Mbit CMOS SRAM. IEEE Journal of Solid-State Circuits, 24(5):1219–
1225, October 1989.
[SJ88] Naresh R. Shanbhag and Pushkal Juneja. Parallel Implementation of a 4×4
Multiplier Using Modiﬁed Booth’s Algorithm. IEEE Journal of Solid-State Circuits,
23(4):1010–1013, August 1988.
[SKI 88] Hiroshi Shimada, Shoichiro Kawashima, Hideo Itoh, Noriyuki Suzuki, and Takashi
Yabu. A 45-ns 1-Mbit CMOS SRAM. IEEE Journal of Solid-State Circuits,
23(1):53–58, February 1988.
[SKK 91] Katsuyuki Sato, Kanehide Kenmizaki, Shoji Kubono, Toshio Mochizuki,
Hidetomo Aoyagi, Michitaro Kanamitsu, Soichi Kunito, Hiroyuki Uchida, Yoshi-
hiko Yasu, Atsushi Ogishima, Sho Sano, and Hiroshi Kawamoto. A 4-Mb Pseudo
SRAM Operating at 2.6±1V with 3-µA Data Retention Current. IEEE Journal of
Solid-State Circuits, 26(11):1556–1561, November 1991.
[SKPS84] Robert Sherburne, Jr., Manolis Katevenis, David Patterson, and Carlo Sequin. A
32b NMOS Microprocessor with a Large Register File. In 1984 IEEE International
Solid-State Circuits Conference, Digest of Technical Papers, pages 168–169. IEEE,
February 1984.
[SKS 93] Katsunori Seno, Kurt Knorpp, Lee-Lean Shu, Naoki Teshima, Hiroki Kihara, Hiroshi
Sato, Fumio Miyaji, Minoru Takeda, Masayoshi Sasaki, Yoichi Tomo, Patrick
Chuang, and Kazuyoshi Kobayashi. A 9-ns 16-Mb CMOS SRAM with
Offset-Compensated Current Sense Ampliﬁer. IEEE Journal of Solid-State Circuits,
28(11):1119–1124, November 1993.
[SKYH92] M. Shiraishi, M. Koizumi, A. Yamaguchi, and H. Hoike. User Programmable
16Bit 50ns DSP. In Proceedings of the IEEE 1992 Custom Integrated Circuits
Conference, pages 6.4.1–6.4.4. IEEE, May 1992.
[Sla95] Michael Slater. MicroUnity Lifts Veil on MediaProcessor. Microprocessor Report,
9(14):11ff., October 23 1995. http://www.chipanalyst.com/report/
report9_14/page11.html.
[SLM 89] Ramautar Sharma, Alexander D. Lopez, John A. Michejda, Steven J. Hillenius,
John M. Andrews, and Arnold J. Studwell. A 6.75-ns 16×16-bit Multiplier in
Single-Level-Metal CMOS Technology. IEEE Journal of Solid-State Circuits,
24(4):922–927, August 1989.
[SMI 84] Takayasu Sakurai, Junichi Matsunaga, Mitsuo Isobe, Takayuki Ohtani, Kazuhiro
Sawada, Akira Aono, Hiroshi Nozawa, Tetsuya Iizuka, and Susumu Kohyama. A
Low Power 46 ns 256 kbit CMOS Static RAM with Dynamic Double Word Line.
IEEE Journal of Solid-State Circuits, 19(5):578–584, October 1984.
[SMK 94] Toshio Sunaga, Hisatada Miyatake, Koji Kitamura, Keishi Kasuya, Takaki Saitoh,
Masahiro Tanaka, Norio Tanigaki, Yohtaro Mori, and Noritoshi Yamasaki. DRAM
Macros for ASIC Chips. IEEE Journal of Solid-State Circuits, 30(9):1006–1014,
September 1994.
[SNT 84] Shun’ishi Suzuki, Masumi Nakao, Toshio Takeshima, Masaaki Yoshida, Masanori
Kikuchi, Kunio Nakamura, Takeshi Mizukami, and Masayuki Yanagisawa. A
128K×8 Bit Dynamic RAM. IEEE Journal of Solid-State Circuits, 19(5):624–626,
October 1984.
[Sny85] Lawrence Snyder. An Inquiry into the Beneﬁts of Multigauge Parallel Computation.
In Proceedings of the 1985 International Conference on Parallel Processing,
pages 488–492. IEEE, August 1985.
[SPA 95] Gene Shen, Niteen Patkar, Hisashige Ando, David Chang, Charles Chen, Chien
Chen, Frank Chen, Per Forssell, John Gmuender, Takeshi Kitahara, Hungwen Li,
David Lyon, Robert Montoye, Leon Peng, Sunil Savkar, Jonathan Sherred, Mike
Simone, Ravi Swami, DeFroset Tovey, and Ted Williams. A 64b 4-Issue
Out-of-Order Execution RISC Processor. In 1995 IEEE International Solid-State Circuits
Conference, Digest of Technical Papers, pages 170–171. IEEE, February 1995.
[SSL 92] Ellen M. Sentovich, Kanwar Jit Singh, Luciano Lavagno, Cho Moon, Rajeev Murgai,
Alexander Saldanha, Hamid Savoj, Paul R. Stephan, Robert K. Brayton, and
Alberto Sangiovanni-Vincentelli. SIS: A System for Sequential Circuit Synthesis.
UCB/ERL M92/41, University of California, Berkeley, Department of Electrical
EngineeringandComputerScience,UniversityofCalifornia,Berkeley, CA94720,
May 1992.
[SSN 92] Akinori Sekiyama, Teruo Seki, Shinji Nagai, Akihiro Iwase, Noriyuki Suzuki, and
Masato Hayasaka. A 1-V Operating 256-kb Full-CMOS SRAM. IEEE Journal of
Solid-State Circuits, 27(5):776–782, May 1992.
[STN 93] Tadahiko Sugibayashi, Toshio Takeshima, Isao Naritake, Tatsuya Matano, Hiroshi
Takada, Yoshiharu Aimoto, Koichiro Furuta, Mamoru Fujita, Takanori Saeki,
Hiroshi Sugawara, Tatsunori Murotani, Naoki Kasai, Kentaro Shibahara, Ken
Nakajima, Hiromitsu Hada, Takehiko Hamada, Naoaki Aizaki, Takemitsu Kunio,
Eiichiro Kakehashi, Katsuhiro Masumori, and Takaho Tanigawa. A 30-ns 256-Mb
DRAM with a Multidivided Array Structure. IEEE Journal of Solid-State Circuits,
28(11):1092–1098, November 1993.
[STT 88] Hiroshi Shimada, Yoshinao Tange, Kazuo Tanimoto, Michio Shiraishi, Noriyuki
Suzuki, and Toshio Nomura. An 18-ns 1-Mbit CMOS SRAM. IEEE Journal of
Solid-State Circuits, 23(5):1073–1077, October 1988.
[SUT 93] Katsuro Sasaki, Kiyotsugu Ueda, Koichi Takasugi, Hiroshi Toyoshima, Koichiro
Ishibashi, Toshiaki Yamanaka, Naotaka Hashimoto, and Nagatoshi Ohki. A 16-Mb
CMOS SRAM with a 2.3 μm² Single-Bit-Line Memory Cell. IEEE Journal of
Solid-State Circuits, 28(11):1125–1130, November 1993.
[SV93] Dinesh Somasekhar and V. Visvanathan. A 230-MHz Half-Bit Level Pipelined
Multiplier Using True Single-Phase Clocking. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 1(4):415–422, December 1993.
[SYN 94] Kazumasa Suzuki, Masakazu Yamashina, Takashi Nakayama, Masanori Izu-
mikawa, Masahiro Nomura, Hiroyuki Igura, Hideki Heiuchi, Junichi Goto, Toshi-
aki Inoue, Youichi Koseki, Hitoshi Abiko, Kazuhiro Okabe, Atsuki Ono, Youichi
Yano, and Hachiro Yamada. A 500MHz 32b 0.4 μm CMOS RISC Processor LSI.
In 1994 IEEE International Solid-State Circuits Conference, Digest of Technical
Papers, pages 214–215. IEEE, February 1994.
[TEC 95] Edward Tau, Ian Eslick, Derrick Chen, Jeremy Brown, and André DeHon. A First
Generation DPGA Implementation. In Proceedings of the Third Canadian Work-
shop on Field-Programmable Devices, pages 138–143, May 1995. Anonymous
FTP transit.ai.mit.edu:papers/dpga-proto-fpd95.ps.Z.
[TFT 85] Yoshihisa Takayama, Shigeru Fujii, Tomoaki Tanabe, Kazuyuki Kawauchi, and
Toshihiko Yoshida. A 1ns 20K CMOS Gate Array Series with Conﬁgurable 15ns
12K Memory. In 1985 IEEE International Solid-State Circuits Conference, Digest
of Technical Papers, pages 196–197. IEEE, February 1985.
[TJ85] Ronald T. Taylor and Mark G. Johnson. A 1-Mbit CMOS Dynamic RAM with a
Divided Bitline Matrix Architecture. IEEE Journal of Solid-State Circuits, 20(5),
October 1985.
[TLB 90] Darius Tanksalvala, Joel Lamb, Michael Buckley, Bruce Long, Sean Chapin,
Jonathan Lotz, Eric Delano, Richard Luebs, Keith Erskine, Scott McMullen,
Mark Forsyth, Robert Novak, Tony Gaddis, Doug Quarnstrom, Craig Gleason,
Ehsan Rashid, Daniel Halperin, Leon Sigal, Harlan Hill, Craig Simpson, David
Hollenbeck, John Spencer, Robert Horning, Hoang Tran, Thomas Hotchkiss, Duncan
Weir, Donald Kipp, John Wheeler, Patrick Knebel, Jeffery Yetter, and Charles
Kohlhardt. A 15 MIPS 32b Microprocessor. In 1990 IEEE International Solid-State
Circuits Conference, Digest of Technical Papers, pages 52–53. IEEE, February
1990.
[TNH 96] Toshinari Takayanagi, Kazutaka Nogami, Fumitoshi Hatori, Naoyuki Hatanaka,
Makoto Takahashi, Makoto Ichida, Shinji Kitabayashi, Tatsuya Higashi, Mike
Klein, John Thomson, Roger Carpenter, Ravi Donthi, Denny Renfrow, Jason
Zheng, Liane Tinkey, Brandi Maness, Jim Battle, Steve Purcell, and Takayasu
Sakurai. 350MHz Time-Multiplexed 8-port SRAM and Word-Size Variable Multiplier
for Multimedia DSP. In 1996 IEEE International Solid-State Circuits Conference,
Digest of Technical Papers, pages 150–151. IEEE, February 1996.
[TNK 94] Yasuhiro Takai, Mamoru Nagase, Mamoru Kitamura, Yasuji Koshikawa, Naoyuki
Yoshida, Yasuaki Kobayashi, Takashi Obara, Yukio Fukuzo, and Hiroshi Watanabe.
250 Mbytes/s Synchronous DRAM Using a 3-Stage-Pipelined Architecture. IEEE
Journal of Solid-State Circuits, 29(4):426–431, April 1994.
[TTK 90] Toshio Takeshima, Masahide Takada, Hiroki Koike, Hiroshi Watanabe, Shigeru
Koshimaru, Kenjiro Mitake, Wataru Kikuchi, Takaho Tanigawa, Tatsunori
Murotani, Kenji Noda, Kazuhiro Tasaka, Koji Yamanaka, and Kuniaki Koyama. A
55-ns 16-Mb DRAM with Built-in Self-Test Function Using Microprogram ROM.
IEEE Journal of Solid-State Circuits, 25(4):903–910, August 1990.
[TTS 86] Masahide Takada, Toshio Takeshima, Mitsuru Sakamoto, Toshiyuki Shimizu,
Hitoshi Abiko, Takuya Katoh, Masanori Kikuchi, Sakari Takahashi, Yoshinori
Sato, and Yasukazu Inoue. A 4-Mbit DRAM with Half-Internal-Voltage Bit-Line
Precharge. IEEE Journal of Solid-State Circuits, 21(5), October 1986.
[TTT 94] Satoru Tanoi, Yasuhiro Tanaka, Tetsuya Tanabe, Akio Kita, Toshio Inada, Ryoji
Hamazaki, Yoshio Ohtsuki, and Masaru Uesugi. A 32-Bank 256-Mb DRAM
with Cache and TAG. IEEE Journal of Solid-State Circuits, 29(11):1330–1336,
November 1994.
[TTU 91] Masao Taguchi, Hiroyoshi Tomita, Toshiya Uchida, Yasuhiro Ohnishi, Kimiaki
Sato, Taiji Ema, Masaaki Higashitani, and Takashi Yabu. A 40-ns 64-Mb DRAM
with 64-b Parallel Data Bus Architecture. IEEE Journal of Solid-State Circuits,
26(11):1493–1497, November 1991.
[UKY84] Masaru Uya, Katsuyuki Kaneko, and Juro Yasui. A CMOS Floating Point Multi-
plier. IEEE Journal of Solid-State Circuits, 19(5):697–702, October 1984.
[USO 93] Katsuhiko Ueda, Toshio Sugimura, Minoru Okamoto, Shinichi Marui, Toshihiro
Ishikawa, and Mikio Sakakihara. A 16b Low-Power-Consumption Digital Signal
Processor. In 1993 IEEE International Solid-State Circuits Conference, Digest of
Technical Papers, pages 28–29. IEEE, February 1993.
[VBB93] Joseph Varghese, Michael Butts, and Jon Batcheller. An Efﬁcient Logic Emulation
System. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
1(2):171–174, June 1993.
[Vil82] W. Vilkelis. Lead Reduction Among Combinational Logic Circuits. IBM Journal
of Research and Development, 26(3):342–348, May 1982.
[vMWvW 86] Jef van Meerbergen, Frank Welten, Frans van Wijk, Jan Stoter, Jos Huisken,
Antoine Delaruelle, and Karel Van Eerdewijk. An 8 MIPS CMOS Digital Signal
Processor. In 1985 IEEE International Solid-State Circuits Conference, Digest of
Technical Papers, pages 84–85. IEEE, February 1986.
[vN66] John von Neumann. Theory of Self-Reproducing Automata. University of Illinois
Press, 1966. Compiled by Arthur W. Burks.
[VPP 89] Peter Voss, Leo Pfennings, Cathal Phelan, Cormac O’Connell, Thomas Davies,
Hans Ontrop, Simon Bell, and Roelof Salters. A 14-ns 256K×1 CMOS SRAM
with Multiple Test Modes. IEEE Journal of Solid-State Circuits, 24(4):874–881,
August 1989.
[VSCZ96] John Villasenor, Brian Schoner, Kang-Ngee Chia, and Charles Zapata. Conﬁg-
urable Computer Solutions for Automatic Target Recognition. In Proceedings
of the IEEE Workshop on FPGAs for Custom Computing Machines. IEEE, April
1996.
[WBEK 88] Todd Williams, Kenneth Beilstein, Badih El-Kareh, Roy Flaker, Gregory Graven-
ites, Robert Lipa, Hsing-San Lee, Joseph Maslack, John Pessetto, William F.
Pokorny, Michael Roberge, and Harold Zeller. An Experimental 1-Mbit CMOS
SRAM with Conﬁgurable Organization and Operation. IEEE Journal of Solid-State
Circuits, 23(5):1085 ff., October 1988.
[WBS 87] Karl Wang, Mark Bader, Vince Soorholtz, Richard Mauntel, Horacio Mendez,
Peter Voss, and Roger Kung. A 21-ns 32K×8 CMOS Static RAM with a Selectively
Pumped p-Well Array. IEEE Journal of Solid-State Circuits, 22(5):704–712,
October 1987.
[WC96] Ralph D. Wittig and Paul Chow. OneChip: An FPGA Processor With Recon-
ﬁgurable Logic. In Proceedings of the IEEE Workshop on FPGAs for Custom
Computing Machines, Los Alamitos, California, April 1996. IEEE Computer Society,
IEEE Computer Society Press.
http://www.eecg.toronto.edu/~wittig/thesis.description.html.
[WDW 85] Frank Welten, Antoine Delaruelle, Frans Van Wyk, Jef Van Meerbergen, Josef
Schmid, Klaus Rinner, Karel Van Eerdewijk, and Jan Wittek. A 2-μm CMOS
10-MHz Microprogrammable Signal Processing Core with an On-Chip Multiport
Memory Bank. IEEE Journal of Solid-State Circuits, 20(3):754–760, June 1985.
[WH95] Michael J. Wirthlin and Brad L. Hutchings. A Dynamic Instruction Set Computer.
In Peter Athanas and Ken Pocek, editors, Proceedings of the IEEE Workshop on
FPGAs for Custom Computing Machines, Los Alamitos, California, April 1995.
IEEE Computer Society, IEEE Computer Society Press.
[WHG94] Michael J. Wirthlin, Brad L. Hutchings, and Kent L. Gilson. The Nano Processor:
a Low Resource Reconﬁgurable Processor. In Duncan A. Buell and Kenneth L.
Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Computing
Machines, pages 23–30, Los Alamitos, California, April 1994. IEEE Computer
Society, IEEE Computer Society Press.
[WHS 87] Tomohisa Wada, Toshihiko Hirose, Hirofumi Shinohara, Yuji Kawai, Kojiro
Yuzuriha, Yoshio Kohno, and Shimpei Kayano. A 34-ns 1-Mbit CMOS SRAM
Using Triple Polysilicon. IEEE Journal of Solid-State Circuits, 22(5):727–732,
October 1987.
[WOI 89] Shigeyoshi Watanabe, Yukihito Oowaki, Yasuo Itoh, Koji Sakui, Kenji Numata,
Tsuneaki Fuse, Takayuki Kobayashi, Kenji Tsuchida, Masahiko Chiba, Takahiko
Hara, Masako Ohta, Fumio Horiguchi, Katsuhiko Hieda, Akihiro Nitayama,
Takeshi Hamamoto, Kazunori Ohuchi, and Fujio Masuoka. An Experimental
16-Mbit CMOS DRAM Chip with a 100-MHz Serial READ/WRITE Mode. IEEE
Journal of Solid-State Circuits, 24(3):763–770, June 1989.
[Xil89] Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. The Programmable Gate
Array Databook, 1989.
[Xil91] Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. XC5200 FPGA Preliminary
Product Speciﬁcation, version 4.0 edition, June 1991.
http://www.xilinx.com/partinfo/5200.pdf.
[Xil94a] Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. The Programmable Logic
Data Book, 1989, 1994.
[Xil94b] Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. The Programmable Logic
Data Book, 1994.
[Xil96] Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. XC6200 FPGA Advanced
Product Speciﬁcation, version 1.0 edition, June 1996.
http://www.xilinx.com/partinfo/6200.pdf.
[YFJ 87] Jeff Yetter, Mark Forsyth, William Jaffe, Darius Tanksalvala, and John Wheeler.
A 15 MIPS 32b Microprocessor. In 1987 IEEE International Solid-State Circuits
Conference, Digest of Technical Papers, pages 26–27. IEEE, February 1987.
[YJY 90] Toshiaki Yoshino, Rajeev Jain, Paul Yang, Harvey Davis, Wanda Gass, and Ashwin
Shah. A 100-MHz 64-Tap FIR Digital Filter in 0.8 μm BiCMOS Gate Array. IEEE
Journal of Solid-State Circuits, 25(6):1494–1501, December 1990.
[YKF 94] Nobuyuki Yamashita, Tohru Kimura, Yoshihiro Fujita, Yoshiharu Aimoto, Takashi
Manabe, Shin’ichiro Okazaki, Kazuyuki Nakamura, and Masakazu Yamashina. A
3.84 GIPS Integrated Memory Array Processor with 64 Processing Elements and a
2-Mb SRAM. IEEE Journal of Solid-State Circuits, 29(11):1336–1343, November
1994.
[YKK 84] Takashi Yamanaka, Shigeru Koshimaru, Osamu Kudoh, Yakashi Ozawa,
Nobuyuoka, Hiroshi Ito, Hidehiro Asai, Nobuyuki Harashima, and Shinichi
Kikuchi. A 25 ns 64K Static RAM. IEEE Journal of Solid-State Circuits, 19(5),
October 1984.
[YKMI88] Toshio Yamada, Hisakazu Kotani, Junko Matsushima, and Michihiro Inoue. A
4-Mbit DRAM with 16-bit Concurrent ECC. IEEE Journal of Solid-State Circuits,
23(1), February 1988.
[YNH 91] Toshio Yamada, Yoshiro Nakata, Junko Hasegawa, Noriaki Amano, Akinori
Shibayama, Masaru Sasago, Naoto Matsuo, Toshiki Yabu, Susumu Matsumoto,
Shozo Okada, and Michihiro Inoue. A 64-Mb DRAM with Meshed Power Line.
IEEE Journal of Solid-State Circuits, 26(11):1506–1510, November 1991.
[YR95] Alfred K. Yeung and Jan M. Rabaey. A 2.4 GOPS Data-Driven Reconﬁgurable
Multiprocessor IC for DSP. In Proceedings of the 1995 IEEE International Solid-
State Circuits Conference, pages 108–109. IEEE, February 1995.
[YTN 85] Sho Yamamoto, Nobuyoshi Tanimura, Kouichi Nagasawa, Satoshi Meguro, Tokumasa
Yasui, Osamu Minato, and Toshiaki Masuhara. A 256K CMOS SRAM with
Variable Impedance Data-Line Loads. IEEE Journal of Solid-State Circuits, 20(5),
October 1985.
[YYN 90] Kazuo Yano, Toshiaki Yamanaka, Takashi Nishida, Masayoshi Saito, Katsuhiro
Shimohigashi, and Akihiro Shimizu. A 3.8-ns CMOS 16×16-b Multiplier Using
Complementary Pass-Transistor Logic. IEEE Journal of Solid-State Circuits,
25(2):388–395, August 1990.