Brigham Young University

BYU ScholarsArchive
Theses and Dissertations
2011-11-28

Improved Stereo Vision Methods for FPGA-Based Computing
Platforms
Wade S. Fife
Brigham Young University - Provo

BYU ScholarsArchive Citation
Fife, Wade S., "Improved Stereo Vision Methods for FPGA-Based Computing Platforms" (2011). Theses
and Dissertations. 2745.
https://scholarsarchive.byu.edu/etd/2745

Improved Stereo Vision Methods for
FPGA-Based Computing Platforms

Wade S. Fife

A dissertation submitted to the faculty of
Brigham Young University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

James K. Archibald, Chair
Dah-Jye Lee
Doran K. Wilde
Randal W. Beard
Brent E. Nelson

Department of Electrical and Computer Engineering
Brigham Young University
December 2011

Copyright © 2011 Wade S. Fife
All Rights Reserved

ABSTRACT
Stereo vision is a useful yet challenging technology for a wide variety of applications.
One of the greatest challenges is meeting the computational demands of stereo vision applications
that require real-time performance. The FPGA (Field Programmable Gate Array) is a readily
available technology that allows many stereo vision methods to be implemented while meeting the
strict real-time performance requirements of some applications. Some of the best results have been
obtained using non-parametric stereo correlation methods, such as the rank and census transforms.
Yet relatively little work has been done to study these methods or to propose new algorithms based
on the same principles for improved stereo correlation accuracy or reduced resource requirements.

This dissertation describes the sparse census and sparse rank transforms, which significantly
reduce the cost of implementation while maintaining and in some cases improving correlation
accuracy. This dissertation also proposes the generalized census and generalized rank transforms,
which open up a new class of stereo vision transforms and allow the stereo system to be optimized
further, often reducing the hardware resource requirements.

The proposed stereo methods are analyzed, providing both quantitative and qualitative results
for comparison to existing algorithms. These results show that the computational complexity
of local stereo methods can be significantly reduced while maintaining very good correlation
accuracy.

A hardware architecture for the implementation of the proposed algorithms is also described,
and the actual resource requirements for the algorithms are presented. These results confirm
that dramatic reductions in hardware resource requirements can be achieved while maintaining
high stereo correlation accuracy.

This work proposes the multi-bit census, which provides improved pixel discrimination as
compared to the census, and leads to improved correlation accuracy with some stereo
configurations. A rotation-invariant census transform is also proposed and can be used in
applications where image rotation is possible.

Keywords: stereo vision, local methods, census transform, rank transform, generalized census,
generalized rank, multi-bit census, rotation-invariant census, FPGA, Helios
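
As a point of reference for the transforms named above, the following Python sketch illustrates
the standard census and rank transforms at a single pixel, together with the Hamming distance
commonly used to compare census values. It is an illustrative sketch only and not the
implementation developed in this work: the function names, the small test image, and in
particular the four-corner SPARSE_3X3 pattern are assumptions chosen for brevity. The sparse
pattern merely stands in for the idea of evaluating only a subset of the neighborhood.

```python
def census(img, r, c, offsets):
    """Census value at (r, c): one bit per neighbor offset,
    set when that neighbor is darker than the center pixel."""
    center = img[r][c]
    bits = 0
    for dr, dc in offsets:
        bits = (bits << 1) | (1 if img[r + dr][c + dc] < center else 0)
    return bits


def rank(img, r, c, offsets):
    """Rank value at (r, c): count of neighbors darker than the center."""
    center = img[r][c]
    return sum(1 for dr, dc in offsets if img[r + dr][c + dc] < center)


def hamming(a, b):
    """Number of differing bits between two census values."""
    return bin(a ^ b).count("1")


# Full 3x3 neighborhood: all 8 neighbors, in row-major order.
FULL_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)]

# Hypothetical sparse pattern (an assumption for illustration):
# only the four corner neighbors, halving the number of comparisons.
SPARSE_3X3 = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# Small illustrative image; values are arbitrary.
IMG = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]

full_bits = census(IMG, 1, 1, FULL_3X3)      # 0b11110000
sparse_bits = census(IMG, 1, 1, SPARSE_3X3)  # 0b1100
rank_value = rank(IMG, 1, 1, FULL_3X3)       # 4
```

Because both transforms depend only on the ordering of pixel intensities relative to the center,
applying a positive gain and an offset to every pixel (for example, 2v + 5) leaves the census
and rank values unchanged. This is one reason such methods tolerate brightness differences
between the two cameras.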

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

Chapter 1 Introduction
  1.1 Contributions
  1.2 Organization

Chapter 2 Background
  2.1 The Machine Vision Problem
    2.1.1 General Characteristics
  2.2 Platform Alternatives
    2.2.1 General-Purpose Processors
    2.2.2 Programmable Digital Signal Processors
    2.2.3 Graphics Processing Units
    2.2.4 Other Architectures
    2.2.5 Custom Integrated Circuits
    2.2.6 Summary and Future Evolution
  2.3 The Field Programmable Gate Array
    2.3.1 Overview of the Modern FPGA
    2.3.2 Advantages and Disadvantages of FPGA Technology
    2.3.3 Characteristics of FPGA Processing
  2.4 Previous Work in Stereo Vision
    2.4.1 Comparing Correlation Accuracy

Chapter 3 Algorithm Suitability
  3.1 Global Stereo Algorithms
  3.2 Suitable Algorithms for Real-time FPGA Implementation
  3.3 Characteristics of Local Methods

Chapter 4 Sparse Non-Parametric Transforms
  4.1 The Sparse Census Transform
    4.1.1 Motivation
    4.1.2 Census Transform Redundancy
    4.1.3 The Value of Non-Redundancy
    4.1.4 Neighborhood Selection
    4.1.5 Sparse Census Transform Correlation Accuracy
  4.2 The Sparse Rank Transform
    4.2.1 Motivation and Definition
    4.2.2 Sparse Rank Transform Correlation Accuracy
  4.3 Qualitative Performance Analysis
    4.3.1 Methodology
    4.3.2 Accuracy Without Rectification
    4.3.3 Accuracy of Existing Local Methods
    4.3.4 Sparse Census Transform Accuracy
    4.3.5 Sparse Rank Transform Accuracy
  4.4 Summary

Chapter 5 Generalized Non-Parametric Transforms
  5.1 The Generalized Census Transform
    5.1.1 Motivation and Definition
    5.1.2 Reduced Hardware Resource Requirements
    5.1.3 Characteristics of the Generalized Census Transform
    5.1.4 Generalized Census Correlation Accuracy
  5.2 The Generalized Rank Transform
    5.2.1 Definition
    5.2.2 Characteristics of the Generalized Rank Transform
    5.2.3 Examples of the Generalized Rank Transform
  5.3 Qualitative Performance Analysis
  5.4 Summary

Chapter 6 Hardware Implementation
  6.1 Previous Work
  6.2 Hardware Architecture
    6.2.1 Pipelining
    6.2.2 Memory Architecture
  6.3 Resource Requirements
  6.4 Summary

Chapter 7 The Multi-bit Census Transform
  7.1 Motivation
  7.2 Definition
  7.3 Hardware Implementation of a Multi-bit Census Transform
  7.4 Multi-bit Census Transform Correlation Accuracy
  7.5 Cost Versus Benefit
  7.6 Summary

Chapter 8 A Rotation-Invariant Census Transform
  8.1 Definition
  8.2 Graph Point Selection
  8.3 Accuracy of the Rotation-Invariant Census Transform
  8.4 Uniqueness of the Rotation-Invariant Census Transform
  8.5 A Rotation-Invariant Rank Transform
  8.6 Summary

Chapter 9 Conclusions and Future Work
  9.1 Future Work

REFERENCES

Appendix A Introduction to Stereo Vision
  A.1 Camera Models and Projection
  A.2 A Simple Stereo Configuration
  A.3 The Correspondence Problem
  A.4 Camera Calibration
  A.5 Epipolar Geometry and Rectification
  A.6 Problems with Stereo Matching
  A.7 Window Summing Optimizations

Appendix B Stereo Vision Methods
  B.1 Quantitative Stereo Evaluation
    B.1.1 Error Measures
    B.1.2 Pixel Categorizations
    B.1.3 Proposed Metrics
    B.1.4 Stereo Method Evaluation
  B.2 Preprocessing Methods
    B.2.1 Zero Mean Images
    B.2.2 Noise Filters
    B.2.3 Laplacian of Gaussian Filter
    B.2.4 The Rank Transform
    B.2.5 Preprocessing Combinations
    B.2.6 Summary of Preprocessing
  B.3 Correlation Methods
    B.3.1 Classical Similarity Measures
    B.3.2 Non-Uniform Windows
    B.3.3 Sparse Correlation Windows
    B.3.4 Reduced Image Data Width
    B.3.5 Multiple Window Methods
    B.3.6 The Census Transform
    B.3.7 Summary of Correlation Measures
  B.4 Post-Processing Methods

Appendix C Stereo Vision Components
  C.1 Image Rectification
  C.2 Neighborhood Processing
  C.3 Preprocessing Operations
  C.4 Implementing the Left-Right Consistency Check
  C.5 Pipelining the LRCC
  C.6 System Organization

Appendix D The Helios Robotic Vision Platform
  D.1 Related Work
  D.2 Design Constraints
    D.2.1 Performance
    D.2.2 Flexibility and Expandability
    D.2.3 Size and Weight
    D.2.4 Power Consumption
    D.2.5 Cost
    D.2.6 Component Availability
    D.2.7 Compatibility
  D.3 Helios System Concept
    D.3.1 The Stacking Concept
    D.3.2 Essential Components
  D.4 System Component Selection
    D.4.1 General-Purpose Processor
    D.4.2 FPGA
    D.4.3 Random Access Memory
    D.4.4 Non-volatile Storage
    D.4.5 Communications Interfaces
    D.4.6 Expansion Connector
    D.4.7 Simple I/O
    D.4.8 Clock Sources
    D.4.9 Power Supplies
    D.4.10 Resistors and Capacitors
  D.5 Printed Circuit Board Design
    D.5.1 Signal Integrity
    D.5.2 Layer Stackup
    D.5.3 Final Layout and Organization
  D.6 Production
  D.7 Helios Specifications
  D.8 Daughter Boards
  D.9 Intellectual Property
  D.10 Applications
  D.11 Limitations and Future Improvements

Appendix E Glossary of Acronyms and Abbreviations
  E.1 Acronyms and Abbreviations

LIST OF TABLES

3.1 Classic Local Stereo Methods
3.2 Performance of the Local Stereo Methods
4.1 Sparse Neighborhood Correlation Accuracy
4.2 Sparse Neighborhood Correlation Accuracy with Noise
4.3 Sparse Census Transform, Correlation Accuracy
4.4 Sparse Census Transform, Correlation Accuracy on Noisy Images
4.5 Revised Sparse Census Transform, Accuracy on Noisy Images
4.6 Revised Sparse Census Transform, Accuracy on Noiseless Images
4.7 Sparse Rank Transform, Correlation Accuracy
4.8 Sparse Rank Transform, Correlation Accuracy on Noisy Images
4.9 Revised Sparse Rank Transform, Correlation Accuracy on Noisy Images
4.10 Bit-Optimized Sparse Rank, Correlation Accuracy on Noisy Images
4.11 Algorithm Parameters Used for Qualitative Comparison
5.1 Generalized Census Transform, Correlation Accuracy
5.2 Generalized Census Transform, Correlation Accuracy on Noisy Images
5.3 Revised Generalized Census Transform, Correlation Accuracy
5.4 Generalized Rank Transform, Correlation Accuracy
5.5 Generalized Rank Transform, Correlation Accuracy on Noisy Images
6.1 Stereo Correlation Resource Usage
7.1 Thresholding and Encoding for |T| = (−2, 0, 2)
7.2 Multi-bit Census Accuracy Comparison, 4-edge Generalized Census
7.3 Multi-bit Census Accuracy Comparison, 8-edge Generalized Census
7.4 Multi-bit Census Accuracy Comparison, 16-edge Generalized Census
7.5 Multi-bit Census Hardware Resource Benefit
8.1 Number of Points for Minimum Deviation in a Circular Graph
B.1 Correlation Accuracy for Various LoG Kernels
B.2 Data Width Requirements for Rank-Transformed Images
B.3 Preprocessing Method Summary
B.4 Sparse Correlation Accuracy
B.5 Correlation Method Accuracy Summary
B.6 Comparison of LRCC Effectiveness for Various Methods
D.1 Virtex-4 LX and FX Cost Per CLB
D.2 Xilinx Virtex-4 FX Model Features
D.3 Xilinx Virtex-4 FX Packages
D.4 Xilinx Virtex-4 FX Pricing
D.5 DRAM and SRAM Memory Characteristics
D.6 Mobile SDR and DDR SDRAM Comparison
D.7 Helios Simple I/O Interfaces
D.8 Helios Component Voltages
D.9 Helios Power Supply Specification
D.10 Helios Specifications Summary
D.11 Helios Power Consumption

LIST OF FIGURES

2.1 SMW Correlation Windows
2.2 MSW Correlation Windows
2.3 Common Stereo Image Datasets
2.4 Common Stereo Datasets with Ground-Truth Data
2.5 Training Stereo Image Datasets
2.6 Disparity Ground-Truth for Training Image Datasets
2.7 Evaluation Stereo Image Datasets
2.8 Disparity Ground-Truth for Evaluation Image Datasets
2.9 Small Difference in Correlation Accuracy
4.1 Census Transform Graphs
4.2 Census Transform and SHD Combined Comparisons
4.3 Asymmetry Caused by a Non-Redundant Census Neighborhood
4.4 Sparse Census Neighborhood Comparison
4.5 Census Transform Neighborhood Point Contribution
4.6 Census Transform Neighborhood Point Contribution with Noisy Images
4.7 Sparse Census Transform Neighborhoods
4.8 Sparse Census Transform Neighborhoods for Noisy Images
4.9 Real-world Test Images
4.10 16-bit Census Without Rectification
4.11 Disparity Images for Several Local Methods on Pole Images
4.12 Disparity Images for Several Local Methods on Hydrant Images
4.13 Disparity Images for Several Local Methods on Spillway Images
4.14 Disparity Images for Several Local Methods on Driveway Images
4.15 Sparse Census with Pole Images
4.16 Sparse Census with Hydrant Images
4.17 Sparse Census with Spillway Images
4.18 Sparse Census with Driveway Images
4.19 Sparse Rank with Pole Images
4.20 Sparse Rank with Hydrant Images
4.21 Sparse Rank with Spillway Images
4.22 Sparse Rank with Driveway Images
5.1 4-Point Census Transform Graphs
5.2 Non-Redundant 2-Point Census Transform Graphs
5.3 Modified 2-Edge Census Transform Graphs
5.4 Sparse and General Census Transform Window Size Comparison
5.5 Redundant Generalized Census Example
5.6 Asymmetric Generalized Census Example
5.7 Reduced Spread of Generalized Census
5.8 Generalized Census Transform Graphs
5.9 Revised Generalized Census Transform Graphs for Noisy Images
5.10 Generalized Rank Transform Cancellation
5.11 Generalized Rank Transform Redundancy
5.12 Generalized Rank Transform Examples
5.13 Generalized Census with Pole Images
5.14 Generalized Census with Hydrant Images
5.15 Generalized Census with Spillway Images
5.16 Generalized Census with Driveway Images
5.17 Generalized Rank with Pole Images
5.18 Generalized Rank with Hydrant Images
5.19 Generalized Rank with Spillway Images
5.20 Generalized Rank with Driveway Images
6.1 General Stereo Correlation Architecture
6.2 Proposed Correlation Architecture
6.3 Similarity Module Based on Previous Work
6.4 Proposed Similarity Module
6.5 Memory Requirements for Proposed Similarity Module
6.6 Preprocessing Architecture
6.7 Memory Required for Proposed Architecture Using Generalized Census
6.8 Memory Required for Proposed Architecture Using Generalized Rank
6.9 Total Memory as a Function of Transform Window Size
7.1 Correlation Accuracy vs. Threshold Set (|T| = 3)
7.2 Correlation Accuracy vs. Threshold Set (|T| = 5)
8.1 Circular Census Graph
8.2 Alternate Circular Census Graph
8.3 Discrete Circular Graphs
8.4 Average Deviation from Ideal Graph
8.5 Fraction of Incorrect Census Bits
8.6 Rotation-Invariant Census Uniqueness
8.7 Average Deviation from the Correct Match
A.1 Pinhole Camera
A.2 Pinhole Camera Projection Model
A.3 Canonical Stereo Configuration
A.4 The Tsukuba Image Set
A.5 Window-Based Stereo Matching
A.6 Epipolar Geometry
A.7 Field of View Occlusion
A.8 Point Occlusion
A.9 Object Border Changes
A.10 Object Distortion
A.11 Window Summing Optimization Using Column Sums
A.12 Window Summing Optimization Using Row Sums
B.1 Correlation Accuracy vs. SAD Window Size
B.2 Salt and Pepper Noise
B.3 Correlation Accuracy vs. Median Filter Size
B.4 Correlation Accuracy vs. Gaussian Filter Standard Deviation
B.5 The LoG Function
B.6 Correlation Accuracy vs. LoG Sigma
B.7 Correlation Accuracy vs. LoG Sigma and SAD Window Size
B.8 Correlation Accuracy vs. LoG Kernel Size
B.9 Common LoG Kernel Approximations
B.10 Improved LoG Kernel Approximations
B.11 Correlation Accuracy vs. Rank Size and SAD Window Size
B.12 Correlation Accuracy vs. Rank Transform Size
B.13 Correlation Accuracy vs. SAD Window Size
B.14 Correlation Accuracy vs. SSD Window Size
B.15 Correlation Accuracy vs. NCC Window Size
B.16 Classical Methods Correlation Accuracy Comparison
B.17 Classical Methods Correlation Accuracy with Noise
B.18 SAD Correlation Accuracy with a Gaussian-Weighted Window
B.19 Square Window and Gaussian Weighted Window Comparison
B.20 Sparse Correlation Windows
B.21 Normal and Sparse SAD Window Size Comparison
B.22 Classical and Sparse Window Comparison
B.23 Average Correlation Accuracy with Reduced Data Width
B.24 Correlation Accuracy vs. SMW Subwindow Size
B.25 Correlation Accuracy vs. Modified SMW Subwindow Size
B.26 SMW Disparity Map Comparison
B.27 SMW Method Accuracy Comparison
B.28 Correlation Accuracy vs. Subwindow Size for SMW with LoG
B.29 Correlation Accuracy vs. Subwindow Size for SMW with Rank
B.30 Correlation Accuracy vs. MSW Subwindow Size
B.31 Correlation Accuracy vs. Subwindow Size for MSW with LoG
B.32 Correlation Accuracy vs. Subwindow Size for MSW with Rank
B.33 Correlation Accuracy vs. Census Size and SHD Window Size
B.34 Correlation Accuracy vs. Census Size for 13 × 13 SHD
B.35 Original SAD and SAD with LRCC Comparison
B.36 Teddy Dataset SAD LRCC Disparity Map Comparison
B.37 Original SAD and SAD with LRCC, Average Accuracy and DED
B.38 Comparison of LRCC Effectiveness for Various Methods
B.39 Teddy Dataset Disparity Images After LRCC
C.1 Spatial Filter
C.2 3 × 3 Sliding Window
C.3 3 × 3 Window Buffer
C.4 3 × 3 Window, Image Edge Overlap
C.5 Preprocessing Operation Structure
C.6 Correlation System Timing
C.7 Correlation Architecture with LRCC
C.8 Pipelined Correlation System Timing
C.9 Fully Pipelined Correlation Architecture with LRCC
C.10 System Organization for Stereo Vision
D.1 Helios Stacking Concept
D.2 Helios Layout Concept
D.3 JTAG Chain
D.4 SDRAM Timing Diagram
D.5 ZBT SRAM Timing Diagram
D.6 Memory Termination Topologies
D.7 Switching Regulator Efficiency
D.8 PCB Trace Current Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
D.9 PCB Trace with Ground Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
D.10 Helios PCB Cost vs. Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
D.11 Helios PCB Layer Stackup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
D.12 Helios Board Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
D.13 Helios Board Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
D.14 Helios Power Planes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323

xvii

D.15 The Helios Board . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
D.16 Helios PCB Cost vs. Quantity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
D.17 Helios Assembly Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
D.18 Component Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
D.19 Helios Board Costs Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
D.20 Helios Daughter Boards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
D.21 Small Robotic Research Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
D.22 Robot Racers Senior Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334

CHAPTER 1. INTRODUCTION

Since the invention of digital imagery, there has been significant interest in the ability
to extract useful information from images. Through many decades of research, a wide variety
of image processing and machine vision techniques have been developed that make this desire,
to a great extent, a reality. Of particular interest to many is the problem of extracting depth or
distance information from imagery, in much the same way humans and animals can with their
vision systems. Through the use of two cameras, much like the two eyes of a person, a computer
system can determine the disparity between points in the two views to estimate the distance to
those points in each view. This is generally referred to as stereo vision.
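As a minimal worked illustration (not taken from this chapter), assume an idealized, rectified camera pair with parallel optical axes, a focal length f expressed in pixels, and a baseline B between the cameras. These symbols and the numeric values below are assumptions introduced here for illustration only. Under that model, a point observed at horizontal image coordinates x_L and x_R has disparity d and depth Z:

```latex
\[
d = x_L - x_R, \qquad Z = \frac{f\,B}{d}
\]
% Hypothetical values: f = 700 px, B = 0.1 m, d = 14 px
\[
Z = \frac{700 \times 0.1\ \text{m}}{14} = 5\ \text{m}
\]
```

Because depth is inversely proportional to disparity in this model, nearby points produce large disparities and distant points produce small ones.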
One of the great challenges of stereo vision is the amount of computation required to perform this depth extraction. For some applications, the amount of time required to execute a stereo
vision algorithm for a given image pair may not be very important. In these cases, virtually any
computer system with sufficient memory and access to image data can perform the computation.
However, many applications have some sort of real-time requirements regarding how quickly the
computation must be performed in order for the system to be useful. Additionally, the physical size,
weight, power consumption, cost, and durability of the computing platform may also be restricted
due to the needs of the application.
Many computing platforms exist that could be used to accelerate stereo vision algorithms
and/or reduce the physical requirements of the overall system. For a brief overview of such systems, see Section 2.2. One technology that has shown great promise as an image processing
platform for such real-time machine-vision applications is the field-programmable gate array, or
FPGA. These devices can be programmed to implement any custom logic circuit, giving them
functionality similar to a custom microchip, but with an ability to be reprogrammed that is similar
to a general-purpose computer. The ability to implement custom logic allows the FPGA to be tailored and optimized specifically for the target application. It also allows the designer to leverage
much of the parallelism inherent in an algorithm by creating multiple instances of logic resources
that can operate in parallel.
A wide variety of stereo vision algorithms have been proposed in the literature. Some
of these are well-suited to implementation on FPGA-based computing platforms due to the high
levels of parallelism that can be easily leveraged. Others, due to their complexity and/or the dynamic nature of their execution, are better suited to implementation on software-based computing
platforms. Most of the literature has focused on complex software-based algorithms and implementations, making it difficult to obtain from the literature an understanding of the relative performance of algorithms that are better suited for implementation in custom digital circuitry. This
problem is aggravated by the lack of analysis comparing the relative accuracy of the stereo results
of these FPGA-based algorithms using consistent or generally accepted image datasets and error
metrics. Furthermore, little has been done to compare the hardware resource requirements of the
best-performing algorithms using a common FPGA architecture or metrics.
This dissertation focuses on what are commonly called local stereo methods [1], which are
well-suited to FPGA implementation. In particular, the non-parametric rank and census transforms
have been shown to provide superior correlation accuracy among other local stereo vision methods. In this dissertation, new stereo vision algorithms based on non-parametric transforms will
be presented, including analysis of algorithm design, correlation accuracy, and implementation
requirements.
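For readers unfamiliar with these transforms, the following is a minimal Python sketch of the rank and census transforms in their commonly described form, applied to a single square window. The function names and the sample window are hypothetical, and the sketch is not the implementation analyzed in this dissertation; the sparse and generalized variants proposed later are not reproduced here.

```python
def census_transform(window):
    """Return the census bit string (as an int) for a square window.

    window: list of rows of pixel intensities with an odd side length.
    Each neighbor contributes one bit: 1 if it is darker than the center.
    Bits are packed in raster order, most significant bit first.
    """
    n = len(window)
    c = n // 2
    center = window[c][c]
    bits = 0
    for r in range(n):
        for col in range(n):
            if r == c and col == c:
                continue  # the center pixel is not compared with itself
            bits = (bits << 1) | (1 if window[r][col] < center else 0)
    return bits


def rank_transform(window):
    """Return the number of neighbors that are darker than the center."""
    n = len(window)
    c = n // 2
    center = window[c][c]
    return sum(
        1
        for r in range(n)
        for col in range(n)
        if (r, col) != (c, c) and window[r][col] < center
    )


def hamming_distance(a, b):
    """Number of differing bits between two census values."""
    return bin(a ^ b).count("1")
```

In this sketch the rank value equals the number of set bits in the census value, and two census values are compared with the Hamming distance. Only comparisons and bit counting are involved, with no multiplication.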

1.1 Contributions
This dissertation presents several new stereo vision algorithms that are based on non-parametric
transforms. First, the sparse census transform, which is based on the census transform
from a previous work [2], is defined and analyzed. This analysis extends the earlier work by providing a thorough study of the sparse implementation, describing its characteristics and presenting
strategies for the design of a sparse census transform.
A new transform, called the sparse rank transform, is also proposed. The stereo vision
algorithm based on this new transform has characteristics similar to the original rank transform
while allowing for a reduction in hardware resource requirements.

This dissertation also proposes and analyzes the generalized census and generalized rank
transforms. These are in fact supersets of the sparse census and sparse rank transforms as well
as the original census and rank. These generalized forms allow the stereo algorithms to be further optimized for the target application while reducing the hardware resource requirements and
maintaining correlation accuracy.
The proposed algorithms are analyzed both quantitatively, using standard stereo image
datasets, and qualitatively, using typical images taken with a stereo camera system under real-world conditions. The results for a variety of existing stereo methods are also shown for comparison. The results demonstrate that these algorithms provide very good correlation accuracy
despite the vast reduction in computational complexity. In fact, some of the proposed transforms
consistently outperform the original census and rank methods on the standard test images.
The purpose of these new stereo vision methods is to reduce the amount of hardware resources required for an implementation in custom logic while maintaining correlation accuracy.
Accordingly, a hardware architecture for the implementation of these algorithms is introduced and
analyzed. Several hardware optimizations are described to extend the previous work that has been
described in the literature. Actual hardware resource requirements are presented along with the correlation accuracy for various configurations of the proposed algorithms. The resource requirements
and accuracy of traditional algorithms are also shown, allowing for direct quantitative comparison.
These results confirm that dramatic hardware resource savings are possible with minimal loss in
correlation accuracy.
Additionally, this work introduces the multi-bit census transform, which enhances
the proposed and existing census methods by providing better pixel discrimination. This allows
slightly better correlation accuracy to be achieved for some stereo configurations at the expense
of increased resource requirements. Finally, a rotation-invariant census transform is proposed that
can be used in applications where image rotation is possible.

1.2 Organization
Chapter 2 provides a brief overview of machine vision and computing platforms used to
implement machine vision systems. The FPGA, which is the target platform of this work, is then
introduced. This chapter concludes with an overview of previous work in stereo vision.
Chapter 3 provides an analysis of the suitability of different stereo vision algorithms for
real-time implementation using custom hardware. This provides a context for the rest of the dissertation and explains the motivation for its focus on local stereo methods and the non-parametric
transforms. The correlation accuracy of several traditional algorithms is provided and can be used
as a baseline for understanding the level of correlation accuracy made possible by the proposed
algorithms.
Chapter 4 introduces and analyzes the sparse census transform and the sparse rank transform. This leads to the introduction of the generalized rank and census transforms in Chapter 5.
Both quantitative and qualitative results for the correlation accuracy of each of these stereo methods will be presented.
Chapter 6 describes a hardware architecture that is well-suited to the implementation of the
proposed stereo methods. The resource requirements for several configurations of the proposed
stereo methods are presented along with the resource requirements of traditional methods. The
correlation accuracy is also shown so that the trade-off between accuracy and resource requirements can be understood.
Chapter 7 introduces the multi-bit census transform and describes its characteristics and
performance. Chapter 8 introduces a rotation-invariant census transform and demonstrates how
closely it matches the census transform of the rotated image.
Finally, Chapter 9 provides some conclusions and highlights potential areas for future research.
Several appendices are also provided for reference. Appendix A provides an overview
of the stereo vision problem for less-familiar readers. Appendix B is a survey and exploration of
existing stereo methods that are well-suited to real-time implementation in custom hardware. Many
of the insights gained here are used throughout this dissertation in order to tune the parameters
of the proposed stereo vision algorithms. Appendix C provides additional detail regarding the
hardware implementation of stereo algorithms that is not presented in Chapter 6. Appendix D
describes the Helios Robotic Vision Platform, a very compact embedded computer intended for
real-time vision applications, such as stereo vision. Finally, Appendix E provides some definitions
for acronyms that are used throughout the dissertation.

CHAPTER 2. BACKGROUND

2.1 The Machine Vision Problem
In order to properly understand the suitability of different computing platforms for applications
in real-time machine vision, we must first have a general understanding of the algorithms
and computations that are typical of such vision systems. This will lead to an understanding of
why FPGAs are well-suited for the implementation of certain stereo vision algorithms. Due to the
wide variety of machine vision algorithms that have been developed, it is not practical to classify
all algorithms into well-defined computational categories. However, the vast majority of computation time in machine vision systems is spent performing fundamental image processing tasks, or
tasks which are computationally similar. This section will discuss some of the properties of typical
image processing operations as well as other processing typical of machine vision systems.

2.1.1 General Characteristics
One important factor that influences how images are processed is the way data are transferred
from the imaging device to the computing system. Most image sensors output an image one pixel
at a time, starting in the upper-left corner of the image, then moving horizontally and outputting
the pixels row by row. There are some variations of this order, but generally images are delivered
pixel by pixel and row by row. As a result, many image processing implementations tend to be
streaming in nature. That is, the data from the image sensor tends to flow like a stream from one
computational block to another. In fact, the computation for most signal and image processing
applications tends to be most naturally expressed in this way [3]. The alternative is to read the
image from the image sensor and store all or parts of the image in random access memory, adding
latency to the system but eliminating the effects imposed by the streaming nature of the data.
Additionally, a few image sensors have been developed that can be read using random access
patterns [4], rather than imposing the requirement of sequential readout.
To better understand the kind of computing platform that is best suited to machine vision
systems, let us consider the characteristics of spatial image processing, the dominant type of processing utilized in real-time machine vision systems. In a spatial image processing operation, each
pixel of the processed output image is a direct function of some neighborhood of pixels from the
input image. Upon close inspection of a typical spatial filter, several interesting properties can be
observed.
• Abundant parallelism. The parallelism available in the application of a single spatial filter
to an image is limited only by the size of the image. In theory, it is possible to compute all
output pixels for a single image simultaneously, or an image could be divided into sections
that are processed independently. In practice, however, due to the way in which data is
streamed from the imaging device, some buffering will be employed to make the parallelism
available. In this case, the parallelism is limited only by the buffering strategy and memory
available. Finer-grained parallelism can be realized in the computations for each pixel, each
of which involves arithmetic with several pixel values (multiple-data). It is also possible to
increase parallelism by overlapping the execution of separate image processing operations.
For example, suppose a system needs to convolve two spatial filters with an image. It is
possible to begin applying the second filter long before convolution with the first filter has
completed. Yet more parallelism can be achieved by processing multiple images in parallel,
perhaps from separate image sensors. An ideal computing platform will need to be able to
take advantage of the parallelism inherent in image processing operations.
• Repeated computation. The computation of a spatial filter is normally identical for each
pixel. Only the input data change as we compute the value of each output pixel. As a result,
a limited number of operations and a limited number of instances of those operations are
needed in order to implement the computation. In the context of a software program, this
means that image processing filters tend to be very tight and simple loops of execution. In the
context of custom hardware, this means that image processing filters tend to require a small
amount of hardware resources. As a result, image processing filters can be implemented
quite efficiently using custom hardware.
• Spatial and temporal locality. For each output pixel to be processed, several pixels of input
data will be required. This collection of input pixel data is spatially located in the image near
the coordinate of the pixel being processed. With a typical imaging system, where pixels are
streamed row by row, this means that pixels on the same image row that are needed for
the computation will be spatially located near each other in the image data stream. In fact,
pixels that are adjacent in the image will be adjacent in the image data stream. Pixels that are
adjacent vertically in the input image will be separated by a fixed amount in the data stream,
where the amount corresponds to the width of the image. Because the distances between
the needed image pixels in the image data stream are always fixed for a given filter and
image width, relatively small and simple memory architectures are sufficient to implement
the buffering needed for an image processing filter.
Because image data in a typical imaging system is delivered in a streaming fashion,
row by row, the data needed for image processing filters also exhibits strong temporal locality. That is, data that were used recently will likely be used again shortly in a future step
of the computation. In a typical implementation of a spatial filter, for each output pixel that
is to be computed we can reuse most of the data that was used for the computation of the
previous output pixel.
For example, consider a spatial filter that uses an N × N kernel, which normally
requires access to N^2 pixels of input image data in order to compute an output pixel. For
each output pixel that is computed, we can reuse N^2 − N of the input data that was used for
the computation of the adjacent output pixel. Only N new input pixels need to be read. This
represents a data reuse efficiency of (1 − 1/N) · 100%. This locality relationship holds in
both the horizontal and vertical directions in the image. As a result, the data reuse efficiency
can be improved even further by reusing input data from the processing of previous rows of
the image. This results in an implementation where only one new pixel needs to be read for
the computation of each output pixel. A computing platform that can easily take advantage
of this locality is likely to be well suited for machine vision applications.
• Various operations. The arithmetic computations performed by spatial filters used in machine vision systems may vary widely depending on the application. The computation may
involve multiplication and addition (e.g., convolution) using coefficients that are real numbers, simpler whole numbers, or possibly convenient powers of two. Or the computation may
involve non-linear operations such as sorting (e.g., the median filter and other rank filters).
Some algorithms may require filters involving Boolean operations at the bit level (e.g., the
non-parametric local transforms [5]). Other algorithms may require adaptive filters, where
the operations to be performed depend on the contents of the image. Many computing platforms are not well suited to support this wide variety of computation types. For example,
general-purpose computers are optimized for general arithmetic and cannot as easily take
full advantage of simple coefficients, implement sorting, or perform complex bit-level operations.
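The locality properties listed above can be sketched in software. The following Python example is a minimal illustration under stated assumptions, not the hardware architecture developed in this dissertation: the function name `stream_box_filter`, the choice of a box (sum) filter, and the use of a `deque` as the history buffer are all hypothetical. It consumes pixels in raster order, reads each input pixel exactly once, and reaches every window pixel through offsets that depend only on the kernel size and the image width.

```python
from collections import deque


def stream_box_filter(pixels, width, n=3):
    """Apply an n x n box (sum) filter to a raster-order pixel stream.

    pixels: iterable of intensities delivered row by row, left to right.
    width:  number of pixels per image row.
    Yields (row, col, window_sum) for each fully populated window, where
    (row, col) is the window's bottom-right pixel. Each input pixel is
    read from the stream exactly once.
    """
    # Enough history to reach back (n - 1) rows and (n - 1) columns.
    depth = (n - 1) * width + n
    buf = deque(maxlen=depth)
    # Distances back from the newest pixel; fixed for a given n and width.
    offsets = [dy * width + dx for dy in range(n) for dx in range(n)]
    for i, p in enumerate(pixels):
        buf.append(p)
        row, col = divmod(i, width)
        if row >= n - 1 and col >= n - 1:
            yield row, col, sum(buf[-1 - k] for k in offsets)
```

For a 4 × 4 image whose pixels are numbered 0 through 15 in raster order, `list(stream_box_filter(range(16), width=4))` produces the four window sums 45, 54, 81, and 90. The buffer holds only (n − 1) · width + n pixels rather than the whole image, which reflects the small, simple memory structure described in the locality discussion.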
In order to efficiently implement a machine vision system, a computing platform will need
to be able to take advantage of the preceding characteristics, and not be significantly penalized by
any of them. If a platform’s architecture prevents it from benefiting from these traits then its use
may lead to an inefficient implementation of the vision system.
However, there is more processing done in typical vision systems than just that of spatial
filters and other highly regular computations. Although such computations make up the majority of machine vision processing, they may be classified as computationally intensive preprocessing [6] in many vision systems. The post-processing stage that follows may involve higher-level
processing, such as scene interpretation, pattern recognition, tracking, and control. This type of
processing often does not exhibit the same characteristics as those described above, but instead
may be much more irregular and better suited for a different kind of computing platform altogether. This suggests that a single computing platform, by itself, may not be the ideal choice for the
implementation of vision systems in general. Instead, a hybrid that combines multiple computing
devices, each well-suited to the different kinds of processing, would likely be a better alternative
at the system level.

2.2 Platform Alternatives
A wide variety of computing platforms have been used in machine vision systems. This
section will summarize some of the most common devices that have been used for vision systems
and will briefly describe their advantages and disadvantages.

2.2.1 General-Purpose Processors
By far, the most popular platform for machine vision systems has been the general-purpose
personal computer (PC) with its general-purpose processor (GPP), due to the ease of programming
and general availability. Despite the popularity of the general-purpose processor for machine vision
research, the platform has proven to be a relatively poor performer with regard to meeting the real-time performance constraints of machine vision applications. General-purpose processors, by their
very nature, are not meant to target a specific application. As a result, their architectures are not
ideally suited to the execution of typical image processing operations.
Due to great technological advances, general-purpose computers have improved in performance to a level where many image processing operations can be performed in real time. However,
this generally comes at the expense of large, complex systems with relatively high power consumption.

2.2.2 Programmable Digital Signal Processors
Programmable digital signal processors, or DSPs, became popular in the early 1980s. Born
out of the need for processors that could perform signal processing in real time, DSPs are highly
programmable processors optimized for signal processing operations. As a result, DSPs represent
an excellent compromise between general-purpose programmability and signal processing performance.
Programmable DSPs achieve improved performance over traditional general-purpose processors by providing an instruction set architecture (ISA) that allows for more parallelism to be
extracted from typical signal processing operations. These ISA features make it easy to exploit the
fine-grained parallelism inherent in image processing, but do not necessarily provide an improvement in the execution of traditional computer programs that do not involve signal processing.
Unfortunately, DSPs still suffer from many of the same inefficiencies and overhead as
general-purpose processors. DSPs essentially take advantage of increased levels of parallelism
through additional functional units that allow an increased number of operations to occur in parallel. However, the number and type of functional units is still strictly limited.
Due to their high level of programmability and relative ease of use, programmable DSPs
are an excellent match for many machine vision systems. Additionally, due to the general market
demand for DSPs for embedded applications, these devices tend to have very low power consumption compared to processors for general-purpose computers, typically in the range of a few
hundred milliwatts to a few watts. This makes them particularly well suited to very small systems.
Yet for many machine vision applications that require computationally intensive, video-rate image
processing, currently available DSPs may not be able to provide sufficient performance to meet
the system’s real-time constraints or may do so much less efficiently. For such systems, other
computing platforms may be needed.

2.2.3 Graphics Processing Units
In recent years, graphics processing units, or GPUs, have received significant attention
as general-purpose, high-performance computing devices. Originally, these chips were designed
specifically for the computations involved in rendering and outputting high quality graphics on
general-purpose computers. However, as GPUs have become more advanced, the companies developing these chips have made significant progress in making them more general-purpose and
highly programmable. Today, GPUs are not just powerful graphics engines, but highly parallel, programmable processors. For an introduction to general-purpose computing on GPUs, refer
to [7].
Modern GPUs are capable of taking advantage of significant amounts of parallelism in
computations. For example, the NVIDIA GeForce 8800 GTX features 16 streaming multiprocessor groups. Each streaming multiprocessor group has a shared instruction cache, a shared data
cache, a 16 kB shared memory, control logic, and 8 stream processors. Each stream processor
within a group can perform 32-bit integer or 32-bit floating-point arithmetic. Combined with its
programming model, the GPU effectively becomes a multi-threaded, SIMD processor with a very
large number of functional units. The processors also have access to various memories, using
scatter and gather memory operations. This architecture, with its 384-bit memory interface and
1.35 GHz operating frequency, allows the GeForce 8800 GTX to provide up to 86.4 GB/s of memory bandwidth and over 330 billion floating-point operations per second (GFLOPS), well above
that of a high-end, general-purpose processor [7], [8].
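The quoted peak figures can be roughly reproduced from the parameters given above. The following worked arithmetic is a sketch under two assumptions that are not stated in the text: each stream processor completes one multiply-add per cycle, counted as two floating-point operations, and the memory runs at an effective 1.8 billion transfers per second (a 900 MHz double-data-rate clock).

```latex
\begin{align*}
\text{stream processors} &= 16 \times 8 = 128 \\
\text{peak arithmetic rate} &\approx 128 \times 1.35\ \text{GHz} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} = 345.6\ \text{GFLOPS} \\
\text{memory bandwidth} &= \frac{384\ \text{bits}}{8\ \text{bits/byte}} \times 1.8 \times 10^{9}\ \tfrac{\text{transfers}}{\text{s}} = 86.4\ \text{GB/s}
\end{align*}
```

Under these assumptions the results are consistent with the figures of "over 330 GFLOPS" and 86.4 GB/s cited above. These are theoretical peaks, and sustained application performance may be considerably lower.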
The highly parallel architecture of GPUs makes them well suited to a variety of computationally intensive applications. Recently, GPUs have also been used to accelerate many machine
vision applications when used as coprocessors in general-purpose computers. Such applications
include optical flow [9], stereo vision [10]–[12], feature tracking [13], and other machine vision
applications. GPUs are particularly well suited to algorithms that benefit from floating-point operations, since floating-point capabilities are less commonly available on programmable DSPs.
When applied to very small systems, the GPU has many disadvantages. For example, GPUs
do not function as standalone processors, but instead as coprocessors to a general-purpose processor, typically in a general-purpose desktop computer. As a result, current systems using GPUs
tend to be very large. Additionally, GPUs tend to be very power hungry. The power consumption
of modern high-performance GPUs may approach or even exceed the power consumption of the
general-purpose processor acting as host. When all necessary components are combined, including
GPU, memory, and the host computer system, the power consumption of a GPU-based system can
be very high. This makes GPUs unsuitable as the computing platform for most small systems.

2.2.4 Other Architectures
Many other computing architectures have been proposed that have seen limited or no attention
in the machine vision research community but that could be used as the computing platform for
many machine vision systems. One such platform is the Cell Broadband Engine Architecture, more
commonly known as the Cell processor [14]. This processor, well known as the processor in the
Sony PlayStation 3 gaming system, has also received significant attention as a high-performance
computing platform [15].
The Cell processor consists primarily of one host processor, called the Power Processor
Element (PPE), as well as multiple Synergistic Processor Elements (SPEs). The PPE is a Power Architecture-compliant, general-purpose processor. The SPEs are superscalar SIMD processors,
each with a small memory for local store and a sophisticated DMA controller for external memory
access. The Cell processor has not been widely used for machine vision applications and, as a
result, its suitability for machine vision applications is not well understood. Based on its general-purpose nature and its suitability for many high-performance applications, the Cell processor seems
to be a worthy platform for further evaluation as a machine vision processor. However, like many
high-performance platforms, the Cell suffers from high power consumption. Additionally, the Cell
processor is generally only available in board-level forms, such as game systems, server blades, or
PCI Express accelerator boards, which are not well suited for many applications.
Another architecture that has received a fair amount of attention is the Stream Processor.
First proposed as the Imagine Stream Processor [3], [16], the concept eventually evolved into
the Storm-1 processor, which was produced by Stream Processors Inc. The architecture of these
stream processors is designed to take advantage of the streaming nature of many signal processing
applications, such as video and image processing, by exploiting the parallelism and locality in
these applications. The architecture has also been tested for a variety of applications, including
that of stereo vision. The architecture consists of a host processor connected to a stream controller,
which together control the execution of the stream processor. Feeding the system is a DRAM
memory interface that can read streams of data and transfer them to the stream register file (SRF).
This unique register file is large and is designed specifically for streaming data, allowing data
to be buffered in such a way as to take advantage of data locality. The SRF feeds arithmetic
clusters, each with its own local register file (LRF) and a 6-wide, VLIW ALU supporting addition,
multiplication, and division on 32-bit integer and floating-point data types. This architecture allows
the stream processor to extract large amounts of parallelism from signal processing applications,
while remaining highly programmable. A performance evaluation is provided in [17].
Another family of solutions includes large arrays of simple, general-purpose processors.
The Ambric Am2000 family of massively parallel processor arrays (MPPAs) is a single-chip solution consisting of an array of many 32-bit RISC processors on a single chip [18]. Marketed mainly
toward high-speed video compression systems, the architecture allows for very high levels of parallelism to be achieved due to the large number of independent processors available. Similarly,
the Tilera TILE64 [19] offers a grid of general-purpose processor cores (called tiles), each with its
own level-1 and level-2 caches and supporting VLIW instructions for increased parallelism. The
TILE64 processor is marketed mainly toward advanced networking and digital video applications.
The MathStar Arrix represents another class of devices [20]. The Arrix devices are referred
to as field programmable object arrays (FPOA). Similar to FPGAs, which will be discussed in
Section 2.3, FPOAs are high-performance, programmable logic devices, but have a much coarser
granularity than a typical FPGA. Rather than an array of very general LUTs (lookup tables) and
flip-flops, the Arrix FPOA is a heterogeneous array of less general core objects. This architecture
allows for much of the flexibility of FPGAs while being able to operate at higher clock rates
(e.g., 1 GHz). Other coarse-grained configurable architectures have also been proposed in the
literature [21], [22].
The Element CXI ECA-64 processor [23], [24] is another example of new technology
intended to fill the gap between general-purpose processors and custom hardware designs. The
ECA-64 is a reconfigurable, heterogeneous array of elements. These elements include the BREO
(Bit Reorder), BSHF (Barrel Shifter), MEMU (Memory Unit), MULT (Multiplier), SALU (Super
ALU), TALU (Triple ALU), and SME (State Machine Engine). These elements are grouped into
zones, with four elements in each zone and a CSP (Crosspoint Switch) in the center of the zone,
allowing inexpensive communication between elements within the zone. Four zones, each with
a different combination of elements, make up a cluster. The ECA-64 is comprised of four such
clusters.
The Stretch processor family, from Stretch Inc., represents another point in the computing
platform design space [25]. Referred to as a Software Configurable Processor (SCP), this processor
family couples programmable logic, called the Instruction Set Extension Fabric (ISEF), with the
datapath of a 32-bit RISC processor (a Tensilica Xtensa processor). The ISEF is a heterogeneous
mix of ALU and multiplier arrays interconnected with a programmable routing fabric. Application
development begins with standard code which is then profiled and analyzed to identify the code,
typically the inner loops, that takes up most of the execution time. The tools then convert this code
to custom hardware that can be programmed on the ISEF, while attempting to take advantage of
parallelism inherent in the code. This is similar to the work done with PRISC [26], OneChip [27],
and other extensible processors.
Yet another category of solutions includes highly configurable intellectual property (IP) solutions that can be tailored and synthesized into custom silicon. Examples include the Tensilica
Xtensa [28], ARC 600/700, and the MIPS Pro Series cores with CoreExtend capability. These are
highly configurable programmable processors, delivered to the customer as IP, that can be configured for a specific application and extended using completely new instructions in order to improve
performance prior to hardware synthesis.
The preceding discussion surveys a number of architectures that have received some attention in the literature or achieved some commercial success. These architectures currently represent a
very small portion of the processor market and some are no longer available. Industry is slow
to abandon proven technologies in favor of new technologies and new development paradigms. It
seems unlikely that many of the platforms discussed in this section will find significant commercial
success. It is more likely that most of them will slowly die off, as many unsuccessful technologies have, in favor of proven technologies that take advantage of Moore’s law in order to solve
increasingly difficult problems.

2.2.5 Custom Integrated Circuits

The architectures proposed in the preceding sections all have advantages and disadvantages. None of them is a perfect solution for a particular machine vision application. Much of
the inefficiency of these platforms stems from the fact that they are fairly general-purpose architectures for solving a wide range of problems. Most companies have made great efforts to make
their products suitable for different applications so as to increase the potential for product sales.
Unfortunately, this generality comes at a cost. Take for example the general-purpose computer (see
Section 2.2.1), perhaps the most general platform of all those discussed. The internal floating-point
representation on general-purpose desktop computers with Intel processors is 80 bits wide. There
is a relatively small number of applications that can take advantage of this wide floating-point format, yet to make the computer as generally applicable as possible the designers chose to include
it. Many machine vision applications can be implemented without using floating-point representations at all, and if the designer does choose to use floating point, a 32-bit representation is usually
sufficient.
Performance and efficiency disadvantages like these can be eliminated by designing a custom hardware circuit with the application and algorithms in mind. The bit width and format of
the computer’s data types can be carefully chosen in advance and tailored for the algorithm to be
implemented. Only the operators required for the algorithm need to be included and hardware
resource sharing can be implemented in a way that optimizes the solution for the intended application. Only when using custom hardware do we have complete flexibility in optimizing the system
for a specific application. Such a chip design that has been tailored for a specific application is
called an application-specific integrated circuit (ASIC).
The ASIC solution, unfortunately, comes with its own disadvantages. Creating a custom
chip using the latest technology is a very expensive undertaking. The non-recurring engineering
(NRE) costs of chip fabrication and of setting up the fabrication facility to manufacture the chip are prohibitively high for all but high-volume or high-cost applications.
Other disadvantages of the ASIC solution include long turnaround time and lack of programmability. Any mistake in the resulting design can require that the design be fixed and that new chips
be fabricated, adding yet more time and cost to the project. If the application’s needs change or a
problem with the hardware is found at a later date, the ASIC cannot be altered. Instead, the chip
must be replaced with a redesigned version. This inflexibility is a major disadvantage of the ASIC
solution.

2.2.6 Summary and Future Evolution

From the perspective of overall physical characteristics, including performance, power consumption, size, and weight, a custom ASIC will always present the best solution. However, for a
technology to be useful in real world applications we must consider other factors, such as cost,
flexibility, and ease of development, which make the ASIC a much less attractive solution, particularly in low volume applications such as machine vision. As a result, we must find a compromise
between the lower performance but high flexibility of a general-purpose processor and the high performance but inflexibility of the ASIC. The computing platforms discussed in the previous sections
represent various levels of such a compromise. In addition to these platforms, there is another
platform that represents an effective middle ground between performance and programmability.
This platform, the field programmable gate array, or FPGA, will be discussed in some detail in the
following section.

2.3 The Field Programmable Gate Array

Another popular solution exists that represents a compromise between the high-performance, custom hardware ASIC and the comparatively inefficient general-purpose programmable
processor. The field programmable gate array (FPGA) [29], [30] is a fine-grained, homogeneous,
programmable logic device that has been used extensively for ASIC prototyping and emulation
as well as a platform for high-performance computing [31]. The devices are also widely used in
industry for applications and consumer products requiring high-performance digital signal processing, such as video processing applications.
Due to the popularity of FPGAs for a variety of applications, these devices can be sold
in high volumes, which makes them readily available at a fairly low cost. From an economic
standpoint, this makes them ideal for use in low-volume applications, such as machine vision
systems, since they provide behavior like that of a custom ASIC but do not incur the high initial
NRE costs of ASIC fabrication. Additionally, they are completely reprogrammable, allowing a
system using them to be tailored to the application or to be modified as needed for evolving conditions. FPGA technology, its advantages and disadvantages, as well as how FPGAs can be used to
create a machine vision system will be discussed in this section.

2.3.1 Overview of the Modern FPGA

Although many different FPGA architectures have been proposed, and the architecture and terminology vary between manufacturers, FPGAs generally consist of a homogeneous, two-dimensional array of configurable logic blocks, with logic blocks being interconnected horizontally
and vertically by a programmable interconnect fabric. This programmable interconnect allows for
data communication between logic blocks. Each logic block, in turn, consists of a small number
of logic elements. A logic element (LE) generally consists of one or two small lookup tables
(LUTs), each with a corresponding flip-flop. In many FPGA devices, each LUT is simply a 16-bit memory with a 4-bit input and a 1-bit output that can be used to implement any 4-input logic
function, although other LUT sizes and configurations have become popular as CMOS technology
has scaled. In addition to the LUTs and flip-flops, the logic elements typically also contain other
control and carry logic. The control logic allows each logic element to be used in different ways,
as needed. For example, control logic is present that makes the use of the flip-flop optional. That
is, the output of the LUT may be fed into the flip-flop, creating synchronous logic, or it may bypass
the flip-flop, creating purely combinational logic. Other logic is present in each logic element to
accelerate specific arithmetic operations, such as addition, by providing a more direct path between
logic elements over which carry logic can propagate. Another essential component of the FPGA is
the input/output (I/O) blocks that allow the programmable logic inside the FPGA to connect to I/O
pins on the device.
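As a rough software illustration of the logic element described above, the following Python sketch models a 4-input LUT as a 16-bit configuration word addressed by its inputs, followed by an optional flip-flop. The function names (`make_lut4`, `lut4_read`, `logic_element`) are hypothetical, and the model ignores carry logic, timing, and routing. It is only meant to show why a 16-bit memory is enough to implement any 4-input logic function.

```python
def make_lut4(func):
    """Return the 16-bit LUT contents that implement a 4-input Boolean function.

    `func` takes four bits (a, b, c, d) and returns a truthy or falsy value.
    Bit number `address` of the result holds the output for that input pattern.
    """
    config = 0
    for address in range(16):
        a, b, c, d = ((address >> k) & 1 for k in range(4))
        if func(a, b, c, d):
            config |= 1 << address
    return config


def lut4_read(config, a, b, c, d):
    """Look up the 1-bit output: the four inputs simply form a memory address."""
    address = a | (b << 1) | (c << 2) | (d << 3)
    return (config >> address) & 1


def logic_element(config, inputs, ff_state, use_flip_flop):
    """Model one clock cycle of a simplified logic element.

    Returns (output, next_ff_state). With the flip-flop enabled, the output is
    the value stored on the previous clock edge (synchronous logic); otherwise
    the LUT output bypasses the flip-flop (combinational logic).
    """
    lut_out = lut4_read(config, *inputs)
    if use_flip_flop:
        return ff_state, lut_out
    return lut_out, ff_state


parity = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)  # 4-input XOR
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)    # 4-input AND
```

Changing the function implemented by the element only changes the 16-bit word, not the structure of the model, which mirrors how an FPGA is reconfigured by loading new LUT contents.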
In addition to the array of logic blocks, today’s FPGAs often contain a variety of other
embedded hardware features that increase performance, reduce power consumption, and generally make FPGAs more suitable for a wider range of applications. Such features include on-chip memory, hardware multipliers, hardware blocks tailored to common DSP operations, serializer/deserializer (SERDES) hardware for communication via high-speed serial interfaces, Ethernet
media access controllers (MACs), and even on-chip, general-purpose processors integrated into the
FPGA fabric. Different FPGA models may include different subsets and numbers of these features,
but nearly all include some number of on-chip memory blocks and hardware multipliers. In effect,
over time, FPGAs have become increasingly heterogeneous.

2.3.2 Advantages and Disadvantages of FPGA Technology

The contents of the LUTs, the behavior of the flip-flops, the interconnect fabric, the behavior of the I/O pins, and the behavior of all other blocks within the FPGA are highly configurable.
As a result of the general architecture provided by the FPGA, these devices are capable of emulating virtually any custom logic circuit, provided that the FPGA has a sufficient number of logic
blocks and other resources. For this reason, FPGAs are often used as prototyping platforms for
ASIC designs. However, the large amount of configurable hardware resources also allows them
to achieve very high levels of parallelism, making them well suited to many high-performance
computing applications, including image processing. FPGAs have been shown to provide much
higher data processing rates than programmable processors for certain kinds of computations (see,
for example, the comparisons in [32] and [33]), particularly those involving a large amount of parallelism. Additionally, FPGAs do not necessarily suffer from the many inefficiencies associated
with software execution, such as instruction fetching, branching, loading, and storing. Low power
consumption can also be a significant advantage of FPGAs when compared to general-purpose
processors. A typical FPGA design might consume orders of magnitude less power than a desktop computer processor, while providing superior performance. Together, all these performance
characteristics make FPGAs particularly well suited to embedded vision systems [34].
The programmability of FPGAs also gives them several advantages compared to the ASIC
solution. Due to its programmability, a design implemented in an FPGA is not fixed, but can be
modified as needed or even tailored to specific conditions. Changes or fixes can be integrated into
a design simply by downloading a new configuration to the FPGA. This download process takes
just a few seconds at no cost, as opposed to the weeks and high costs associated with having a new
iteration of an ASIC design fabricated.
The programmability of FPGAs also opens up new possibilities for the development and
debugging of a system implemented on an FPGA. Custom logic can be temporarily added to an
FPGA for testing purposes. Key signals and buses can be sampled and stored in the FPGA for later
evaluation. Potential enhancements to a design can be quickly implemented, tested, and evaluated.
After testing is complete, any additional testing or debugging logic can be removed to reduce
power and allow a smaller, less expensive FPGA to be used in the final implementation. This kind
of flexibility makes it very inexpensive and advantageous to develop and test designs using FPGA
technology, when compared to designs using an ASIC.
As with all devices, FPGAs of course have their disadvantages. Their high level of configurability comes with significant costs. When compared to a custom ASIC design built using similar
fabrication technology, the FPGA implementation will require much more chip area, have a lower
maximum operating frequency, and consume far more power. One recent study [35] showed that
the silicon area required for an FPGA implementation of a circuit was on average 18 to 35 times
greater than that of implementing the circuit in an ASIC at the same technology node, depending
on the extent to which the design could utilize the embedded hardware blocks of the FPGA. The
study showed that designs that could take advantage of the FPGA’s embedded hardware blocks,
such as hardware multipliers and on-chip memory, were the ones that used significantly less area.
Much of the extra area required for an FPGA implementation is due to the large amount of programmable fabric that interconnects the logic blocks. In addition to increased area, the study found
that the average critical path delay for the FPGA was from 3.0 to 3.5 times longer than that of the
ASIC, meaning that the ASIC could be clocked 3.0 to 3.5 times faster than an FPGA manufactured
with the highest speed grade. Dynamic power consumption was 7.1 to 14 times higher for the
FPGA, again depending on the use of embedded hardware blocks.
Another disadvantage of FPGAs, when compared to general-purpose programmable processors, is the long development time associated with custom hardware. This longer development
time is largely due to the lower level of abstraction at which custom hardware circuits are typically
designed. A digital hardware designer often works at the bit level and must consider the simultaneous interaction of a large number of circuits due to the substantial amount of concurrency that
exists in hardware execution. Additional details such as scheduling, pipelining, synchronization,
clocking, skew, fanout, and power issues must also be considered. On the other hand, a software
designer generally thinks about computation at a much higher level, in terms of functions, variables, and operations and, as a result, many of the details that plague hardware design can be
blissfully overlooked by the software designer. The FPGA, despite its flexibility, manifests similar
disadvantages regarding debugging when compared to general-purpose programmable processors.
Due to the difficulties of hardware development, there has been a significant effort to develop a variety of tools that allow engineers to develop hardware circuits at a higher level of abstraction. Although great progress has been made, these tools come with their own disadvantages.
As a result, most digital design is still done using traditional hardware description languages such
as VHDL and Verilog.

2.3.3 Characteristics of FPGA Processing

Despite the performance advantages that can be realized using FPGAs, these devices are not a perfect match for all types of computer processing. Similar to many DSPs, FPGAs are not
well suited to complex control processing, where the desired system behavior cannot be described
in terms of a few simple, well defined equations. Algorithms that are well suited to processing on
FPGAs have the following characteristics:
• Substantial parallelism. This is the primary source of the FPGA’s performance advantage.
Due to the programmable hardware resources available, a single FPGA can implement hundreds, or perhaps thousands, of hardware processing blocks. Operating in parallel, these
functional units can deliver substantial overall throughput. Perhaps the most common way
to exploit the parallelism available in FPGAs is by creating long processing pipelines, which
exploit temporal parallelism. Much like the assembly line for the construction of an automobile, each stage in the data processing pipeline performs a small step in the processing
before passing the result along to the next pipeline stage. With all stages operating in parallel, substantial throughput can be achieved. Another way to exploit FPGA resources is
through spatial parallelism, where independent hardware blocks operate on separate data
streams. In order for an algorithm to benefit from FPGA implementation, large amounts of
parallelism must be inherent in the algorithm.
• Streaming data flow. To take full advantage of heavy pipelining, the data flow in the algorithm to be implemented should be for the most part streaming in nature. Algorithms that
require large amounts of random memory access (i.e., gather and scatter operations) tend
to be less efficiently implemented using FPGAs. Although these algorithms can be implemented using FPGAs, they tend to be memory or I/O bound, which may become the limiting
performance factor. Additionally, there is significant hardware overhead required to implement a complex memory interface, which takes away from the hardware resources available
for implementing the algorithm’s computation.
• Simple computation. FPGAs are not well suited to algorithms that require a large number
of different and complicated computations. Examples include high-precision floating-point
operations, logarithms, trigonometric functions (e.g., sine, cosine, etc.), square root, and
division. When necessary, these operations are often implemented using a lookup table
due to the large amount of hardware required to compute them. FPGAs are much better
suited to simple arithmetic operations, such as addition, subtraction, and multiplication, using fixed-point arithmetic. FPGAs are also well suited to Boolean operations and complex
bit manipulations.
• Strong data locality. Like virtually all computing platforms, FPGAs benefit greatly from
spatial and temporal locality. In particular, FPGAs benefit from data access patterns that are
highly regular. For example, to achieve the highest efficiency on FPGAs, the stride between
data accesses for each iteration of a computation should be fixed. This allows for very simple
buffering techniques to be used in place of complex memory subsystems. Fortunately, this
kind of regularity is typical of many signal processing computations, such as convolution,
and is very typical of image processing operations.
• Limited decision complexity. In a hardware implementation of a system, each decision to
be made is generally implemented as a separate hardware datapath. Therefore, if a system
needs to be able to perform any one of a large number of tasks based on the result of a
computation then significant amounts of hardware may be required to implement the various decision paths. Additionally, synchronization hardware may be required to synchronize
the results of the different paths. FPGAs are much better suited to algorithms where the
computation to be performed at each step is always the same or is one of a small number
of different possibilities. Complex control processing tends to be better implemented using
software-programmable processors.
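The streaming, fixed-stride access pattern described in the list above can be mimicked in software to show why simple buffering is sufficient. The sketch below is an illustration only, not a design from this work: it assumes pixels arrive one at a time in raster order and keeps just enough of them in a single shift-register-style buffer to form an n × n window. The generator name `stream_windows` and its parameters are hypothetical.

```python
from collections import deque


def stream_windows(pixels, width, n=3):
    """Yield (center_x, center_y, window) for every complete n x n window.

    `pixels` is an iterable delivering one pixel per step in raster order for an
    image that is `width` pixels wide; `n` is assumed to be odd. Only the last
    (n - 1) * width + n pixels are retained, mimicking line buffers in hardware.
    """
    depth = (n - 1) * width + n
    buf = deque(maxlen=depth)  # the oldest pixel falls out automatically
    for index, pixel in enumerate(pixels):
        buf.append(pixel)
        x, y = index % width, index // width
        # A window is complete once the buffer is full and the newest pixel
        # is far enough from the left edge of its row.
        if len(buf) == depth and x >= n - 1:
            window = [[buf[j * width + i] for i in range(n)] for j in range(n)]
            yield x - n // 2, y - n // 2, window


image_stream = list(range(16))  # a 4 x 4 test image, flattened in raster order
windows = list(stream_windows(image_stream, width=4))
```

Every window element sits at a fixed offset in the buffer, so no address computation or random memory access is needed as the stream advances.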
Notice the strong correlation between the list of algorithm characteristics described above
and the list of general characteristics of machine vision and image processing discussed in Section 2.1.1. It is clear that the computations typical of the image processing operations in machine
vision systems are well suited to implementation using FPGAs.

2.4 Previous Work in Stereo Vision

Stereo vision is one of the most published topics in the machine vision literature. An excellent survey of the different stereo vision algorithms is provided by Scharstein and Szeliski [1].
A less exhaustive but more recent survey is given by Brown et al. in [36]. This section will introduce only the most relevant research in the field of stereo vision, as it relates to efficient, real-time
implementations.
Most stereo vision research has focused on improving the accuracy of the disparity map,
and therefore the 3D reconstruction. Relatively little research has been published on achieving
real-time performance with stereo vision algorithms and even less has been published on the use of
custom hardware for the implementation of stereo vision. Generally, when real-time performance
is a concern, a simple SAD correlation method is employed (see Section A.3 for a definition of
SAD). The SAD metric is relatively efficient since it only requires simple arithmetic operations and
gives results that are nearly as good as other more computationally expensive similarity measures
(Section B.3.1).
Additionally, SAD lends itself to optimizations. One common SAD optimization is to
reuse the previous SAD computation for the adjacent window, as described in [37]. When this
is done, it is only necessary to add the SAD computation for the column on the leading edge of
the window and subtract the SAD computation for the trailing edge of the window. This reduces
the computational complexity of SAD from $O(M^2 N^2 d)$ to $O(M^2 N d)$. Similar optimizations can
be applied in the vertical direction, at the expense of additional storage, further simplifying the
complexity to $O(M^2 d)$.
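The horizontal reuse described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of [37]: it assumes grayscale images stored as lists of rows, a square window of radius `r`, and the convention that left pixel x is compared with right pixel x − d. The function names are hypothetical.

```python
def column_sad(left, right, x, y, d, r):
    """SAD over a single window column located at horizontal position x."""
    return sum(abs(left[y + j][x] - right[y + j][x - d])
               for j in range(-r, r + 1))


def sad_row_direct(left, right, y, d, r):
    """Recompute the full window SAD from scratch at every x (reference version)."""
    width = len(left[0])
    return {x: sum(column_sad(left, right, x + i, y, d, r)
                   for i in range(-r, r + 1))
            for x in range(r + d, width - r)}


def sad_row_incremental(left, right, y, d, r):
    """Slide the window along row y, updating the previous SAD instead of recomputing."""
    width = len(left[0])
    first = r + d
    out = {}
    if first >= width - r:
        return out
    total = sum(column_sad(left, right, first + i, y, d, r)
                for i in range(-r, r + 1))
    out[first] = total
    for x in range(first + 1, width - r):
        total += column_sad(left, right, x + r, y, d, r)      # leading edge enters
        total -= column_sad(left, right, x - r - 1, y, d, r)  # trailing edge leaves
        out[x] = total
    return out
```

Each step of the incremental version touches two columns instead of the whole window, which is the source of the reduction by one factor of the window size. Keeping running column sums across rows would apply the same idea vertically, at the cost of the additional storage mentioned above.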
Another common SAD optimization, when implemented on a PC, is to make use of SIMD
instructions to parallelize the SAD computation. This technique has been used by many researchers
and is described in detail by van der Mark and Gavrila [38] as well as Di Stefano, Marchionni, and
Mattoccia [39].
Despite its popularity for real-time implementations, SAD performs poorly near object
boundaries and is very sensitive to common stereo image imperfections, such as radiometric distortion (i.e., different pixel gain or bias between cameras), vignetting, and camera noise. Several
attempts have been made to improve upon SAD, many of these emphasizing low computational
complexity.
One of the most important improvements that can be applied to any correlation method
is commonly called the left-right consistency check (LRCC), which appears to have been first
proposed by Fua [40]. To perform this check, the stereo correlation algorithm is first performed as
usual. Then the roles of the right and left cameras are reversed (i.e., the reference image becomes
the search image and vice versa) and the stereo correlation is repeated. The two disparity maps are
then checked for consistency: if pixel $(x_l, y_l)$ in the left image has disparity $d$, then pixel $(x_r - d, y_r)$
in the right image must have disparity $-d$. Any pixels for which the disparity estimates differ are
rejected. This check is surprisingly effective, and is particularly good at eliminating incorrect
matches due to occlusions. In fact, Gautama et al. compared several error filtering methods and
found the left-right consistency check to be the most reliable and efficient [41]. This result seems
to have been confirmed by the work of Banks and Corke [42].
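A minimal sketch of the left-right consistency check for a single scanline is given below. It assumes two disparity arrays have already been computed, one per reference image, and it adopts the sign convention that a left pixel at column x with disparity d corresponds to the right pixel at column x − d, whose own disparity should then be −d. The function name and the use of `None` to mark rejected pixels are illustrative choices, not taken from the cited works.

```python
def left_right_check(disp_left, disp_right, invalid=None):
    """Keep only left-image disparities that the right-image map confirms.

    disp_left[x] = d   means left pixel x matched right pixel x - d.
    disp_right[x] = -d means right pixel x matched left pixel x + d.
    Pixels whose two estimates disagree are replaced by `invalid`.
    """
    width = len(disp_left)
    checked = []
    for x, d in enumerate(disp_left):
        xr = x - d  # column of the matching pixel in the right image
        if 0 <= xr < width and disp_right[xr] == -d:
            checked.append(d)
        else:
            checked.append(invalid)
    return checked


disp_left = [0, 1, 1, 2, 2]
disp_right = [-1, -1, -2, 0, 0]
result = left_right_check(disp_left, disp_right)
```

In this small example, the first and fourth left pixels are rejected because the right-referenced map assigns their matching pixels a different disparity.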

The naive implementation of the left-right consistency check performs the entire stereo
correlation algorithm twice. However, since the correlation with camera roles reversed performs
no new computation, it is possible to reduce the computational cost by reusing information from a
single stereo correlation run, at the expense of intermediate storage. Additionally, Di Stefano et al.
proposed an alternative method that has a lower computational cost and relies on the uniqueness
constraint [37]. When using this method, each time a candidate pixel that has been matched before
is compared to the current template window, the resulting similarity measure is compared to the
previous match. If the new similarity measure is better than the previous match then the old match
is rejected. This enforces the uniqueness constraint by ensuring that each pixel is only matched
once, and it gives the system the ability to recover from previous bad matches. Unfortunately, the
uniqueness constraint is not as strong as the left-right check and tends to accept more incorrect
matches.
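The uniqueness-based alternative can be sketched as follows for a single scanline. This is a simplified reading of the description above, not the algorithm of [37] itself: it assumes a precomputed table `costs[x][d]` of dissimilarity scores (lower is better) between left pixel x and right pixel x − d, and it assumes that a new match that is not better than the existing one is simply discarded, a detail the text does not specify.

```python
def match_with_uniqueness(costs):
    """Single-pass matching in which each right pixel keeps only its best partner.

    costs[x][d] is the dissimilarity between left pixel x and right pixel x - d.
    Returns a list with the accepted disparity for each left pixel, or None.
    """
    owner = {}  # right column -> (left column, cost) of the current best match
    for x, row in enumerate(costs):
        candidates = [(c, d) for d, c in enumerate(row) if x - d >= 0]
        cost, d = min(candidates)
        xr = x - d
        # If this right pixel was matched before, keep whichever match is better.
        if xr not in owner or cost < owner[xr][1]:
            owner[xr] = (x, cost)
    disparity = [None] * len(costs)
    for xr, (x, _) in owner.items():
        disparity[x] = x - xr
    return disparity


costs = [[5, 9], [4, 1], [2, 3], [7, 0]]
disparity = match_with_uniqueness(costs)
```

Here left pixels 0 and 2 lose their matches because later left pixels claim the same right pixels with lower cost, which illustrates how earlier bad matches can be overridden without a second correlation pass.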
One of the most important works published on improving area-based correlation methods
was that of Kanade and Okutomi on the adaptive window (AW) correlation method [43]. In this
method, a statistical model of intensity and disparity variation in the template window is used in
order to evaluate the suitability of the window size and shape being used for correlation. Several
window shapes and sizes are tested in order to find the best window size to use for each pixel correlation. For simplicity, only rectangular windows are used in practice, although the method extends
to arbitrary window shapes. Additionally, the method extends beyond SAD to other area-based
correlation methods. This method provides significant improvement to disparity map accuracy,
particularly at object boundaries where the window overlaps a region of non-constant disparity.
Unfortunately, the intensity and disparity variation calculation and the requirement to evaluate the
quality of several window shapes for each correlation makes the method computationally expensive.
The work of Kanade and Okutomi inspired other methods involving multiple windows. A
related work is that of Fusiello et al. in which they describe the symmetric multiple windowing
(SMW) method [44]. In this method, nine correlation windows, as shown in Figure 2.1, are used
instead of just one centered at the pixel location being compared. Only one of the nine windows
is actually selected and used for the similarity measure. The window yielding the best similarity
measure is always chosen, since this window is most likely to contain an area of constant disparity.
This method tends to result in less disparity error and usually requires less computation than the
adaptive window of Kanade and Okutomi. However, since the computed similarity measure does
not always include the region immediately surrounding the center pixel, it tends to result in more
errors in regions of constant disparity (see Section B.3.5).

Figure 2.1: SMW correlation windows. The dark pixel indicates the pixel for which the
similarity measure is being computed.
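The window selection in the SMW method can be sketched as below. Because the figure is not reproduced here, the sketch assumes the usual arrangement in which the nine windows place the pixel of interest at the window center, at the four edge midpoints, and at the four corners, which is equivalent to shifting the window center by −r, 0, or +r in each direction. SAD is used as the similarity measure, and all names are hypothetical.

```python
def window_sad(left, right, cx, cy, d, r):
    """SAD of the (2r+1) x (2r+1) window centered at (cx, cy) for disparity d."""
    return sum(abs(left[cy + j][cx + i] - right[cy + j][cx + i - d])
               for j in range(-r, r + 1)
               for i in range(-r, r + 1))


def smw_cost(left, right, x, y, d, r):
    """Return the best (lowest) SAD among nine windows that all contain (x, y)."""
    offsets = [(dx, dy) for dy in (-r, 0, r) for dx in (-r, 0, r)]
    return min(window_sad(left, right, x + dx, y + dy, d, r)
               for dx, dy in offsets)
```

Because the minimum may come from a window in which the pixel of interest lies on the border, the returned score does not always reflect the region immediately surrounding that pixel, which is the weakness noted above.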

A greatly improved technique, which will be referred to as the multiple supporting windows (MSW) method, was proposed by Hirschmüller [45]. Like the SMW method of Fusiello,
Hirschmüller’s method employs multiple windows. However, several different window configurations, shown in Figure 2.2, were proposed. Additionally, unlike Fusiello’s method, Hirschmüller’s
method always includes the similarity error measure of the center window and adds the error of
a subset of the surrounding windows. As it turns out, the inclusion of the center window is key
to achieving good results throughout the scene. In addition to the MSW method, Hirschmüller
proposed a simple general error filter, to identify matches based on areas of little or repetitive texture, and a border correction filter, to improve the results at disparity discontinuities. The most
important aspect of Hirschmüller’s proposed method is that it adds relatively little computational
cost to the correlation algorithm while providing significantly improved results.

Figure 2.2: Hirschmüller’s MSW correlation windows. Left, a configuration with
five equally sized overlapping windows, where the outer four windows overlap
by one pixel. Center, nine non-overlapping windows. Right, 25 non-overlapping
windows.

Other work has focused on the identification of new classes of correlation methods to replace the traditional area and intensity-based approaches. Zabih and Woodfill proposed the rank
transform and census transform, both non-parametric, local transforms, which rely on the relative
ordering of local pixel intensity values rather than the pixel values themselves [5]. The reliance
on the relative ordering of intensity values gives them high resistance to radiometric distortion,
vignetting, and noise, as well as other sources of outliers [5], [46], [47]. In particular, because
these transforms do not weigh how different pixel intensities are, they have a high tolerance to
mismatches caused by disparity discontinuities [5].
The rank transform of a pixel $p$ is defined as the number of pixels in the local neighborhood
whose intensity is less than the intensity of $p$. Or mathematically, the rank transform of a pixel,
$R(p)$, can be written as

\[
R(p) = \sum_{p' \in W(p)} \xi(p, p'), \tag{2.1}
\]

where $W(p)$ represents the neighborhood (e.g., an $N \times N$ window) centered about pixel $p$ and
$I(p)$ represents the intensity of pixel $p$. Note that [5] uses a different formulation where the rank
transform is defined as the cardinality of the set where the conditions of $\xi$ are satisfied. The version
here is equivalent and makes the similarity between the rank and census transforms more clear. The
function $\xi(p, p')$ is defined as follows:

\[
\xi(p, p') =
\begin{cases}
1 & \text{if } I(p') < I(p), \\
0 & \text{otherwise.}
\end{cases} \tag{2.2}
\]

Since $\xi(p, p)$ is always zero, the center pixel is generally ignored when computing the rank transform.
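The definition of the rank transform translates directly into code. The sketch below assumes a grayscale image stored as a list of rows and a square window of radius `r` that lies entirely inside the image. The function name and parameters are illustrative.

```python
def rank_transform(image, x, y, r):
    """Count the window pixels whose intensity is less than that of the center.

    The center pixel never satisfies the strict comparison with itself, so it
    contributes zero and does not need to be skipped explicitly.
    """
    center = image[y][x]
    return sum(1
               for j in range(-r, r + 1)
               for i in range(-r, r + 1)
               if image[y + j][x + i] < center)


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
rank = rank_transform(image, 1, 1, 1)

# Applying a gain and bias to every pixel leaves the ordering, and hence the
# rank, unchanged.
scaled = [[2 * v + 10 for v in row] for row in image]
rank_scaled = rank_transform(scaled, 1, 1, 1)
```

The second call illustrates the resistance to radiometric distortion: any change that preserves the relative ordering of intensities produces the same rank.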

The census transform of a pixel $p$ is defined as the bit string representing the set of neighboring pixels whose intensity is less than that of $p$. Or mathematically, the census transform of a
pixel, $C(p)$, can be written as

\[
C(p) = \bigotimes_{p' \in W(p)} \xi(p, p'), \tag{2.3}
\]

where $\bigotimes$ represents the concatenation operator and the values are always concatenated in the same
order. Again, since $\xi(p, p)$ always equals zero, the bit corresponding to the center pixel is not
usually included in the bit string.
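The census transform can be sketched in the same style. The example below assumes a grayscale image stored as a list of rows, a square window of radius `r` inside the image, and raster order (left to right, top to bottom) as the fixed concatenation order. The definition only requires that the order be consistent, so this particular order is an arbitrary choice, and the function name is hypothetical.

```python
def census_transform(image, x, y, r):
    """Return the census bit string of pixel (x, y) packed into an integer.

    Neighbors are visited in raster order and the center pixel is skipped;
    each neighbor contributes a 1 bit if it is darker than the center.
    """
    center = image[y][x]
    bits = 0
    for j in range(-r, r + 1):
        for i in range(-r, r + 1):
            if i == 0 and j == 0:
                continue  # the center bit would always be zero
            bits = (bits << 1) | (1 if image[y + j][x + i] < center else 0)
    return bits


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
census = census_transform(image, 1, 1, 1)
```

The number of 1 bits in the census value equals the rank transform of the same pixel, which is the relationship between the two transforms that the formulation above is meant to highlight.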
Both of these transforms can be thought of as preprocessing filters that are applied prior to
correlation with standard error measures, such as SAD. However, the census transform does not
produce a value that can be used directly in a SAD correlation, but instead produces a bit string. To
deal with this situation, SAD is modified to use Hamming distance instead of absolute difference,
resulting in a similarity measure that will be referred to as SHD (Sum of Hamming Distances).
Recall that the Hamming distance between two bit strings is defined to be the number of bits that
differ between the two strings. The SHD similarity measure can be written as
\[ \mathrm{SHD}: \sum_{i=-R}^{R} \sum_{j=-R}^{R} \mathrm{Hamming}\bigl(C_1(x_1 + i, y_1 + j),\, C_2(x_2 + i, y_2 + j)\bigr), \tag{2.4} \]
where C_n represents the left and right census-transformed images.
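Equation 2.4 can be sketched as follows, assuming the census-transformed images are stored as lists of rows of integers (one integer per census bit string). The names `hamming` and `shd` and the sample values are hypothetical and serve only to illustrate the computation.

```python
def hamming(a, b):
    """Number of bit positions in which two census strings differ."""
    return bin(a ^ b).count("1")


def shd(census_left, census_right, x1, y1, x2, y2, R):
    """Sum of Hamming distances over a (2R + 1) x (2R + 1) window."""
    total = 0
    for i in range(-R, R + 1):
        for j in range(-R, R + 1):
            total += hamming(census_left[y1 + j][x1 + i],
                             census_right[y2 + j][x2 + i])
    return total


left = [
    [0b1100, 0b1010, 0b0000],
    [0b1111, 0b0001, 0b0110],
    [0b0011, 0b1001, 0b0101],
]
right = [row[:] for row in left]
right[0][0] ^= 0b0001  # flip one bit
right[1][1] ^= 0b1111  # flip four bits
print(shd(left, right, 1, 1, 1, 1, 1))
```

Identical windows give an SHD of zero, and each differing census bit anywhere in the window adds one to the score.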
A few researchers have attempted to improve upon the work of Zabih and Woodfill. Gautama et al. observed that the value of the rank and census transforms depends heavily on the value
of the center pixel [41]. This makes them more susceptible to high-frequency noise (e.g., salt and
pepper noise). They proposed using the mean of the window rather than the value of the center
pixel when computing the rank and census. Of course, this adds to the computational cost of both
transforms, and it is not clear from their work how much improvement it provides.
Bhat and Nayar defined two new ordinal measures, called κ and χ [47]. However, these
measures suffer from low discriminatory power, a common defect in ordinal measures. To improve
upon this, Scherer et al. introduced a normalized ordinal correlation coefficient [48]. These new
ordinal measures, although interesting, introduce much higher computational costs, which seem to
outweigh their correlation improvements.
In a related line of research, Banks et al. introduced the rank order constraint, which must
be satisfied for correct matches, to improve the reliability of correlation involving the rank transform [49], [50]. Essentially, the rank constraint enforces consistency in the intensity ordering with
respect to the center pixels of the template and candidate windows. Use of the constraint can aid in
the selection of the correct match among several pixels having a comparatively similar SAD score.
Unfortunately, this algorithm also introduces significant computational overhead.
One of the most important shortcomings of much of the existing work is the lack of good
comparison between the different preprocessing, correlation, and error filtering methods most suitable for real-time implementation. The primary reason for this shortcoming has been the lack of
good image datasets on which to test and compare methods. Given a pair of stereo images, it is generally not known what the actual disparities should be, so there is no basis for comparison between
different methods using the image pair. Some of the most common stereo images used to evaluate
stereo vision systems are those from the JISCT (JPL-INRIA-SRI-CMU-TELEOS) and CMU-CIL
(Carnegie Mellon University Calibrated Imaging Laboratory) stereo image collections, of which
several samples are shown in Figure 2.3. These are relatively old, low-resolution, grayscale images with no disparity truth information available. Most authors who use these images rely simply
on qualitative analysis (i.e., a visual inspection) of the resulting disparity images. Other authors
have resorted to less meaningful measures, such as the percentage of pixels that were matched by
a stereo algorithm or the percentage that were later rejected by an error filter. Of course, these
metrics are only useful if the matched pixels are correctly matched and the rejected pixels were
incorrectly matched.
In the past, the most common source of quantitative comparison has been the use of synthetic images, often in the form of random dot stereograms, such as Figure 2.4(a). Unfortunately,
such images are not at all representative of the real world. Some synthetic images, such as Figure 2.4(b), have instead been generated by a 3D rendering program. Other images have been
created by combining planar objects in a scene, allowing ground-truth disparities to be easily calculated, such as Figure 2.4(c). For some of the early real images, such as Figure 2.4(d), select
image points were matched by hand, creating a very sparse ground-truth disparity map. The earliest real stereo image dataset with complete ground-truth data is the Tsukuba image dataset, shown
in Figure 2.5(a).
Figure 2.3: Common stereo image datasets without ground-truth disparity information: (a) “Trees”, (b) “Shrub”, (c) “Parking meter”, and (d) “Rocks”. Only the left image is shown here.

Fortunately, over the past few years, Scharstein and Szeliski, through Middlebury College,
have been generating and making publicly available a large number of high-quality stereo images
with accurate ground-truth disparity data [51]. For convenience, the rectification of these images
has already been performed. The true disparities for the most recent image sets were computed
using a very accurate method developed by Scharstein and Szeliski that uses structured light [52].
Most importantly, the availability of ground-truth disparity maps makes it much easier to quantitatively compare the results of various stereo vision algorithms using a variety of real stereo images,
whereas most previous works relied on one synthetic and/or one real image pair. As it turns out, results can vary dramatically between image datasets, making results based on a single stereo image
pair far less meaningful.

Figure 2.4: Common stereo image datasets with ground-truth disparity data: (a) Stereogram of a Square, (b) “Synth”, (c) “Map”, and (d) “Castle”. Only the left image is shown here. Image (d) has truth information for only a subset of the image points.

The disadvantage of these new datasets is that they may not be representative of the images
you would expect to see in some applications. This fact has led some researchers to use synthetic
3D renderings of relevant scenes instead (e.g., [38]). However, Scharstein and Szeliski have proposed that four of the readily available stereo image sets, with ground-truth data, be used in the

comparison of stereo vision algorithms, and most researchers have followed suit. These four image
datasets are shown in Figure 2.5 with ground-truth disparity maps shown in Figure 2.6.

Figure 2.5: Stereo image datasets proposed by Scharstein and Szeliski: (a) “Tsukuba”, (b) “Venus”, (c) “Teddy”, and (d) “Cones”. Only one camera image of each dataset is shown.

The four datasets of Figure 2.5 will be used extensively in this work to evaluate the quality of correlation algorithms. However, because this is a fairly small set of sample images, an
additional four images, also provided by Scharstein and Szeliski, will be used to provide a wider
sample set of images. Additionally, because the selection of algorithm parameters is somewhat
of an optimization problem, the additional four images will allow us to separate the images into a training set and an evaluation set. Therefore, the original set of images suggested by Scharstein
and Szeliski (Figure 2.5) can be used as the training set and the images of Figure 2.7 can be used
as the evaluation set for a given configuration. Figure 2.8 shows the ground-truth images for this
alternate set of images.

Figure 2.6: Disparity ground-truth images for the proposed stereo image datasets: (a) “Tsukuba”, (b) “Venus”, (c) “Teddy”, and (d) “Cones”.

Figure 2.7: Alternate stereo image datasets provided by Scharstein and Szeliski: (a) “Rocks”, (b) “Baby”, (c) “Wood”, and (d) “Cloth”. Only one camera image of each dataset is shown.

2.4.1 Comparing Correlation Accuracy
In the stereo accuracy comparisons of this dissertation, the differences between methods are sometimes very small (e.g., 1% or less). This leads to the question of how much improvement in accuracy is statistically significant. Clearly, very small differences in correlation accuracy do
not guarantee that one algorithm will always perform better than another, nor do they guarantee that
the final performance of the stereo system will be noticeably improved.
Figure 2.8: Disparity ground-truth images for the alternate stereo image datasets: (a) “Rocks”, (b) “Baby”, (c) “Wood”, and (d) “Cloth”.

The relative value of algorithms that demonstrate small differences in correlation accuracy
is better distinguished by the consistency of the results. For example, in cases where the correlation
accuracy is improved by a small amount, but is improved consistently with every image dataset
tested, we can have relatively high confidence that the improvement is real and is not an artifact
of the images or simply noise. On the other hand, when the improvement only manifests
itself in a small number of images, this suggests that the improvement is tied to the characteristics
of those images and therefore should not be considered a general improvement.
Where possible in this dissertation, I have attempted to point out these situations in order
to clarify the significance of the results. As a general rule, I have found that improvements in
correlation accuracy of around 0.5% or less should be considered with greater scrutiny since this
small difference may not be significant.

To provide a frame of reference for what a small difference in correlation represents, consider the example of Figure 2.9. This figure shows the results of two different stereo algorithms
on the Venus test images. The accuracy of Figure 2.9(a) is only 0.43% higher than the accuracy of
Figure 2.9(b), yet the difference is readily apparent.

Figure 2.9: Example of a small difference in correlation accuracy: (a) 97.05% accuracy, (b) 96.63% accuracy. The accuracy of (a) is 0.43% higher than (b).

CHAPTER 3. ALGORITHM SUITABILITY

3.1 Global Stereo Algorithms
The simplest stereo correspondence methods are generally referred to as local, or window-based, methods. In contrast, much of the more recent research in stereo vision has focused on
global methods. These methods make explicit assumptions about the smoothness of the disparity
map and pose the correspondence problem as an optimization problem. These algorithms then seek
to minimize a global cost function that combines such criteria as the disparity smoothness, point
order, geometric similarity, photometric similarity, and so forth. In this context, the simpler local
methods can be thought of as using a winner-take-all optimization approach, with the similarity
measure as the only comparison criterion.
As of this writing, the algorithms yielding the highest correlation accuracy model the disparity image as a Markov Random Field (MRF). For a more thorough introduction to such algorithms, see [53]. In these algorithms, each pixel in the disparity image is treated as a random
variable, x_p, for the disparity corresponding to pixel location p. The random variable can take on
one of a finite number of discrete states corresponding to the possible disparities for that pixel.
Similar to local methods, this global method uses a simple compatibility function, Φ(x_p, d_p), to
quantify the similarity between the intensity of pixel location p with the pixel at disparity x_p in the
corresponding image having intensity y_p. Where the methods differ is in the aggregation step.
The aggregation step is responsible for combining the information from multiple pixels
into a single cost for matching one pixel in one image to a pixel in another image. In a local
method, aggregation is performed by combining the costs for a window of pixels around the point
and its potential match in the two images, often by summing the costs over the window. This
assumes that the disparity is essentially the same over the area of that window and that no other
nearby pixels will give a similar cost measure. Unfortunately, in some images this is not always the
case. In contrast, the MRF approach aggregates information by introducing an additional function, Ψ(x_i, x_j), that expresses the compatibility between a variable, x_i, and an adjacent variable, x_j. This
leads to the following joint probability for the MRF:
\[ P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N) = \prod_{(i,j)} \Psi(x_i, x_j) \prod_{p} \Phi(x_p, d_p). \tag{3.1} \]
Taking the negative log of both sides makes it clear that maximizing the probability of Equation 3.1 is
equivalent to minimizing the following:
\[ -\log P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N) = \sum_{(i,j)} -\log[\Psi(x_i, x_j)] + \sum_{p} -\log[\Phi(x_p, d_p)]. \tag{3.2} \]
In this formulation, the terms become energy functions and the problem is often thought of as
an energy minimization problem. The use of Ψ(x_i, x_j) effectively creates a network of pair-wise
connections. Thus, each variable is able to influence all other variables in the field even though the
compatibility function only considers adjacent variables.
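To make the energy formulation concrete, the following sketch evaluates an energy of the form of Equation 3.2 on a tiny one-dimensional chain of three pixels with two disparity labels. Everything specific here is an assumption made for illustration: the data costs stand in for −log Φ, the smoothness cost (a weight times the label difference) stands in for −log Ψ, and exhaustive enumeration replaces graph cuts or belief propagation, which it could not do at realistic problem sizes.

```python
from itertools import product


def energy(labels, data_cost, smooth_weight):
    """Pairwise (smoothness) terms plus per-pixel (data) terms on a chain."""
    pairwise = sum(smooth_weight * abs(a - b)
                   for a, b in zip(labels, labels[1:]))
    unary = sum(data_cost[p][d] for p, d in enumerate(labels))
    return pairwise + unary


# Hypothetical costs: data_cost[p][d] plays the role of -log Phi.
# The middle pixel slightly prefers label 1, e.g. because of noise.
data_cost = [
    [0, 4],
    [3, 2],
    [0, 4],
]

# Winner-take-all: each pixel picks its own cheapest label.
wta = [min(range(2), key=lambda d: costs[d]) for costs in data_cost]

# Global optimum by brute force over all 2**3 labelings.
best = min(product(range(2), repeat=3),
           key=lambda labels: energy(labels, data_cost, 2))

print(wta, energy(wta, data_cost, 2))
print(list(best), energy(best, data_cost, 2))
```

In this toy case the smoothness term overrules the noisy middle pixel, which is the kind of behavior that separates global methods from a purely local winner-take-all choice.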
The two most common energy minimization algorithms are graph cuts [54] and the more
popular Bayesian belief propagation [55]. These methods obtain good approximations of the optimal solution of an NP-hard problem with speed that was not achieved previously.
In general, the belief propagation algorithm is O(Nk²T), where N is the number of pixels
in each image, k is the number of possible labels for each pixel, and T is the number of iterations
performed, which is typically √N. Some optimizations have been developed which reduce the
problem to O(NkT), where T is relatively small. However, a large amount of memory is required
to implement the algorithm [56].
In contrast, the most straightforward implementation of a local method would be no more
complex than O(NMd), where N is the number of pixels in each image, M is the number of pixels
in the correlation window, and d is the maximum disparity. In practice, through parallelization
and data reuse, most local methods become O(Nd). Thus, the computational complexity of the
belief propagation algorithm, with the most thorough optimizations, tends to scale similarly to
local methods.
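The scaling argument above can be checked with a rough operation count. The sketch below simply multiplies out the stated bounds with all constant factors ignored; the image size, label count, window size, and the "small" iteration count of 10 are hypothetical values chosen only for illustration.

```python
import math


def belief_propagation_ops(num_pixels, num_labels, iterations, optimized=False):
    """O(N k^2 T), or O(N k T) with the optimizations; constants ignored."""
    per_pixel = num_labels if optimized else num_labels ** 2
    return num_pixels * per_pixel * iterations


def local_method_ops(num_pixels, max_disparity, window_pixels=1):
    """O(N M d); window_pixels=1 models the O(N d) case with data reuse."""
    return num_pixels * window_pixels * max_disparity


# Hypothetical problem size, for illustration only.
N = 384 * 288            # pixels per image
k = d = 16               # disparity labels / maximum disparity
M = 11 * 11              # pixels in the correlation window
T_naive = math.isqrt(N)  # "typically sqrt(N)" iterations
T_small = 10             # assumed small iteration count

print(belief_propagation_ops(N, k, T_naive))
print(local_method_ops(N, d, M))
print(belief_propagation_ops(N, k, T_small, optimized=True))
print(local_method_ops(N, d))
```

With k equal to d, the optimized counts differ only by the factor T, which is consistent with the observation that the two approaches scale similarly even though their practical execution times differ.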
Nevertheless, the execution time of the minimization algorithm is still very high in practice compared to the simpler local stereo methods, for several reasons. For example, the message

update algorithm is relatively complex, requiring a lot of execution control and irregular memory
access patterns. This makes the algorithm much more difficult to parallelize and execute efficiently. Performance is also lower due to the amount of time required to complete each update.
Furthermore, the complexity, difficulty of parallelization, and memory requirements make such
optimization problems a poor match for FPGA execution (see Section 2.3). In contrast, local
methods use highly regular, repeated computations with large amounts of inherent parallelism,
making them very well suited to real-time implementation, particularly when using custom logic
on a reconfigurable platform.

3.2 Suitable Algorithms for Real-time FPGA Implementation
There are several interrelated factors that affect the suitability of a stereo vision algorithm for a given application. For example, the correlation accuracy determines how useful the depth map
is for each pair of images and directly impacts how difficult it is to extract meaningful information
from the depth map without erroneous results. The execution time will determine whether or
not the real-time requirements of the application can be met. The resource requirements have
even broader effects, determining the size, cost, and power consumption of the implementation.
The amount of resources used for the implementation is also related to the execution time of the
algorithm, with faster implementations requiring more resources.
As described in Section 3.1, the correlation accuracy is highest using state-of-the-art global
algorithms. However, other factors are made worse with these more complex algorithms. Many
real-time applications do not require the highest correlation accuracy, but may have strict requirements related to cost, size, power, or execution time. For these applications, local methods are
generally preferred.
The execution of a local method can be divided into a few steps. First, preprocessing
and/or image filtering is optionally performed to prepare the images for the stereo vision algorithm. Second, pixels are compared between the two images using some metric of pixel difference
or similarity. Third, the quantified similarity results are aggregated over some pixel window to
create an improved metric of the similarity between two pixels. These three steps typically involve
computation with the following characteristics:

• Highly repetitive computation repeated over windows of the two images
• A relatively small number of simple operations, such as addition, subtraction, and multiplication
• High data locality due to the windowed nature of the algorithms
• Independent data operations allowing for very high levels of parallelism
Note the similarity between the characteristics identified above and the characteristics identified as most suitable for FPGA processing in Section 2.3.3. This makes the FPGA a nearly ideal
platform for the implementation of local stereo algorithms.

3.3 Characteristics of Local Methods
A large number of local methods have been studied in the literature. The key differences between them are how they perform the image preprocessing, the matching cost computation, and
the selection of the correlation window. For the most part, local methods rely on summation to aggregate the cost computation for the window followed by a winner-take-all optimization strategy
for selecting the correct disparity (i.e., the pixel with the lowest matching cost is chosen as the correct match). Additionally, a method may perform certain post-processing steps, such as disparity
refinement to filter out bad matches or add missing information.
Table 3.1 shows a list of some of the most well-known local stereo methods and their characteristics. The list includes sum of absolute differences (SAD), sum of squared differences (SSD),
normalized cross correlation (NCC), symmetric multiple windows (SMW) [44], LoG followed by
SAD, multiple supporting windows (MSW) [45], rank [5], and census [5].
Table 3.2 gives the correlation accuracy (i.e., the percentage of correctly matched pixels,
or 100% − B using the notation of Scharstein and Szeliski [1]) of the local stereo methods on the
benchmark datasets proposed by Scharstein and Szeliski (Figure 2.5). The parameters used to obtain these results, which are also shown in the table, were selected so as to maximize performance
on the four image sets. To some extent, this makes the results overly optimistic for general use
since the parameters have been tuned to this dataset.

Table 3.1: Classic Local Stereo Methods
Method     Preprocessing      Matching Cost         Window
SSD        None               Squared difference    Square
SAD        None               Absolute difference   Square
NCC        None               NCC                   Square
SMW        None               Squared difference    Multiple square
LoG→SAD    LoG                Absolute difference   Square
MSW        None               Absolute difference   Center plus multiple square
Rank       Rank transform     Absolute difference   Square
Census     Census transform   Hamming distance      Square

Table 3.2: Performance of the Local Stereo Methods
Method     Parameters                                 Accuracy (%)
SSD        11 × 11 Window                             85.16
SAD        13 × 13 Window                             85.72
NCC        9 × 9 Window                               86.96
SMW        33 × 33 Window, 9 Subwindows of 17 × 17    86.70
MSW        21 × 21 Window, 9 Subwindows of 7 × 7      87.23
LoG→SAD    11 × 11 Window, 7 × 7 LoG, σ ≈ 1.0         87.63
Rank       17 × 17 Window, 7 × 7 Rank                 90.21
Census     13 × 13 Window, 7 × 7 Census               90.64

Although the correlation accuracy is very important, in a real-world application using a
hardware implementation, the hardware resource requirements needed to achieve a real-time implementation are just as critical. Based on the trade-off between the hardware resource requirements and the correlation accuracy of these algorithms, it is fairly easy to prune from this list the
methods that are less suitable for a real-time implementation using FPGAs.
For example, the SSD and SAD methods are very similar in their implementation. The
only difference is that SSD replaces absolute value with squaring when computing the dissimilarity
between pixels. As a result, additional multipliers are required to implement SSD, making it more
expensive. Despite the added resource requirements, the method has been shown to be no more
accurate than SAD. The difference is even more noticeable in noisy images where SSD tends to
overemphasize the differences between pixels due to the increased amount of image noise. From
the table, SSD appears to at least have the advantage of being able to achieve its best results with
a smaller window than SAD, but this is not as important as one might think for two reasons.
First, even when the SAD window size is reduced to match SSD, it still achieves higher accuracy
than SSD. Second, it has been known for some time that we can optimize the window summation
operation so as to make the size of the window irrelevant [57] (see also Section A.7).
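One well-known way to make the cost of a window sum independent of the window size is a summed-area table (integral image), sketched below. This is offered only as an illustration of the general idea; it is not necessarily the optimization described in [57] or Section A.7, and the function names are hypothetical.

```python
def integral_image(costs):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(costs), len(costs[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += costs[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def window_sum(sat, x, y, radius):
    """Sum over the (2*radius + 1)^2 window centered at (x, y).

    Always four table lookups, whatever the window size.
    Assumption: the window lies entirely inside the image.
    """
    x0, y0 = x - radius, y - radius
    x1, y1 = x + radius + 1, y + radius + 1
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


costs = [[5 * y + x for x in range(5)] for y in range(5)]  # values 0..24
sat = integral_image(costs)
print(window_sum(sat, 2, 2, 1), window_sum(sat, 2, 2, 2))
```

Because each window sum costs the same number of operations for any radius, a method that needs a larger window to reach its best accuracy pays little or no extra aggregation cost.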
NCC tends to provide some benefit in correlation accuracy over the simpler SAD and SSD
methods, but this comes at a great increase in computational complexity, including several multiplications, multiple window summations, a square root, and a division operation. Furthermore, we
can achieve better results with simpler methods. For example, simply applying the Laplacian of
Gaussian or rank transform filters before SAD can offer a significant boost in correlation accuracy.
Such filters can be implemented easily and very efficiently in custom hardware.
Of these classic algorithms, SAD is the simplest to implement and generally requires the
least amount of hardware. Yet, the rank method adds a relatively simple non-linear filter and
provides nearly the highest correlation accuracy of all the local methods. Furthermore, the rank
method has the additional benefit of reducing the number of bits required for the image representation, making the SAD correlation component less expensive to implement than traditional SAD.
Whether or not this actually makes the rank method more hardware efficient than SAD in achieving
the same level of correlation accuracy is a question that has not yet been answered in the literature,
but will be answered later.
The census method, which bears similarities to the rank method, provides the highest correlation accuracy and has been shown to be much more robust than the more traditional methods in
the face of common image defects. However, it is much more expensive to implement than the rank
method because of the large number of bits required to represent a reasonable census window (e.g.,
48 bits for the 7 × 7 census) and the use of Hamming distance rather than absolute difference to
determine pixel dissimilarity. The question of how much more hardware is required to implement
the census method has not been answered in the literature. Clearly, the rank and census methods
are based on sound principles that yield good results relative to other local methods. Much of the
rest of this dissertation is focused on the development of new algorithms based on those principles.

CHAPTER 4. SPARSE NON-PARAMETRIC TRANSFORMS

The rank and census stereo methods have been shown to provide superior correlation accuracy among many local stereo methods. Additionally, they are particularly well-suited to custom
hardware implementations using FPGAs. This chapter will discuss new sparse transforms based
on the original census transform and rank transform that reduce the computational requirements of
the stereo algorithm while, in some cases, actually increasing the stereo correlation accuracy. The
reduced complexity of these algorithms will lead directly to reduced resource requirements for a
hardware implementation.
A few previous works have also proposed alternate census transforms that could be considered sparse. For example, Zabih, one of the inventors of the original census transform, proposed
the non-redundant census [2]. This transform will be discussed in Section 4.1.2. Later, an alternate
sparse census transform was proposed that was based on regular spacing of the included transform
points [58]. More recently, a simple sparse census has been used in software implementations [59].
However, none of these works has proposed optimizing transform point selection for the characteristics of typical images, which will be explored here. Furthermore, to my knowledge, a sparse
rank transform has not yet been proposed or analyzed.
Section 4.1 will describe the sparse census transform and provide quantitative correlation
accuracy results for optimized implementations of the transform. Section 4.2 will introduce the
sparse rank transform and provide quantitative results for sample implementations. Section 4.3
will confirm the quantitative results by presenting the stereo disparity images that result from
actual stereo images taken under real-world conditions. Finally, Section 4.4 will summarize the
results.

4.1 The Sparse Census Transform
4.1.1 Motivation
The principal disadvantage of the census transform stereo method is the large size of the census vector (i.e., the number of bits). This size corresponds to the number of pixel comparisons
that are performed by the transform and directly affects the amount of hardware resources required
for the implementation of the stereo vision system.
Figure 6.1 (see Chapter 6 for a detailed discussion of this architecture and its hardware
resource requirements) shows a general correlation architecture that is well-suited to custom hardware implementation. This architecture is typical of actual stereo vision systems described in
the literature. The preprocessing blocks in the figure represent the hardware that would compute
the census transform. If we assume that the transform block is implemented as described in Sections C.2 and C.3 then the amount of logic required to implement the census transform computation
is roughly proportional to the size of the census vector.
The hardware resources required for the pixel distribution network are also directly proportional to the size of the census vector. This network feeds the similarity modules, also shown
in Figure 6.1, which implement the SHD computation. The SHD similarity measure uses the
Hamming distance to quantify the difference between two census vectors. Again, the hardware
resource requirement for the implementation of the Hamming distance computation tends to be
roughly proportional to the size of the census vector.
The size of the census vector also determines the size of the resulting Hamming distance
value. If an n-bit census vector is used then a ⌈log₂(n + 1)⌉-bit value is required to represent the
Hamming distance between two vectors. Thus, the census vector size also determines the size of
the addition circuits that implement the summing of the Hamming distances over the correlation
window, where the size of each adder is proportional to ⌈log₂(n + 1)⌉.
Furthermore, many stereo implementations will employ one of the summing optimizations
described in Section A.7. The width of the memories used to implement the optimization is directly
determined by the width of the Hamming distance value. The size of each select-best module in
Figure 6.1 is also influenced by the size of the Hamming distance value, since the number of bits

required to represent each window sum is a function of the correlation window size and the size of
the Hamming distance value.
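The bit widths discussed above are easy to tabulate. The sketch below assumes the worst case in which every census bit differs at every window position; the function names are hypothetical, and the 16-bit vector is an arbitrary example of a smaller census vector, not a configuration taken from this work.

```python
def hamming_bits(n):
    """Bits needed for a Hamming distance in 0..n, i.e. ceil(log2(n + 1))."""
    return n.bit_length()


def window_sum_bits(n, m):
    """Bits needed for the sum of m*m Hamming distances of n-bit vectors."""
    return (m * m * n).bit_length()


# 48-bit vector (7 x 7 census) versus a hypothetical 16-bit vector,
# both summed over a 13 x 13 correlation window.
for n in (48, 16):
    print(n, hamming_bits(n), window_sum_bits(n, 13))
```

Using `int.bit_length` avoids floating-point logarithms: for any non-negative integer n it equals ⌈log₂(n + 1)⌉ exactly.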
Overall, the size of the census vector has a dramatic effect on the size of the stereo correlation system hardware. This provides a strong motivation for reducing the census vector size, if
it can be done without significantly affecting the accuracy of correlation. As we will see, not only
can we dramatically reduce the size of the census vector, but we can do so while simultaneously
improving the accuracy of stereo correlation for most scenes. This is an important result that has
never been shown previously.

4.1.2 Census Transform Redundancy
The census correlation method can be thought of as two distinct steps: the census transform of each input image (Equation 2.3) followed by correlation using the SHD similarity measure
(Equation 2.4). However, it is interesting to consider the combined effect of the two.
When the two steps are combined, the overall effect is to compute dissimilarity by making
magnitude comparisons between two pixels in one image and then making the same magnitude
comparison between a potentially corresponding pair of pixels in the other image. If the relationship is the same (e.g., the relationships for both ordered pixel pairs are greater than) then zero is
added to the dissimilarity measure. If the relationship is not the same (e.g., one pixel pair shows a
greater than relationship while the other pair shows a less than or equal to relationship) then one
is added to the dissimilarity measure.
In order to better understand how the census correlation method works, I introduce a new
method of visualizing the census transform that borrows from graph theory. Figure 4.1 shows the
comparisons made for each image pixel in the 3 × 3 and 5 × 5 census transforms. Each directed
edge in the graph represents a comparison made between two pixels. The red box in the center
represents the pixel location for which the transform is being computed. In effect, each directed
edge in the graph represents a bit in the census transform vector that equals 1 if the tail node pixel
is greater than the head node pixel and 0 if the tail node pixel is less than or equal to the head node
pixel.

Figure 4.1: Census transform graph visualizations. Graph edges represent comparisons made by the census transform. Thus, each directed edge represents one bit of the census transform. The red box marks the center pixel for which the transform is being computed. (a) The 3 × 3 census transform. (b) The 5 × 5 census transform.

The census transform is only the first step in the census correlation method. The next step
is to sum the Hamming distances (SHD) of the census transform vectors for the template and
candidate windows.
Figure 4.2 shows all the comparisons that are made when the 3 × 3 census transform is
combined with 5 × 5 SHD. In this graph, each node now represents a pair of potentially corresponding pixels, one from the template window in one image and one from the candidate window
in the other image. The blue box represents the 5 × 5 correlation window over which the SHD
is computed (i.e., the template and candidate windows visually overlapped). In effect, a directed
edge from a node a to a node b in Figure 4.2 now represents a single bit that equals 0 if a and b
have the same magnitude relationship in both the left and right images (i.e., if a > b or a <= b in
both images). Otherwise the bit equals 1. The SHD similarity measure is then simply the sum of
all the bits.
Figure 4.2: 3 × 3 census transform with 5 × 5 SHD correlation, combined comparisons made for the similarity measure computation.

From this analysis we can see that, in general, an n × n census transform followed by m × m
SHD will combine the results of m²(n² − 1) pixel magnitude comparisons in the calculation of the
similarity measure for a single pair of pixels. In Section B.3.6 it was found that a 7 × 7 census
transform followed by 13 × 13 SHD correlation provided the highest accuracy on the test image
datasets. This configuration combines the results of 8,112 pixel comparisons in the computation of the similarity measure for each pair of pixels. This tremendous consolidation of information is
what allows the census correlation method to perform so well.
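As a quick check of the count, substituting the two configurations mentioned in the text into m²(n² − 1), first the 3 × 3 census with 5 × 5 SHD of Figure 4.2 and then the 7 × 7 census with 13 × 13 SHD, gives:
\[ 5^2(3^2 - 1) = 25 \times 8 = 200, \qquad 13^2(7^2 - 1) = 169 \times 48 = 8112. \]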
A careful inspection of Figure 4.2 reveals a deficiency in the census stereo method. Notice
that each comparison between pixels within the SHD correlation window (the blue box) is effectively performed twice. For example, assume that we compute the census transform for pixel p,
which includes a comparison with a nearby pixel p′. When we later compute the census transform
for pixel p′ it will also include a comparison with pixel p. The double inclusion of these relationships effectively causes them to be weighted twice as heavily as edges that go outside of the
correlation window. Also, the number of unique pixel relationships used in the computation of the
similarity measure, relative to the number of pixels in the transform neighborhood, is much lower
when this redundancy is present. This redundancy in the census transform method was first noted
by Zabih [2]. Zabih also described the first sparse census transform, which is designed to eliminate
this redundancy.
To create a sparse version of the census transform, we simply replace the window W(p)
in Equation 2.3 (the N × N neighborhood surrounding pixel p) with Ŵ(p), which is a sparsely
populated neighborhood about the pixel p, as shown in the equation

\[ C(p) = \bigotimes_{p' \in \hat{W}(p)} \xi(p, p'). \tag{4.1} \]

Recall that ⊗ represents concatenation and ξ(p, p′) is defined to be 1 if I(p) > I(p′) and 0 otherwise, where I(p) represents the intensity of pixel p.
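A minimal Python sketch of Equation 4.1 follows. It assumes the image is a list of rows of intensities, that the neighborhood Ŵ is given as a list of (row, column) offsets, and that bits are concatenated in list order with the first offset as the most significant bit. These conventions, and the function name, are choices made for illustration. Border handling is omitted, so the caller must keep every offset inside the image.

```python
def sparse_census(image, row, col, offsets):
    """Census vector of pixel (row, col) over the sparse neighborhood `offsets`.

    Each offset contributes one bit: 1 if the center pixel is brighter than
    the neighbor, 0 otherwise. Bits are concatenated in list order.
    """
    center = image[row][col]
    vector = 0
    for d_row, d_col in offsets:
        bit = 1 if center > image[row + d_row][col + d_col] else 0
        vector = (vector << 1) | bit
    return vector


image = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]

# Full 3 x 3 neighborhood (center excluded), in raster order.
full = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
# A hypothetical 4-point sparse neighborhood: up, right, down, left.
sparse = [(-1, 0), (0, 1), (1, 0), (0, -1)]

print(format(sparse_census(image, 1, 1, full), "08b"))    # 11110000
print(format(sparse_census(image, 1, 1, sparse), "04b"))  # 1001
```

Replacing W(p) with Ŵ(p) changes only the list of offsets, which is why the census vector length tracks the number of neighborhood points directly.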
Zabih observed that the redundancy can be eliminated if the neighborhood of pixels, Ŵ (p),
is chosen such that
\[ [i, j] \in \hat{W} \Rightarrow [-i, -j] \notin \hat{W}. \tag{4.2} \]

Such a sparse census neighborhood is shown in Figure 4.4(b), where the dark pixels indicate those
that are included in the neighborhood Ŵ . The standard, non-sparse census transform neighborhood
is shown in Figure 4.4(a). Figure 4.4(c) will be discussed in the next section.
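The condition of Equation 4.2 is easy to test programmatically. The sketch below builds one possible non-redundant neighborhood by keeping only the offsets that are lexicographically greater than [0, 0]. This particular selection is an assumption made for illustration and is not necessarily the pattern drawn in Figure 4.4(b).

```python
def full_neighborhood(n):
    """All offsets of an n x n window (n odd), excluding the center."""
    r = n // 2
    return [(i, j) for i in range(-r, r + 1) for j in range(-r, r + 1)
            if (i, j) != (0, 0)]


def is_non_redundant(offsets):
    """True if no offset [i, j] has its mirror [-i, -j] in the same set (Eq. 4.2)."""
    present = set(offsets)
    return all((-i, -j) not in present for (i, j) in present)


def half_neighborhood(n):
    """One of many non-redundant subsets: offsets lexicographically above (0, 0)."""
    return [offset for offset in full_neighborhood(n) if offset > (0, 0)]


full = full_neighborhood(7)
half = half_neighborhood(7)
print(len(full), is_non_redundant(full))  # 48 False
print(len(half), is_non_redundant(half))  # 24 True
```

Any selection that keeps exactly one offset from each mirrored pair satisfies the condition and yields 24 points for a 7 × 7 window.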
The sparse census transform has not been studied beyond this non-redundant form and the
accuracy of this transform when combined with SHD correlation has not been previously reported.
A large number of previously unknown sparse census transforms are also possible, both with and
without redundancies. A representative sampling of useful sparse transform neighborhoods will
be proposed and evaluated in the following sections.
There are also a few important characteristics of the non-redundant census transform neighborhoods that have not yet been discussed in the literature. One such characteristic is that they are
always asymmetric. For example, notice that the non-redundant neighborhood of Figure 4.4(b) has
more points on its right half than on its left half and more points on the top half than on the bottom
half. The net effect of this imbalance is that the SHD similarity measure for a pixel is no longer
centered about that pixel.
To illustrate, Figure 4.3 shows the combined effect of a non-redundant 3 × 3 census transform followed by 3 × 3 SHD correlation. Notice that twice as many comparisons are made along
the top and right edges of the correlation window as are made along the bottom and left edges. It
is well known that, in general, correlation accuracy is maximized when the correlation window is
centered around the point of comparison. In practice, if the SHD correlation window is sufficiently
large, the imbalance of the non-redundant transform may not have a significant effect on the accuracy of correlation. Nevertheless, Section 5.1 will introduce a new form of the census transform
that can eliminate such asymmetry.

Figure 4.3: Asymmetry caused by a non-redundant census neighborhood.
This example shows a non-redundant 3 × 3 census transform combined
with 3 × 3 SHD correlation.

Fortunately, the non-redundant census transform does have a very desirable characteristic,
which is that it results in a dramatic reduction of the size of the census transform vector. By using
the non-redundant, sparse census transform neighborhood of Figure 4.4(b) we reduce the census
vector size from 48 bits to 24 bits, a 50% reduction. As a result, the resource savings in a custom
hardware implementation of the census stereo method will also be dramatic.

4.1.3 The Value of Non-Redundancy

The non-redundant census transform seems very valuable because it approximates the full
census transform yet allows for a dramatically reduced census transform vector size. Yet there is
no reason why other arbitrary census transform neighborhoods cannot be used, possibly with even
fewer points for an even smaller census vector size. An interesting question is then whether or not
the non-redundant transform neighborhood provides some advantage in terms of stereo correlation
accuracy over another arbitrary transform neighborhood with a similar number of points.
To test the value of non-redundancy, stereo correlation was performed on the image datasets
using the three census transform neighborhoods of Figure 4.4. Notice that each point in the uniformly distributed sparse census transform neighborhood (Figure 4.4(c)) has a redundant counterpart. The transform also has the same number of points as the non-redundant neighborhood
(Figure 4.4(b)), results in a census transform vector of the same size, and therefore requires the
same amount of hardware resources to be implemented. The results of stereo correlation using the
three neighborhoods, when combined with 13 × 13 SHD, are shown in Table 4.1.

Figure 4.4: Sparse census transform neighborhoods for comparison. The dark pixels
are those included in Ŵ in Equation 4.1. (a) Full 7 × 7 census transform neighborhood. (b) Non-redundant, sparse census transform neighborhood. (c) Uniformly
distributed, 50% sparse census transform neighborhood.

Table 4.1: Sparse Neighborhood Correlation Accuracy Comparison

Dataset    Full 7 × 7    Non-Redundant    Uniform Sparse
Tsukuba    90.92         90.60            90.59
Venus      96.93         96.91            96.96
Teddy      87.03         87.03            87.04
Cones      87.69         87.71            87.76
Average    90.64         90.56            90.59
Rocks      90.74         90.74            90.78
Baby       92.31         92.26            92.35
Wood       89.31         89.29            89.35
Cloth      87.16         87.16            87.19
Average    89.88         89.86            89.92

As can be seen, all three census transform neighborhoods provide essentially the same level
of accuracy. Thus, the non-redundancy, by itself, has little value in this particular case, other than
for reducing the size of the census transform, something performed equally well by the redundant,
uniformly sparse neighborhood.
Although the uniformly distributed sparse neighborhood has the same number of points
as the non-redundant neighborhood, it effectively uses fewer unique pixel comparisons due to its
redundancy. Thus, we might expect the results to be different if a significant amount of noise were
added to the images, due simply to the smaller sample set of unique image pixel relationships
used in the computation of the similarity measure with the uniformly sparse neighborhood. By the
same reasoning, the non-redundant sparse transform uses nearly the same number of unique pixel
relationships as the full census transform. Thus, we might expect the non-redundant transform to
provide accuracy closer to that of the full census transform in the presence of noise.
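The argument about unique pixel relationships can be made concrete by counting them. The sketch below counts distinct unordered pixel pairs compared inside a 13 × 13 SHD window for three 7 × 7 neighborhoods. The non-redundant and uniformly sparse patterns used here (a lexicographic half and a checkerboard) are assumptions chosen for illustration; they have 24 points each but are not necessarily the patterns of Figure 4.4.

```python
def unique_relationships(offsets, m):
    """Count distinct unordered pixel pairs compared within an m x m SHD window."""
    pairs = set()
    for row in range(m):
        for col in range(m):
            for d_row, d_col in offsets:
                # A pair is the same relationship regardless of which pixel is the center.
                pairs.add(frozenset(((row, col), (row + d_row, col + d_col))))
    return len(pairs)


R = 3  # radius of a 7 x 7 neighborhood
full = [(i, j) for i in range(-R, R + 1) for j in range(-R, R + 1) if (i, j) != (0, 0)]
non_redundant = [offset for offset in full if offset > (0, 0)]      # assumed pattern
uniform = [(i, j) for (i, j) in full if (i + j) % 2 == 0]           # assumed checkerboard

for name, offsets in [("full", full), ("non-redundant", non_redundant), ("uniform", uniform)]:
    print(name, len(offsets), unique_relationships(offsets, 13))
```

Under these assumptions the full transform makes 8,112 comparisons but only 5,076 of them are unique, the non-redundant neighborhood keeps 4,056 unique relationships, and the redundant checkerboard keeps fewer still with the same 24 points.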
Table 4.2 shows the correlation accuracy when noise having a standard deviation equal to
2% of the pixel range is added to the images. This level of noise represents a moderate amount
that is readily visible when the image is viewed. As expected, the full census provides the best
results for each image dataset, followed closely by the non-redundant sparse neighborhood, then
the redundant sparse neighborhood. However, there is only a small variation in correlation accuracy between the three neighborhoods because they all use a sufficient number of points for this
moderate level of noise. The uniform sparse neighborhood, with the smallest number of pixel relationships being sampled, performs only slightly worse than the other neighborhoods. Even when
the noise is increased to high levels, such as a 4% standard deviation, the difference in correlation
accuracy between the non-redundant and redundant transforms is still less than one half of one
percent on the first four images.
The relatively poor performance of the Wood images when noise is added deserves some
explanation. The Wood images are unique among the test images in that there is relatively little
well-defined texture visible in the images from the grain of the wood. As a result, the noise
becomes more dominant compared to the wood texture, making pixel identification very difficult.
However, even with these images, where noise level dramatically affects correlation accuracy, there
is relatively little difference between the performance of the three neighborhoods.
As we will see in Section 5.1, non-redundancy becomes more important as the census
transform neighborhood becomes increasingly sparse and the images become more noisy. Yet,
from this example we can conclude that, for typical image and transform sizes, non-redundancy in
a sparse transform has only a small effect on correlation accuracy.

Table 4.2: Sparse Neighborhood Correlation Accuracy Comparison with 2% Noise

Dataset    Full     Non-Redundant    Uniform Sparse
Tsukuba    82.83    82.70            82.36
Venus      81.11    81.10            80.43
Teddy      77.44    77.34            76.73
Cones      85.93    85.91            85.86
Average    81.83    81.76            81.34
Rocks      88.68    88.62            88.55
Baby       79.73    79.49            78.96
Wood       57.65    57.49            56.72
Cloth      84.71    84.70            84.43
Average    77.70    77.57            77.16

4.1.4 Neighborhood Selection

When selecting the points that make up a census transform neighborhood, it would seem
best to make the transform neighborhood as dense as possible and to distribute the points of the
neighborhood evenly in order to maximize the accuracy of the subsequent correlation step. This
has been the assumption of the stereo vision community and, as a result, the non-redundant census
transform neighborhood is generally viewed as a good “approximation” of the full census transform. In reality, some points in the transform neighborhood are much more valuable than others
because they contribute more to the accuracy of the stereo matching. On the other hand, the inclusion of some other points can actually hurt correlation accuracy, particularly when the images are
noisy. By carefully choosing the census transform window, we can further reduce the size of the
census transform vector, without loss in correlation accuracy.
In order to determine which points of the census transform neighborhood should be included, we need to gain an understanding of how much each point in the transform neighborhood
contributes to the subsequent correlation accuracy. To make this determination, I computed the
correlation accuracy using a single point census transform followed by 13 × 13 SHD, the best performing SHD window size for the census method on the first four datasets (i.e., the training set).
For each test a different point was used in the census transform neighborhood and the correlation
accuracy was recorded, until all points within the 13 × 13 window had been tested.
The results of these correlation tests are shown graphically in Figure 4.5. In the images
of this figure, the brightness of each pixel corresponds to the correlation accuracy obtained by
using that point in the census transform neighborhood. The brighter the pixel, the more the pixel
contributes to the correlation accuracy. The brightness of each image has been normalized to better
illustrate the relative difference between pixels.
Two important observations can be made from these images in regards to choosing a census
transform neighborhood. First, the further away from the center a transform neighborhood point
is, the less it contributes to the correlation accuracy. This is quite logical, since distant points are
more likely to overlap a region of different disparity or be inconsistent with the corresponding
point in the other image due to image distortion. Second, and less obvious, we see that points near
the center of the transform window often do not contribute positively to the correlation accuracy.
This is particularly true in the Tsukuba and Venus images, and less so in the other images.
This lack of importance for the pixels near the center can be explained by the noise and
measurement imprecision that is inherent in all digitized images. Recall that the census transform
compares the intensity of the center pixel to the other pixels in the neighborhood of the center
pixel and only stores a single bit representing whether the center pixel is greater in magnitude or
not. In most real images, adjacent or nearby pixels tend to be very similar in intensity. As a result,
even a small amount of noise can change the value of the census transform bit representing the
relationship and make the magnitude relationship between a nearby pair of pixels different in one
camera’s image compared to the same pair of pixels in the other camera’s image.
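This sensitivity can be illustrated with a small Monte Carlo sketch. It assumes an 8-bit pixel range (0 to 255), independent Gaussian noise in each camera, and hypothetical intensity values. None of these details come from the experiments reported here; they only show why a small intensity difference makes the census bit unreliable.

```python
import random


def flip_rate(intensity_a, intensity_b, noise_std, trials=20000, seed=0):
    """Fraction of trials in which the comparison bit differs between two noisy views."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(trials):
        # Each camera sees the same two pixels, each with independent noise.
        left_bit = intensity_a + rng.gauss(0, noise_std) > intensity_b + rng.gauss(0, noise_std)
        right_bit = intensity_a + rng.gauss(0, noise_std) > intensity_b + rng.gauss(0, noise_std)
        if left_bit != right_bit:
            flips += 1
    return flips / trials


NOISE_STD = 0.02 * 255  # 2% of an assumed 8-bit pixel range

print(flip_rate(100, 102, NOISE_STD))  # similar intensities: the bit often disagrees
print(flip_rate(100, 140, NOISE_STD))  # distinct intensities: the bit almost never disagrees
```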
The effect becomes more dramatic if we perform the same tests on images with an increased
amount of noise. Figure 4.6 shows the results of the same tests when performed on versions of the
images with noise added having a standard deviation that is 2% of the pixel range. Notice that the
dark area in the center has expanded, indicating that the points near the center of the neighborhood
are even less valuable on noisy images.
Based on these findings, we can better choose a sparse census transform neighborhood so
as to maximize the contribution made by each point that we include in the neighborhood. For the
highest correlation accuracy, it is undesirable to include the points near the center of the transform
neighborhood (e.g., the 3 × 3 center for images with little noise, or a larger region for noisier images) since they
contribute little to accurate correlation. It is also undesirable to include points that are too far from
the center. In Section B.3.6 it was found that the 7 × 7 census transform neighborhood led to the
highest correlation accuracy. It can also be seen in Figure 4.5 that the majority of the bright pixels
fit within a 7 × 7 window. Thus, for the four image datasets, points beyond 7 × 7 need not be
included in the census transform neighborhood.

Figure 4.5: Contribution of census transform points in a 25 × 25 neighborhood. The
brighter pixels contribute more to the correlation accuracy than the darker pixels.
Panels: (a) Tsukuba, (b) Venus, (c) Teddy, (d) Cones, (e) Average.

Figure 4.6: Contribution of census transform points in a 37 × 37 neighborhood
for noisy images. The brighter pixels contribute more to the correlation accuracy than the darker pixels.
Panels: (a) Tsukuba, (b) Venus, (c) Teddy, (d) Cones, (e) Average.

For noisier images, we can modify the points of a census transform neighborhood in two
ways to improve correlation accuracy. First, we can increase the number of unique image pixel
relationships that are used in the computation of the SHD similarity measure by arranging the
points of the transform neighborhood so that there are fewer redundancies. Second, we can expand
the distribution of points in the neighborhood to distance them from the center of the transform
window, making them less susceptible to noise, as suggested by Figure 4.6.
To test the accuracy of the sparse census transform, I have created several sparse neighborhoods, each with a different number of points, guided by the principles identified in this section.
Figure 4.7 shows the neighborhoods chosen for images with only a small amount of noise. These
neighborhoods were chosen as examples because they tend to lead to the highest correlation accuracy on the four standard test image datasets.
Figure 4.8 shows an alternate set of census transform neighborhoods that have been optimized for noisier images. To be more effective, the neighborhood points are spread further from
the center, as suggested by Figure 4.6. Also, non-redundant points were chosen to increase each
transform’s robustness to image noise.

4.1.5 Sparse Census Transform Correlation Accuracy

This section will employ the census transform neighborhoods proposed in Section 4.1.4 to
perform stereo correlation in order to evaluate their performance. The results for the training set of
images will be shown (Tsukuba, Venus, Teddy, and Cones), as well as the results of the evaluation
set (Rocks, Baby, Wood, and Cloth).
Table 4.3 shows the correlation accuracy using the sparse neighborhoods of Figure 4.7
when combined with 13 × 13 SHD, which provides the highest accuracy for the full 48-point
(7 × 7) census transform. The correlation accuracy when using the full 48-point census transform
is also shown as a baseline for comparison.

Figure 4.7: Sparse census transform neighborhoods.
Panels: (a) 16-Point, (b) 12-Point, (c) 8-Point, (d) 4-Point, (e) 2-Point, (f) 1-Point.

The important thing to note here is that the sparse versions, from the 16-point version
down to around the 4-point version, tend to perform as well as or better than the full
48-point census transform on average, despite the drastically reduced size of the census transform vector. It
may seem surprising that using a sparser transform neighborhood often slightly improves
correlation accuracy. This is due to the removal of points near the center and toward the edges of
the transform window. These inner and outer points, if present, can actually decrease correlation
accuracy because of their inconsistency between stereo images due to image noise and image depth
discontinuities, respectively.
The correlation accuracy for the sparsest neighborhoods, such as the 2-point and 1-point
neighborhoods, can be marginally improved if we enlarge the correlation window beyond 13 × 13
(e.g., to 89.92% correlation accuracy on average for the 1-point neighborhood with a 17 × 17 SHD
correlation window on the training dataset).

Figure 4.8: Sparse census transform neighborhoods optimized for noisy images.
Panels: (a) 16-Point, (b) 12-Point, (c) 8-Point, (d) 4-Point, (e) 2-Point, (f) 1-Point.

Table 4.3: Sparse Census Transform (Figure 4.7), Correlation
Accuracy Using 13 × 13 SHD

Dataset    Full 7 × 7    16-Point    12-Point    8-Point    4-Point    2-Point    1-Point
Tsukuba    90.92         91.01       90.53       91.89      91.81      90.74      89.22
Venus      96.93         97.05       96.99       96.94      96.90      96.63      95.43
Teddy      87.03         87.19       87.17       87.13      87.16      86.23      85.65
Cones      87.69         88.02       88.06       87.86      88.26      86.97      86.77
Average    90.64         90.82       90.69       90.96      91.03      90.14      89.27
Rocks      90.74         90.84       90.80       90.80      90.89      90.60      90.22
Baby       92.31         92.42       92.39       92.25      92.37      90.79      89.77
Wood       89.31         89.52       89.45       89.37      89.49      88.79      85.19
Cloth      87.16         87.29       87.29       87.13      87.10      85.85      85.16
Average    89.88         90.02       89.98       89.88      89.96      89.01      87.58

Table 4.4: Sparse Census Transform (Figure 4.7), Correlation Accuracy on
Images with 1% Noise Using 13 × 13 SHD

Dataset    Full 7 × 7    16-Point    12-Point    8-Point    4-Point    2-Point    1-Point
Tsukuba    87.46         87.36       86.48       87.19      85.84      84.07      82.18
Venus      89.50         88.96       88.26       87.73      86.04      83.92      81.38
Teddy      84.66         84.35       83.89       84.23      82.47      80.30      78.77
Cones      87.04         87.31       87.33       87.19      87.37      86.11      85.53
Average    87.16         86.99       86.49       86.58      85.43      83.60      81.97
Rocks      90.23         90.26       90.22       90.09      89.88      88.93      88.01
Baby       89.17         88.29       87.31       87.30      84.50      79.03      75.41
Wood       72.31         71.10       69.79       69.27      65.61      56.26      53.84
Cloth      86.71         86.61       86.47       86.30      85.63      83.53      82.30
Average    84.61         84.06       83.45       83.24      81.41      76.94      74.89

The disadvantage of the sparsest neighborhoods becomes apparent when noise is added
to the stereo images. Table 4.4 shows the results of correlation accuracy for the same transform
neighborhoods (Figure 4.7) when noise with a 1% standard deviation is added to the images.
Some of the neighborhoods that performed as well as or better than the full 48-point transform neighborhood now perform noticeably worse, with the sparser neighborhoods being the most
affected by the increase in noise. This degradation in correlation accuracy is to be expected because of the reduced number of unique pixel relationships that are used in each similarity measure
computation, which is due to the sparseness of the transform neighborhoods. Wood, as expected,
suffers the most with noisy images because the relatively small amount of texture is easily overcome by noise.
We can improve the correlation accuracy on noisy images by using the transform neighborhoods of Figure 4.8, which have been optimized for noisier images by employing non-redundant
neighborhood points and by spreading the neighborhood points further from the center of the transform. The correlation accuracy using these revised census transform neighborhoods is shown in
Table 4.5. Here, the full 9 × 9 census transform is shown as a baseline for comparison since it is a
more appropriate size than the 7 × 7 for this level of noise.

Table 4.5: Revised Sparse Census Transform (Figure 4.8), Correlation
Accuracy on Images with 1% Noise Using 13 × 13 SHD

Dataset    Full 9 × 9    16-Point    12-Point    8-Point    4-Point    2-Point    1-Point
Tsukuba    88.85         89.11       89.14       88.30      88.04      87.27      84.61
Venus      90.53         90.26       90.23       90.08      89.05      88.25      85.28
Teddy      84.80         84.60       84.59       84.69      84.24      83.03      80.13
Cones      86.64         86.38       86.41       86.60      86.75      85.15      83.71
Average    87.71         87.59       87.59       87.42      87.02      85.92      83.43
Rocks      90.16         89.96       90.00       89.98      90.04      89.37      88.03
Baby       89.87         89.62       89.67       89.36      88.58      85.61      80.95
Wood       73.35         73.74       73.36       72.05      70.63      68.49      61.88
Cloth      86.58         86.29       86.31       86.49      86.35      84.87      82.98
Average    84.99         84.90       84.83       84.47      83.90      82.09      78.46

The results of Table 4.5 show that the revised transform neighborhood tends to perform
better than the original transforms on every image dataset except Cones. Cones is a notable exception here because correlation accuracy is highest on this dataset with a smaller census transform
spread, as is shown in Section B.3.6. If we move the neighborhood points closer to the center,
the correlation accuracy of Cones is improved. For example, moving each of the points in the
2-point neighborhood closer to the center by one pixel improves correlation accuracy on Cones
from 85.15% to 86.13%.
More importantly, even in the presence of image noise, several of the sparse census transform neighborhoods tend to perform nearly as well as the full census transform if properly designed for the amount of noise expected in the images. Therefore we can achieve nearly the same
correlation accuracy while dramatically decreasing the hardware costs of implementation.
The optimal sparse census transform neighborhood is dependent on the amount of noise
found in the images. However, it should be noted that the optimal parameters for any stereo method
will depend on the noise characteristics of the images. Since image noise is largely a function of the
camera settings and lighting conditions, one way of dealing with this difficulty is to use a different
census transform neighborhood depending on the amount of light detected by the camera. This
kind of information can be inferred, for example, by reading the digital image sensor’s register
settings related to its automatic exposure control, which report such things as exposure time, pixel
gain, and mean pixel brightness. An adaptive census transform, if implemented in the architecture
of Figure 6.1 using a fixed number of neighborhood points, would only affect the transform blocks,
which make up a relatively small amount of the hardware resources required by the complete
system for typical values of the disparity search range (d).

Table 4.6: Revised Sparse Census Transform (Figure 4.8), Correlation
Accuracy on Noiseless Images Using 13 × 13 SHD

Dataset    Full 9 × 9    16-Point    12-Point    8-Point    4-Point    2-Point    1-Point
Tsukuba    91.35         91.68       91.74       91.49      91.51      91.16      89.51
Venus      96.85         96.73       96.77       96.77      96.86      96.52      95.73
Teddy      86.93         86.76       86.73       86.83      86.74      86.12      85.09
Cones      87.23         86.94       86.95       87.15      87.34      85.59      84.42
Average    90.59         90.53       90.55       90.56      90.61      89.84      88.69
Rocks      90.63         90.44       90.41       90.51      90.52      90.11      89.05
Baby       92.22         91.89       91.96       91.95      92.06      90.15      88.85
Wood       89.19         89.04       89.13       89.18      89.53      89.45      87.39
Cloth      86.91         86.68       86.67       86.84      86.86      85.71      84.56
Average    89.74         89.51       89.54       89.62      89.74      88.86      87.46

A simpler alternative is to choose a sparse neighborhood that provides a good compromise
of characteristics for a variety of noise conditions. For example, although not optimized for such
images, the more spread neighborhoods of Figure 4.8 perform nearly as well as the neighborhoods
of Figure 4.7 on images with little noise. For comparison, Table 4.6 shows the correlation accuracy
of the revised transform neighborhoods when used on the original dataset images with no noise
added. From the table we can calculate that, on average, the correlation accuracy was reduced
by less than 0.4% compared to Table 4.3 when using the revised neighborhoods on the noiseless
images. Thus, a reasonable compromise is to use a neighborhood that is optimized for a moderate
amount of noise that is typical of the expected camera characteristics and lighting conditions.
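One way to reproduce this kind of calculation is sketched below. It assumes the comparison averages the "Average" rows of the six sparse columns (16-point through 1-point) in Tables 4.3 and 4.6 over both the training and evaluation sets. The exact averaging used to arrive at the quoted figure is not stated, so this is only one plausible reading.

```python
# "Average" rows for the 16-, 12-, 8-, 4-, 2-, and 1-point columns:
# training set first, then evaluation set.
table_4_3 = [90.82, 90.69, 90.96, 91.03, 90.14, 89.27,
             90.02, 89.98, 89.88, 89.96, 89.01, 87.58]
table_4_6 = [90.53, 90.55, 90.56, 90.61, 89.84, 88.69,
             89.51, 89.54, 89.62, 89.74, 88.86, 87.46]

# Drop in accuracy when the noise-optimized neighborhoods are used on noiseless images.
drops = [original - revised for original, revised in zip(table_4_3, table_4_6)]
mean_drop = sum(drops) / len(drops)

print(f"{mean_drop:.2f}")  # 0.32
```

Under this reading the mean reduction is about 0.32 percentage points, consistent with the stated bound.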
The examples presented here show that the size of the census transform vector can be
dramatically reduced, leading directly to significant hardware savings with little or no loss in correlation accuracy. For example, the 8-point sparse transform performs essentially as well as the
full 48-point transform, despite using only 1/6th the number of bits. For implementations requiring
the smallest amount of hardware, the 2-point neighborhood still performs better than almost all of

the other stereo methods discussed in Appendix B. Thus, the sparse census transforms proposed
here are extremely valuable in minimizing hardware costs.

4.2 The Sparse Rank Transform

4.2.1 Motivation and Definition

If we compare the equations for the rank and census transforms, as given in Equations 2.1 and 2.3,
respectively, it is clear that the rank transform of a pixel is simply the population count of
the census transform vector (i.e., the number of bits that are set). The similarity of the rank transform allows it to benefit from sparse representations in much the same way as the sparse census
transform. That is, we can reduce the hardware requirements with minimal impact on correlation
accuracy for most images. The concept of a sparse rank transform has not been described in the
literature.
To create a sparse version of the rank transform, we simply replace W(p) in Equation 2.1
(the N × N window surrounding pixel p) with Ŵ(p), a sparsely populated neighborhood about
pixel p.

\[ C(p) = \sum_{p' \in \hat{W}(p)} \xi(p, p'). \tag{4.3} \]

Again, ξ(p, p′) is defined to be 1 if I(p) > I(p′) and 0 otherwise, where I(p) represents the intensity
of pixel p.
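The relationship between Equations 4.1 and 4.3 can be checked with a short Python sketch. The image, offsets, and function names are hypothetical; the census bit order (first offset as most significant bit) is an assumption that does not affect the population count.

```python
def sparse_census(image, row, col, offsets):
    """Concatenate one comparison bit per neighborhood offset (Equation 4.1)."""
    center = image[row][col]
    vector = 0
    for d_row, d_col in offsets:
        vector = (vector << 1) | (1 if center > image[row + d_row][col + d_col] else 0)
    return vector


def sparse_rank(image, row, col, offsets):
    """Sum the same comparison bits instead of concatenating them (Equation 4.3)."""
    center = image[row][col]
    return sum(1 for d_row, d_col in offsets if center > image[row + d_row][col + d_col])


image = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

census = sparse_census(image, 1, 1, offsets)
rank = sparse_rank(image, 1, 1, offsets)
print(format(census, "08b"), rank)  # 11110000 4
```

Because the rank is the number of set bits in the census vector, any sparse neighborhood defined for one transform applies directly to the other.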
The rank transform of a pixel is a scalar value representing the “rank” of a pixel’s intensity
among its neighbors. As a result, it does not require the Hamming distance measure to determine
pixel dissimilarity but can rely instead on the more conventional absolute difference. Hence the
use of SAD instead of SHD in the subsequent correlation step.
Thus, there are two key changes that differentiate the rank stereo method from the census stereo method. First, there is the aggregation of the census vector into a single scalar value.
Second, there is the use of SAD instead of SHD for the similarity measure. The former of these
two differences leads directly to hardware savings in a custom implementation of the rank stereo
method, as compared to the census method.

Given an n-point neighborhood (i.e., an n-bit census transform vector), the rank transform
requires a ⌈log2(n + 1)⌉-bit scalar to be represented. Thus, the rank method can use fewer bits than
the census method to represent the transform of a pixel. As discussed in Section 4.1.1, the reduced
number of bits leads directly to hardware savings throughout the stereo correlation architecture.
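The bit widths implied by this formula can be tabulated with a short sketch. The function name is an assumption; the neighborhood sizes are those that appear in this chapter, plus two sizes just below a power of two for contrast.

```python
import math


def rank_bits(n: int) -> int:
    """Bits needed for the rank transform of an n-point neighborhood (values 0..n)."""
    return math.ceil(math.log2(n + 1))


for n in (48, 24, 16, 15, 12, 8, 7, 4, 2, 1):
    print(f"{n:2d}-point: census {n:2d} bits, rank {rank_bits(n)} bits")
```

For positive n this equals Python's `n.bit_length()`. Note that a neighborhood with exactly a power-of-two number of points (such as 8 or 16) needs one more bit than a neighborhood with one point fewer.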
Unfortunately, the change from Hamming distance (SHD) to absolute difference (SAD) in the
similarity modules does not, by itself, make a significant difference in the hardware requirements
because, on an FPGA, the absolute difference and Hamming distance computations require roughly
the same amount of hardware for the same input width, based on experiments performed in VHDL
using the Xilinx Synthesis Tool.
From a hardware implementation point of view, we can look at the difference between the
census stereo method and a rank stereo method as a redistribution of hardware in Figure 6.1. Essentially, to convert a census stereo implementation into a rank stereo implementation we move
the population count hardware in each similarity module (used for the Hamming distance) to the
preprocessing blocks where it is instead used to complete the rank transform. Thus, we replace
d population count blocks with two. Each of the remaining d XOR blocks (the other component of the Hamming distance computation), one in each similarity module, is then replaced by a
narrower absolute difference block. Thus, the amount of savings in the rank implementation, as
compared to the census implementation, will depend on the disparity search range (d) of the stereo
implementation as well as the number of points in the transform neighborhood.
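The redistribution described above can be summarized as a simple bookkeeping sketch. It only tallies the blocks named in the text for a given neighborhood size n and disparity search range d. The function name, dictionary keys, and the example value d = 64 are hypothetical, and the tally is not a resource estimate for any particular FPGA.

```python
import math


def block_counts(method: str, n: int, d: int) -> dict:
    """Tally the blocks named in the text for an n-point neighborhood and range d."""
    if method == "census":
        # Each of the d similarity modules holds an n-bit XOR and a population count.
        return {"population_count": d, "xor": d, "abs_diff": 0, "pixel_bits": n}
    if method == "rank":
        # Population counts move to the two transform (preprocessing) blocks;
        # each similarity module uses a narrower absolute difference instead of XOR.
        return {"population_count": 2, "xor": 0, "abs_diff": d,
                "pixel_bits": math.ceil(math.log2(n + 1))}
    raise ValueError(f"unknown method: {method}")


print(block_counts("census", n=48, d=64))
print(block_counts("rank", n=48, d=64))
```

The tally makes the dependence on both d and n visible: the number of population count blocks falls from d to two, and the data path narrows from n bits to ⌈log2(n + 1)⌉ bits.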
One disadvantage of the rank transform method is that it tends to require a larger correlation
window than the census transform method to achieve its highest correlation accuracy. For example,
in Section B.2.4, it is shown that the optimal SAD correlation window size for the rank method
was 17 × 17, compared to 13 × 13 for SHD in the census method. Thus, the rank method can
lead to reduced logic requirements for computation but may require more memory in the similarity
modules to achieve the highest levels of correlation accuracy. The amount of memory will depend
on the particular summing optimization employed by the implementation (Section A.7) as well as
the correlation window size and the image width.
Unfortunately, as we decrease the number of points in the neighborhood (n), the hardware
savings brought about by the sparse version of the rank transform are not as dramatic as they
were for the census transform. For the census method, the size of the hardware was directly
related to the number of points in the transform neighborhood. For the rank method, the hardware
requirements are more closely tied to ⌈log2(n + 1)⌉. For example, a 50% reduction in the size of the
neighborhood results in only a 1-bit savings in the size of the rank transformed pixel. Nevertheless,
the resource savings provided by the sparse rank stereo method, when compared to the original rank
method, are still significant, with little loss in correlation accuracy.

Table 4.7: Sparse Rank Transform (Figure 4.7), Correlation
Accuracy Using 17 × 17 SAD

Dataset    Full 7 × 7    16-Point    12-Point    8-Point    4-Point    2-Point    1-Point
Tsukuba    90.80         91.05       90.88       92.14      91.52      90.71      91.12
Venus      96.79         97.08       96.97       96.79      97.18      96.82      96.54
Teddy      86.36         86.70       86.66       86.61      86.44      85.05      85.67
Cones      86.90         87.38       87.49       87.21      87.76      86.30      86.33
Average    90.21         90.56       90.50       90.69      90.73      89.72      89.92
Rocks      89.63         89.69       89.71       89.76      89.65      89.08      89.61
Baby       91.37         91.72       91.54       91.74      91.39      90.07      91.00
Wood       88.18         88.33       88.42       88.29      87.99      86.56      86.97
Cloth      86.10         86.00       86.02       86.10      85.13      83.08      84.86
Average    88.82         88.94       88.92       88.97      88.54      87.20      88.11

4.2.2 Sparse Rank Transform Correlation Accuracy

Table 4.7 reports the correlation accuracy for the sparse rank transform using the same
neighborhoods that were used for the census transform, shown in Figure 4.7, when combined with
17 × 17 SAD correlation. These neighborhoods work equally well for the rank transform because
the two transforms are fundamentally so similar. In fact, it should be noted that the 1-point rank
transform is equivalent to the 1-point census transform, although the correlation accuracy result
reported here is slightly higher than that which was reported for the 1-point census transform due
to the increased correlation window size used for the rank method.
As with the census transform, we see that many of the sparse rank versions tend to perform
slightly better than the traditional rank transform, since the selected neighborhoods eliminate many
less meaningful or inaccurate points. For example, the 16, 12, and 8-point neighborhoods perform
as well as or better than the standard 48-point neighborhood for every image dataset tested and the
4-point is only slightly worse for two of the eight datasets.

Table 4.8: Sparse Rank Transform (Figure 4.7), Correlation Accuracy on
Images with 1% Noise Using 17 × 17 SAD

Dataset    Full 7 × 7   16-Point   12-Point   8-Point   4-Point   2-Point   1-Point
Tsukuba    87.26        87.60      87.26      87.44     84.18     84.30     85.56
Venus      89.50        89.24      89.08      88.41     87.04     85.59     86.33
Teddy      83.29        83.11      82.77      82.40     80.14     78.51     81.45
Cones      86.12        86.47      86.52      86.23     86.29     85.02     85.71
Average    86.54        86.60      86.41      86.12     84.41     83.36     84.76
Rocks      89.01        89.02      88.99      89.04     88.25     85.99     88.57
Baby       86.66        84.93      83.20      83.94     77.29     75.55     83.37
Wood       70.47        67.95      67.42      66.87     60.12     48.18     57.40
Cloth      85.05        84.28      83.79      83.89     81.32     78.95     83.40
Average    82.80        81.54      80.85      80.94     76.75     72.17     78.19

As an example of hardware savings and correlation accuracy improvement, consider the
8-point neighborhood. This neighborhood requires 4 bits to represent the rank transform. The
standard rank transform reaches its peak accuracy using a 7 × 7, or 48-point, neighborhood, which
requires 6 bits to represent. As a result, the full rank transform requires a 50% higher bit width
compared to the 8-point sparse neighborhood while actually providing worse correlation accuracy.
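The bit widths quoted above follow from the range of the rank value. As a rough sketch, assuming the rank is stored as a plain binary count between 0 and the number of neighborhood points, the required width is the bit length of that count. The function names here are hypothetical.

```python
def rank_bits(num_points):
    """Bits needed for a rank value, which can range from 0 to num_points."""
    return num_points.bit_length()


def census_bits(num_points):
    """The census transform stores one comparison bit per neighborhood point."""
    return num_points


for n in (48, 16, 12, 8, 4, 2, 1):
    print(f"{n:2d}-point: rank {rank_bits(n)} bits, census {census_bits(n)} bits")

# Width of the full 48-point rank relative to the 8-point sparse rank.
full, sparse = rank_bits(48), rank_bits(8)
print(f"full/sparse rank width ratio: {full / sparse:.2f}")
```

Because the rank width grows only logarithmically with the number of points, the savings from a sparse neighborhood are smaller for the rank transform than for the census transform, whose width grows linearly.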
As with any stereo vision method, increased levels of noise decrease the correlation accuracy achieved by the sparse rank transform method. As with the sparse census method, we can
improve the robustness of the sparse rank transform by carefully designing the transform neighborhood used.
Table 4.8 shows the correlation accuracy of the sparse rank transform in the presence of
image noise with a standard deviation equal to 1% of the pixel range. Table 4.9 shows the correlation accuracy using the revised transform neighborhoods of Figure 4.8, which were optimized for
increased levels of noise.

Table 4.9: Revised Sparse Rank Transform (Figure 4.8), Correlation Accuracy
on Images with 1% Noise Using 17 × 17 SAD

Dataset    Full 9 × 9   16-Point   12-Point   8-Point   4-Point   2-Point   1-Point
Tsukuba    88.79        89.54      89.50      88.03     87.42     88.44     87.75
Venus      90.03        90.36      90.14      89.95     89.68     89.44     89.55
Teddy      83.60        83.55      83.41      83.23     82.65     82.40     82.30
Cones      85.63        85.22      85.30      85.53     86.14     84.20     83.63
Average    87.01        87.17      87.09      86.69     86.47     86.12     85.81
Rocks      88.95        88.94      88.90      88.97     88.79     88.90     88.37
Baby       88.27        88.65      88.67      88.57     87.34     86.82     86.28
Wood       71.96        72.28      71.57      70.24     66.16     67.18     65.44
Cloth      85.26        85.25      85.17      85.29     84.72     83.89     83.82
Average    83.61        83.78      83.58      83.27     81.76     81.70     80.98

Although the census transform neighborhoods work very well for the rank transform, they
are suboptimal in the sense that most of them have a number of points that is a power of two. A
better trade-off between correlation accuracy and hardware resource requirements can be achieved
with the sparse rank transform when the number of points in the neighborhood is one less than a
power of two. For example, the rank transform using an 8-point neighborhood requires 4 bits to
be represented. Reducing the transform to a 7-point neighborhood would require only a 3-bit rank
transform representation. Similarly, the rank transform using the 16 and 4-point neighborhoods
can each be reduced by one bit if a single point of each neighborhood is discarded.
Table 4.10 shows the correlation accuracy for three of the sparse neighborhoods of Figure 4.7 when the top point of the rightmost column is removed from each neighborhood. This
reduces the number of points in these neighborhoods to be one less than a power of two, allowing
a one-bit reduction in the rank transform representation. Clearly we could design new neighborhoods that have more symmetry, but using these neighborhoods as an example demonstrates the
effect of losing a single point. By comparing the results with Table 4.7, we see that for all three
neighborhoods, the average correlation accuracy is not degraded by the removal of one point distant from the center. Similar modifications can be made to the transforms optimized for noisy
images to reduce the bit-width of the rank transform representation with little effect on correlation
accuracy.

Table 4.10: Bit-Optimized Sparse Rank Transform Correlation
Accuracy Using 17 × 17 SAD

Dataset    Full 7 × 7   15-Point   7-Point   3-Point
Tsukuba    90.80        91.15      92.38     91.28
Venus      96.79        97.05      96.83     97.07
Teddy      86.36        86.76      86.66     86.64
Cones      86.90        87.45      87.10     88.20
Average    90.21        90.60      90.74     90.80
Rocks      89.63        89.63      89.73     89.82
Baby       91.37        91.85      91.92     92.20
Wood       88.18        88.34      88.30     88.46
Cloth      86.10        86.06      85.97     85.89
Average    88.82        88.97      88.98     89.09

4.3 Qualitative Performance Analysis

4.3.1 Methodology

The Middlebury stereo image datasets [51] are generally accepted as the standard benchmark
for quantitative stereo vision performance analysis. This collection includes a large number
of stereo images, including ground-truth disparity data, making for very easy evaluation of experimental algorithms. However, in some respects, these images may not be representative of
the challenges that a real stereo vision system must deal with. For example, the images in the
Middlebury data set were carefully taken with high-resolution, high-quality cameras. Lighting
was carefully controlled. The scenes were carefully chosen and assembled by hand. The camera
calibration and image rectification were also done with near-perfect accuracy.
In this section, we will test the algorithms presented in this chapter on a set of four stereo
image pairs taken under real-world conditions. Because all of the test images used thus far have
been indoor scenes, this section uses outdoor scenes. All pictures have been taken using a pair of
Micron MT9V403, monochrome, VGA image sensors with a global shutter. The cameras were
arranged in the canonical configuration with a 15-cm baseline. However, since the alignment was
done by hand with relatively little effort, there is significant image sensor misalignment. Rectified
versions of the stereo image pairs were produced using the method of [60].

Figure 4.9 shows the left image for each of the four image pairs that will be used to evaluate
the algorithms. These pictures were taken in a neighborhood setting on a very sunny, cloudless day,
creating intense highlights and dark shadows in some of the scenes.

Figure 4.9: Real-world, outdoor test images. These are the original unrectified images, as
output by the left camera. Panels: (a) “Pole”, (b) “Hydrant”, (c) “Spillway”, (d) “Driveway”.

Because there is no ground truth data for these images, the resulting disparity images are
presented for qualitative comparison only. All the resulting disparity images are normalized to
scale the pixel values to a range that makes them visible for display. In order to make the comparison easier, all stereo algorithms tested also employ the left-right consistency check (Section 2.4)
to remove some of the noise in the disparity maps that otherwise makes it more difficult to qualitatively
compare the images. Virtually all real-world local stereo implementations will employ
such filters to remove or ignore disparity pixels which are obviously incorrect due to occlusion or
the aperture problem. No additional filters will be used, although many have been proposed in the
literature.
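As a rough illustration of the kind of filtering involved, the sketch below applies one common formulation of the left-right consistency check to a single row of disparities. It assumes that a left-image pixel at column x with disparity d corresponds to the right-image pixel at column x − d, and that a pixel is rejected when the two disparity maps disagree by more than a tolerance; the details of Section 2.4 may differ. The function name, the `tolerance` parameter, and the use of `None` for rejected pixels are hypothetical choices.

```python
def left_right_check(disp_left, disp_right, tolerance=0):
    """Return the left disparity row with inconsistent pixels replaced by None."""
    checked = []
    for x, d in enumerate(disp_left):
        xr = x - d  # column of the matching pixel in the right image
        if 0 <= xr < len(disp_right) and abs(disp_right[xr] - d) <= tolerance:
            checked.append(d)
        else:
            checked.append(None)  # rejected; displayed as black in a disparity image
    return checked


# Hypothetical disparity rows from a left-referenced and right-referenced search.
disp_left = [0, 1, 1, 2, 2, 2]
disp_right = [0, 1, 2, 2, 9, 9]
print(left_right_check(disp_left, disp_right))
```

Pixels that fail the check are typically those near occlusions, where the left and right searches settle on different matches.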
We can qualitatively compare disparity maps from different algorithms by looking for and
comparing specific disparity image characteristics, such as the following.
• Incorrect disparities. This includes large areas where the disparity is clearly incorrect. These areas usually stand out because they have a significantly different intensity than
the surrounding areas, despite a lack of real objects in the original image at those locations.
• Missing disparities. This includes areas where the consistency check rejected incorrect
disparities. These areas appear as black areas in the disparity images shown. An algorithm
that results in a lot of disparity pixels being rejected is inferior to one that provides the correct
disparities.
• Smoothness. Planar objects should transition smoothly from one disparity level to the next
as the object gets further from the camera. As a result, smooth or flat objects, when correctly
handled by the stereo algorithm, should appear relatively smooth in the disparity image.
• Noise. Occasionally with local methods, we find very small spots or even single pixels where
the disparity is incorrect or has been rejected. Large amounts of such noise are undesirable,
although in some cases, such as when there is relatively little noise, these areas can be
corrected by means of post-processing filters.

4.3.2 Accuracy Without Rectification

Image rectification is an important step for most local stereo methods. This step converts a
two-dimensional search for each corresponding pixel to a one-dimensional search, vastly decreasing the computational requirements of correlation. Rectification is generally assumed for most
stereo vision methods. The methods proposed in this chapter are no exception.
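To make the one-dimensional search concrete, the following sketch finds the disparity of a single pixel by evaluating a SAD cost only along the same row of the right image. It is a simplified software illustration, not the hardware method of this work; the function name, the window half-size parameter, and the synthetic images are hypothetical.

```python
def sad_disparity_row(left, right, y, x, half, max_disp):
    """Best disparity for left pixel (y, x) using a (2*half+1)^2 SAD window,
    searching only along row y of the right image."""
    best_d, best_cost = 0, None
    for d in range(max_disp + 1):
        if x - d - half < 0:
            break  # candidate window would fall off the left edge of the image
        cost = 0
        for dy in range(-half, half + 1):
            for dx in range(-half, half + 1):
                cost += abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic rectified pair: the left image is the right image shifted by 2 pixels.
right = [[x * x + y for x in range(8)] for y in range(5)]
left = [[row[x - 2] if x >= 2 else 0 for x in range(8)] for row in right]

print(sad_disparity_row(left, right, y=2, x=5, half=1, max_disp=3))
```

Without rectification, the inner search would also have to range over candidate rows, multiplying the number of windows evaluated per pixel.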

Although it is possible to carefully align cameras to compensate for their rotation and offset
so that a one-dimensional search can be used, it is very difficult to do so precisely. Additionally,
this does not compensate for the distortion caused by the camera lenses and sensor misalignment.
Figure 4.10 shows the disparity images that result when using the sparse census stereo
method with the 16-bit neighborhood of Figure 4.7(a) on the Pole image pair with various levels
of correction. Figure 4.10(a) shows the result of correlation on the original, unmodified images.
Very little of this disparity image appears to have the correct disparity. In Figure 4.10(b), we see
the result when the right image is translated so that its vertical center aligns within one pixel of
the center of the left image. This greatly improves the results, but there are still large portions
of the image that could not be matched correctly. Figure 4.10(c) shows the result when the right
image is translated and rotated to align with the left image. The quality of the results in this case
is further improved, but we still have not fully compensated for all of the image distortion caused
by an imperfect imaging system. This is the quality of results we might expect from a stereo
vision system where the cameras have been carefully aligned by hand but the images have not
been corrected for all forms of image distortion. Better results could be obtained with specialized
lenses and cameras designed to minimize distortion. Finally, Figure 4.10(d) shows the result when
the images are fully rectified.
Clearly, rectification is critical for the census algorithms. Results are similar for the other
local methods evaluated in the following sections.

4.3.3 Accuracy of Existing Local Methods

For the results in this section, four common local stereo methods have been used to produce
disparity images that can be compared with the results of the sparse census and sparse rank stereo
methods. The SAD and NCC methods (Section B.3.1) are traditional methods that are commonly
used as points of comparison with other methods. SMW and MSW (Section B.3.5) are more recent
developments. In particular, the MSW method of [45] is well-known as a relatively robust local
method that is well-suited for real-time stereo vision systems [1]. In addition to these methods,
this section will also show the disparity images for the original census and rank transform stereo
methods.

Figure 4.10: Resulting disparity maps using the 16-point sparse census with four different
examples of image correction. Panels: (a) Unmodified, (b) Translated, (c) Translated and Rotated, (d) Rectified.

Figures 4.11–4.14 show the results of these six methods on the Pole, Hydrant, Spillway, and
Driveway images, respectively. The parameters used for each algorithm are the same parameters
that were found in Appendix B to give the highest average correlation accuracy on the training
image set (Tsukuba, Venus, Teddy, and Cones). For reference, these parameters are shown in
Table 4.11.
Considering the quality of the original census and rank disparity images in Figures 4.11–
4.14, we can see that the results of both the original census and the original rank methods are
superior to the other four methods using the criteria of Section 4.3.1. However, it is not clear
whether the census or rank is superior. The results of the sparse methods, shown in the following
sections, will generally be a degradation of the original census and rank methods, with the goal
of minimizing this degradation while reducing computational complexity and hardware resource
requirements to a level that rivals even the simplest stereo implementations.

Table 4.11: Algorithm Parameters Used for Qualitative Comparison

Method    Parameters
SAD       13 × 13 window
NCC       9 × 9 window
SMW       16 × 16 sub-window
MSW       9 × 9 sub-window, 9-window configuration
Census    7 × 7 census, 13 × 13 window
Rank      7 × 7 rank, 17 × 17 window

4.3.4 Sparse Census Transform Accuracy

Figures 4.15–4.18 show the results using the sparse census methods. Based on the criteria
of Section 4.3.1, all the sparse census implementations down to the 4-point neighborhood appear to
outperform the four traditional methods shown in Figures 4.11–4.14. The 4-point method reduces
the transformed image to 4 bits per pixel, resulting in significant hardware savings.
Even the 2-point sparse census, requiring 2 bits per pixel, appears to be roughly as good as the
four traditional algorithms, depending on the images and which criteria you consider to be most
important.
It should also be noted that the results of the 16-point sparse census, which uses 67% fewer
bits than the original census, appear to be at least as good as the original census. Additionally,
there is relatively little difference between the results of the 8-point sparse census and the original
census.
Compared to the quantitative results presented in Section 4.1.5, these disparity images
confirm that we can achieve very good results with sparse transform neighborhoods. For example,
the 4-point neighborhood uses only about 8% of the number of points in the original census transform,
yet the results compare quite favorably with the original census as well as the traditional local
methods of Section 4.3.3. However, the very sparse neighborhoods, such as the 2-point and 1-point,
do not seem to perform nearly as well. For these very sparse neighborhoods, the results seem to
correlate with the quantitative results obtained with noisy images. Our real-world images have
imperfections that result from noise, low-quality cameras, and imperfect rectification; defects that
were not present in the original Middlebury images. Such imperfections require more aggregate
image data to overcome and we see the effect of removing too much data in the 2-point and 1-point
disparity images.

4.3.5 Sparse Rank Transform Accuracy

Figures 4.19–4.22 show the results from the sparse rank method. The quality of the sparse
rank results is very similar to that of the sparse census transform, and one is not clearly better than
the other. However, it’s interesting to note that the 1-point sparse rank does appear to perform
better than the 1-point sparse census. Keeping in mind that the two algorithms are identical for a
single-point neighborhood, the only difference between the 1-point rank and 1-point census in this
case is the correlation window size, which is larger for the sparse rank example.
Also noteworthy is the fact that the 2-point sparse rank results are clearly worse than the
1-point sparse rank results. This was predicted by the quantitative results shown in Section 4.2.2.
The reduction in quality for the 2-point neighborhood is due to redundancy and a cancellation effect
of the rank transform that will be described in more detail in Section 5.2.2. Much better results are
obtained if we use a non-redundant 2-point neighborhood, such as the one in Figure 4.8(e).

4.4 Summary

This chapter has proposed a range of sparse census transforms to extend the census transform
previously described in the literature. These new sparse transforms reduce the computational
requirements of the census stereo algorithm. Additionally, this chapter has proposed the sparse
rank transform, which reduces the computational requirements of the rank stereo method, although
less significantly than the sparse census.
The characteristics of the best transform neighborhood have also been described, allowing
the transform to be fine tuned for the intended application and image characteristics. This allows a
balance between computational requirements, correlation accuracy, and robustness to be achieved
in resource-constrained systems. Additionally, methods for increasing robustness in the presence
of image noise have also been described.

Figure 4.11: Disparity images for several local methods on the Pole images. Panels: (a) SAD, (b) NCC, (c) SMW, (d) MSW, (e) Original Census, (f) Original Rank.

Figure 4.12: Disparity images for several local methods on the Hydrant images. Panels: (a) SAD, (b) NCC, (c) SMW, (d) MSW, (e) Original Census, (f) Original Rank.

Figure 4.13: Disparity images for several local methods on the Spillway images. Panels: (a) SAD, (b) NCC, (c) SMW, (d) MSW, (e) Original Census, (f) Original Rank.

Figure 4.14: Disparity images for several local methods on the Driveway images. Panels: (a) SAD, (b) NCC, (c) SMW, (d) MSW, (e) Original Census, (f) Original Rank.

Figure 4.15: Disparity images for the sparse census stereo method on the Pole images. Panels: (a) 16-Point, (b) 12-Point, (c) 8-Point, (d) 4-Point, (e) 2-Point, (f) 1-Point.

Figure 4.16: Disparity images for the sparse census stereo method on the Hydrant images. Panels: (a) 16-Point, (b) 12-Point, (c) 8-Point, (d) 4-Point, (e) 2-Point, (f) 1-Point.

Figure 4.17: Disparity images for the sparse census stereo method on the Spillway images. Panels: (a) 16-Point, (b) 12-Point, (c) 8-Point, (d) 4-Point, (e) 2-Point, (f) 1-Point.

Figure 4.18: Disparity images for the sparse census stereo method on the Driveway images. Panels: (a) 16-Point, (b) 12-Point, (c) 8-Point, (d) 4-Point, (e) 2-Point, (f) 1-Point.

Figure 4.19: Disparity images for the sparse rank stereo method on the Pole images. Panels: (a) 15-Point, (b) 12-Point, (c) 7-Point, (d) 3-Point, (e) 2-Point, (f) 1-Point.

Figure 4.20: Disparity images for the sparse rank stereo method on the Hydrant images. Panels: (a) 15-Point, (b) 12-Point, (c) 7-Point, (d) 3-Point, (e) 2-Point, (f) 1-Point.

Figure 4.21: Disparity images for the sparse rank stereo method on the Spillway images. Panels: (a) 15-Point, (b) 12-Point, (c) 7-Point, (d) 3-Point, (e) 2-Point, (f) 1-Point.

Figure 4.22: Disparity images for the sparse rank stereo method on the Driveway images. Panels: (a) 15-Point, (b) 12-Point, (c) 7-Point, (d) 3-Point, (e) 2-Point, (f) 1-Point.

Many of the sparse census and sparse rank transforms proposed in this chapter have been
shown to provide correlation accuracy that is nearly equivalent to the original census and rank
using standard test images, while some of the proposed transforms tend to outperform the original
census and rank on these same benchmarks. Quantitative accuracy results have been presented
using these standard image benchmarks as well as noisy versions of the benchmarks. Finally,
disparity images have been presented to confirm the quality of these algorithms under real-world
conditions through qualitative comparison.

CHAPTER 5. GENERALIZED NON-PARAMETRIC TRANSFORMS

This chapter defines the generalized census and generalized rank transforms. These new
transforms, which are in fact supersets of the original and sparse census and rank transforms,
allow for additional flexibility in the selection of the transform neighborhood, leading to increased
robustness and reduced resource requirements in some configurations.
Section 5.1 will introduce and analyze the generalized census, providing stereo correlation accuracy results for several implementations of the transform. Section 5.2 will introduce the
generalized rank transform, demonstrate a few examples, and provide correlation accuracy results.
Similar to the previous chapter, Section 5.3 will provide qualitative results in the form of disparity
maps from real-world stereo images. Finally, Section 5.4 will summarize the results.

5.1 The Generalized Census Transform

Section 4.1.2 described the redundancy that exists in the traditional census transform. It
also described the non-redundant sparse census transform, which is designed to only include a set
of points in the census transform neighborhood that leads to non-redundant use of comparisons
in the correlation step. This section will introduce a new, more generalized form of the census
transform that will allow us to create sparse transforms that are non-redundant, symmetric, and
can be implemented using even fewer hardware resources. This leads to a very efficient version of
the census transform that has increased robustness to noise, delivers the same correlation accuracy,
yet requires less hardware to implement than the sparse census transforms of Section 4.1.

5.1.1 Motivation and Definition

One of the disadvantages of the census transform, as described in Section 4.1, is that it is
not possible to generate a non-redundant census transform neighborhood that is symmetric about
the center pixel. As a result, with a non-redundant transform, the pixel comparisons used in the
similarity measure computation are not centered symmetrically about the pixel for which the similarity is being computed. This side-effect is due to the fact that when the census transform of a
pixel is computed, the intensities of pixels in the neighborhood are always compared with the center
pixel.
Consider the 4-point census transform of Figure 4.7(d). Figure 5.1(a) shows the comparisons that are made by this transform neighborhood when computing the census transform.
Figure 5.1(b) shows the combined effect when this census transform is followed by 3 × 3 SHD
correlation. A 3 × 3 correlation is not large enough to provide good correlation results in general
and is used here for illustration purposes only.

Figure 5.1: 4-point census transform graphs. (a) Graph of census transform comparisons. (b) Comparisons when combined with 3 × 3 SHD correlation.

Notice that each pair of pixels compared in the transform that is within the inner 3 × 3 correlation window (outlined by the blue square) is compared twice. Clearly these are redundant comparisons, six of which can be removed without significantly affecting correlation accuracy. These
redundancies occur because the two vertical comparisons in the census transform (Figure 5.1(a))
are mirrors of each other. The same is true of the horizontal comparisons in the figure.
We can eliminate one of each pair of comparisons to create a non-redundant version of
the transform, as shown in Figure 5.2(a), but this results in the greatly asymmetric correlation of
Figure 5.2(b). Notice that the 3 × 3 region within the correlation window no longer has redundancies, but we have also lost the non-redundant comparisons on the bottom and right side. It is the
loss of comparisons outside the correlation window as well as the asymmetry of the comparisons
that causes the non-redundant transform neighborhoods to perform no better than the redundant
versions, as shown in Section 4.1.3.

Figure 5.2: Non-redundant 2-point versions of the census transform graphs.
(a) Graph of census transform comparisons. (b) Comparisons when combined with
3 × 3 SHD correlation.

The asymmetry in the graph of Figure 5.2(b) is a direct result of the asymmetry in the
census transform of Figure 5.2(a). An elegant solution to this problem is to shift the comparisons of
Figure 5.2(b) so that they are centered about the center pixel of the census transform neighborhood.
This results in the census transform shown in Figure 5.3(a). Notice that the comparisons of the
transform are no longer being made with a single common pixel. When combined with 3 × 3
SHD, we obtain the comparisons shown in Figure 5.3(b). This graph shows the same number
of comparisons as Figure 5.2(b), but they are now symmetric and centered about the correlation
window. Additionally, all comparisons are non-redundant, which was not possible for a symmetric
graph using the standard sparse census transform.
This example shows that if we remove the requirement that all comparisons be made with
the center pixel of the census transform neighborhood, we obtain a more generalized version of
the census transform in which any two neighborhood points can be used for each comparison. We
can mathematically define this new, more generalized version of the census transform as follows.
Let $(c_1, c_2, \ldots, c_n)$ and $(c'_1, c'_2, \ldots, c'_n)$ be finite sequences of coordinates for some $n \ge 1$. The
generalized census transform of an image point $p$ is then given by

$$C_G(p) = \bigotimes_{1 \le i \le n} \xi(p + c_i,\, p + c'_i). \tag{5.1}$$

Figure 5.3: Modified versions of the non-redundant 2-edge census transform
graphs. (a) Graph of census transform comparisons. (b) Comparisons when
combined with 3 × 3 SHD correlation.
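Equation 5.1 can be read as "one comparison bit per edge, concatenated in order." The sketch below is a minimal software rendering under two assumptions that are not restated in this section: that ξ yields 1 when its second argument is darker than its first, and that coordinates are (row, column) offsets. The function names, the image, and the example edge lists are hypothetical.

```python
def generalized_census(img, p, edges):
    """One bit per edge (c, c_prime): 1 if the pixel at p + c_prime is darker
    than the pixel at p + c. The bits are concatenated in edge order."""
    y, x = p
    bits = []
    for (cy, cx), (py, px) in edges:
        first = img[y + cy][x + cx]
        second = img[y + py][x + px]
        bits.append(1 if second < first else 0)
    return bits


def hamming(a, b):
    """Hamming distance between two census vectors (the per-pixel term of SHD)."""
    return sum(u != v for u, v in zip(a, b))


img = [[10, 90, 30],
       [40, 50, 45],
       [70, 80, 90]]

# Two edges that straddle the center pixel instead of starting from it.
centered_edges = [((-1, 0), (1, 0)), ((0, -1), (0, 1))]
# The standard census is the special case where every edge starts at the center.
standard_edges = [((0, 0), (0, 1))]

print(generalized_census(img, (1, 1), centered_edges))
print(generalized_census(img, (1, 1), standard_edges))
```

Setting every $c_i$ to the zero offset recovers the original or sparse census transform, which is why the generalized form is a superset of both.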

5.1.2 Reduced Hardware Resource Requirements

A real-time architecture suitable for implementing the census transform is described in
Chapter 6. The amount of memory (excluding registers) required to implement the N × N window
buffer for the transform is proportional to M(N − 1), where M is the image width and N is the size
of the transform window. The number of registers required equals N(N − 1). Thus, a reduction
in the transform window size, N, will significantly reduce the amount of memory and registers
required to implement the window buffer.

One desirable property of the generalized census transform is that it improves the locality
of the pixels required to compute the census vector, allowing the transform to be computed using
a smaller window size and, therefore, a smaller memory buffer. Notice that the census transform
of Figure 5.1(a) requires a 5 × 5 window of pixels to be computed, whereas a generalized version,
shown in Figure 5.7(a), requires only a 3 × 3 window. As another example, consider the non-redundant transform neighborhood of Figure 5.4(a) (taken from Figure 4.8(d)), which requires an
8 × 8 window of pixels to compute. A generalized version of this neighborhood, shown in Figure 5.4(b), requires only a 5 × 5 window. In general, a full or sparse census transform of dimension
N (width or height) can be converted to a roughly equivalent generalized census transform of dimension $\lfloor N/2 + 1 \rfloor$.

Figure 5.4: Sparse and generalized census transform window buffer size
comparison. (a) Graph of non-redundant, 4-point, sparse census transform.
(b) Graph of generalized, 4-edge census transform.

Since all edges in the census transform graph can be moved closer to the center, the overall
width and height of the transform window can be reduced, minimizing memory requirements.
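The resource expressions in this section can be evaluated directly. The following sketch uses the stated relations, memory proportional to M(N − 1) and N(N − 1) registers, together with the approximate converted dimension ⌊N/2 + 1⌋. The function names are hypothetical, memory is counted in pixels rather than bits, and the image width of 640 is an assumption based on the VGA sensors mentioned earlier.

```python
def window_buffer_cost(image_width, n):
    """Line-buffer memory (in pixels, up to a constant factor) and register
    count for an n x n transform window."""
    return {"memory_pixels": image_width * (n - 1), "registers": n * (n - 1)}


def generalized_dimension(n):
    """Approximate window dimension after conversion: floor(n / 2 + 1)."""
    return n // 2 + 1


width = 640  # assumed VGA image width
for n in (5, 8):
    g = generalized_dimension(n)
    print(n, window_buffer_cost(width, n), "->", g, window_buffer_cost(width, g))
```

For the 5 × 5 example, the generalized 3 × 3 window halves the line-buffer memory and cuts the register count from 20 to 6.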

5.1.3 Characteristics of the Generalized Census Transform

There are several noteworthy characteristics of the generalized census transform. First,
the generalized transform is not guaranteed to be non-redundant, nor is it guaranteed to lead to a
similarity computation that is symmetric about the pixels being compared.
Figure 5.5(a) shows one example of a generalized census transform graph that contains
redundancies. These redundancies are visible in Figure 5.5(b), which shows the effect when the
transform is combined with 3 × 3 SHD correlation.

Figure 5.5: Redundant generalized census example. (a) Graph of a 4-edge
generalized census transform. (b) Effect when combined with 3 × 3 SHD.

In general, similarly to the original census transform, a redundancy will exist if two edges
of the transform graph are symmetric about the center pixel of the transform neighborhood. Put
another way, if an edge between $c$ and $c'$ is in the transform graph then the edges from $-c$ to $-c'$
and $-c'$ to $-c$ must not be in the graph to avoid redundant comparisons.
Figure 5.4(b) shows an example of a graph that results in an asymmetric similarity computation. In order to guarantee symmetry, the generalized census transform graph must also be
symmetric about the center pixel. A single edge can only be placed symmetrically over the center of the transform window if both the vertical height and horizontal length are odd, since this
guarantees that the edge has a center point.
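The redundancy and symmetry conditions can be checked mechanically for a candidate transform graph. The sketch below treats each edge as an unordered pair of (row, column) offsets and reports pairs of distinct edges that mirror each other about the center, which is one reading of the rule above. The function names and the example edge lists are hypothetical and are not taken from the figures.

```python
def negate(c):
    return (-c[0], -c[1])


def mirror(edge):
    """The edge reflected through the center pixel, as an unordered pair."""
    c, c_prime = edge
    return frozenset((negate(c), negate(c_prime)))


def redundant_pairs(edges):
    """Pairs of distinct edges that are mirror images of each other."""
    found = []
    for i, e in enumerate(edges):
        for f in edges[i + 1:]:
            if frozenset(f) == mirror(e):
                found.append((e, f))
    return found


def is_symmetric(edges):
    """True if the mirror image of every edge is also in the graph."""
    unordered = {frozenset(e) for e in edges}
    return all(mirror(e) in unordered for e in edges)


center = (0, 0)
four_point = [(center, (-2, 0)), (center, (2, 0)), (center, (0, -2)), (center, (0, 2))]
two_point = [(center, (-2, 0)), (center, (0, -2))]
two_edge = [((-1, 0), (1, 0)), ((0, -1), (0, 1))]

for name, graph in (("4-point", four_point), ("2-point", two_point), ("2-edge", two_edge)):
    print(name, len(redundant_pairs(graph)), is_symmetric(graph))
```

Under this reading, an edge whose endpoints are negatives of each other is its own mirror image, so it contributes to symmetry without introducing a redundant comparison.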

Figure 5.6(a) shows a general census transform graph with two examples of edges that
cannot be placed symmetrically about the center of the transform. We can create a symmetric
version of the graph simply by adding copies of the edges that are mirrored about the center pixel,
as shown in Figure 5.6(b), but these are redundant comparisons whose only real value is to balance
the similarity computation comparisons that go outside the correlation window.

Figure 5.6: Asymmetric generalized census example. (a) Graph containing
two asymmetric comparison edges. (b) Symmetric version created by adding
redundant comparisons.

Another noteworthy characteristic of the generalized census transform is that the effective
spread of the comparisons used in the similarity computations is smaller than with the standard
census transform, when the generalized census graph has been designed to minimize the size of
the window buffer required for its implementation.
For example, we can take the census transform shown graphically in Figure 5.1 and create
the generalized version shown in Figure 5.7. Notice that the comparisons within the correlation
window (the blue box) of both Figures 5.1(b) and 5.7(b) are nearly identical. The two graphs
also have the same number of comparisons (i.e., the same number of edges). The main difference
between the two graphs is the placement of the comparisons that go outside the correlation window.
Because the comparisons are brought closer to the center of the transform window in the
generalized version of the census transform, the comparisons made at the edge of the correlation
window are also brought closer to the center. This results in a transform with the same number
of comparisons, but a reduced spread near the edges of the correlation window. As a result, the
correlation accuracy of the two versions of the census transform is not identical, but is very similar.

Figure 5.7: Example of the reduced spread of the generalized census similarity computation. (a) Generalized census transform implementation of Figure 5.1(a). (b) Comparisons
when combined with 3 × 3 SHD correlation.

5.1.4 Generalized Census Correlation Accuracy

In order to compare the correlation accuracy of the generalized census transform to the
sparse transform described in Section 4.1, I designed six generalized census transforms, diagrammed in Figure 5.8. These transforms were designed based on the results of the first four
image datasets (the training set). All of these transforms are symmetric and all but the 16 and
12-edge transforms are completely non-redundant. Both of these graphs have two redundant edges
added to make the transform symmetric. Note that an n-edge generalized transform vector requires
the same number of bits as an n-point sparse census transform. Thus, the generalized transforms
in Figure 5.8 have the same hardware implementation costs as the sparse census transforms of
Figure 4.7, with the exception of the window buffer memory requirements, which are smaller for
the generalized census.

Figure 5.8: Generalized census transform graphs. Panels: (a) 16-Edge, (b) 12-Edge, (c) 8-Edge, (d) 4-Edge, (e) 2-Edge, (f) 1-Edge.

The correlation accuracy for each of the transforms of Figure 5.8 was evaluated for all eight
image datasets using 13 × 13 SHD. The results are shown in Table 5.1. Note that the “Full 7 × 7”
column in Table 5.1 is the accuracy of the standard census transform, which is again our baseline
for comparison.
From this table we see that the correlation accuracy for the generalized transforms is quite
similar to the accuracy of the sparse census transforms, which was shown in Table 4.3.
In the presence of noise, however, we would expect the generalized transforms, which
have much less redundancy, to perform better than the sparse transforms. Table 5.2 shows the
correlation accuracy for the generalized transforms when noise with a standard deviation equal
to 1% of the pixel range is added to the images. Compared to the previous results for the sparse
census transforms, shown in Table 4.4, we see that the average correlation accuracy of the generalized
versions is always better than the sparse versions for the same number of census transform bits.

Table 5.1: Generalized Census Transform (Figure 5.8), Correlation Accuracy Using 13 × 13 SHD

Dataset    Full 7 × 7   16-Edge   12-Edge   8-Edge   4-Edge   2-Edge   1-Edge
Tsukuba      90.92       92.28     92.34    92.38    91.53    90.87    89.67
Venus        96.93       96.87     96.92    96.86    96.84    96.62    95.63
Teddy        87.03       86.91     86.97    86.89    86.98    86.93    85.32
Cones        87.69       87.01     87.20    87.14    87.74    88.09    86.64
Average      90.64       90.77     90.86    90.82    90.77    90.63    89.31
Rocks        90.74       90.58     90.70    90.61    90.79    90.87    90.46
Baby         92.31       91.86     91.95    91.92    92.06    91.82    89.63
Wood         89.31       89.45     89.54    89.37    88.93    88.37    84.58
Cloth        87.16       86.71     86.84    86.74    87.01    86.90    85.39
Average      89.88       89.65     89.75    89.66    89.69    89.49    87.51

Table 5.2: Generalized Census Transform (Figure 5.8), Correlation Accuracy on Images with 1% Noise Using 13 × 13 SHD

Dataset    Full 7 × 7   16-Edge   12-Edge   8-Edge   4-Edge   2-Edge   1-Edge
Tsukuba      87.46       88.11     87.84    88.19    86.38    85.01    83.11
Venus        89.50       89.46     89.14    89.27    86.69    85.19    82.73
Teddy        84.66       84.39     84.32    84.38    83.64    81.74    78.71
Cones        87.04       86.52     86.66    86.63    87.08    87.16    85.59
Average      87.16       87.12     86.99    87.12    85.95    84.78    82.54
Rocks        90.23       90.11     90.19    90.02    89.88    89.63    88.01
Baby         89.17       88.77     88.38    88.65    85.88    82.30    75.84
Wood         72.31       72.20     71.67    71.28    67.62    63.62    53.73
Cloth        86.71       86.28     86.34    86.22    85.94    85.18    82.35
Average      84.61       84.34     84.14    84.04    82.33    80.18    74.98

This suggests that these generalized transforms are more robust in the presence of noise than the
sparse transforms.
As with the sparse census transform, we can modify the graphs of Figure 5.8 to optimize
them for noisier images. Revised versions of the census transform graphs are shown in Figure 5.9.
The correlation accuracy for these revised graphs is shown in Table 5.3.
By comparing Table 5.3 with Table 4.5 we can see that, in the presence of a moderate
amount of Gaussian image noise, the generalized census transforms provide nearly the same level
of correlation accuracy as the optimized sparse census transforms of Figure 4.8. However, these
generalized transforms require a smaller window buffer for implementation.

Figure 5.9: Revised general census transform graphs for noisy images. Panels: (a) 16-Edge, (b) 12-Edge, (c) 8-Edge, (d) 4-Edge, (e) 2-Edge, (f) 1-Edge.

Table 5.3: Revised Generalized Census Transform (Figure 5.9), Correlation Accuracy on Images with 1% Noise Using 13 × 13 SHD

Dataset    Full 9 × 9   16-Edge   12-Edge   8-Edge   4-Edge   2-Edge   1-Edge
Tsukuba      88.85       89.44     89.29    88.89    87.95    86.37    84.58
Venus        90.53       91.28     90.98    91.00    89.66    87.84    85.17
Teddy        84.80       84.23     84.28    83.82    83.86    83.15    80.59
Cones        86.64       85.75     85.82    85.58    85.85    86.17    84.21
Average      87.71       87.67     87.59    87.32    86.83    85.88    83.64
Rocks        90.16       89.39     89.52    89.04    89.21    89.31    88.39
Baby         89.87       89.06     89.04    88.76    88.27    86.77    81.98
Wood         73.35       74.00     73.63    72.89    70.06    68.37    62.29
Cloth        86.58       85.78     85.89    85.59    85.79    85.60    83.52
Average      84.99       84.56     84.52    84.07    83.33    82.51    79.04

5.2 The Generalized Rank Transform

5.2.1 Definition

The rank transform can be generalized in much the same way as the census transform by
removing the dependency on the center pixel of the transform neighborhood. We can mathematically
define the generalized rank transform as follows. Let $(c_1, c_2, \ldots, c_n)$ and
$(c'_1, c'_2, \ldots, c'_n)$ be finite sequences of coordinates for some $n \ge 1$. The generalized
rank transform of an image point $p$ is then given by

\[ R_G(p) = \sum_{1 \le i \le n} \xi(p + c_i,\; p + c'_i). \tag{5.2} \]

The name “rank transform” comes from the fact that the rank transform of a pixel represents the rank of that pixel’s intensity among its neighbors. The generalized version removes the
requirement that all comparisons be made with the same pixel. Thus, strictly speaking, it is no
longer a rank at all when the comparisons are made with arbitrary pixels. More accurately, it can
be viewed as an aggregation of the generalized census transform.
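Equation (5.2) can be sketched in a few lines of Python. This is only an illustration: the comparison `xi` and its direction (1 when the second pixel is darker than the first) are assumed here, and both edge lists are hypothetical. The result is the sum of the per-edge comparison bits, that is, the number of 1 bits in the generalized census vector for the same edges.

```python
def xi(image, a, b):
    """Assumed comparison: 1 if the pixel at b is darker than the pixel at a."""
    (ax, ay), (bx, by) = a, b
    return 1 if image[by][bx] < image[ay][ax] else 0


def generalized_rank(image, p, edges):
    """Sum of the per-edge comparisons, following the form of Equation (5.2)."""
    px, py = p
    return sum(
        xi(image, (px + c[0], py + c[1]), (px + cp[0], py + cp[1]))
        for c, cp in edges
    )


# Standard sparse rank: every comparison starts at the center pixel (c_i = 0).
STANDARD_2 = [((0, 0), (-2, 0)), ((0, 0), (2, 0))]
# Hypothetical generalized graph: no comparison involves the center pixel.
GENERAL_2 = [((-1, 0), (1, 0)), ((2, 0), (-2, 0))]

row_image = [[5, 9, 7, 3, 8]]
standard_rank = generalized_rank(row_image, (2, 0), STANDARD_2)
general_rank = generalized_rank(row_image, (2, 0), GENERAL_2)
```

With n edges the result always lies in the range 0 to n, whichever pixels the edges connect.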

5.2.2 Characteristics of the Generalized Rank Transform

With the generalized census transform, we were able to visualize the effect of combining a
given census transform with SHD correlation. This allowed us to see exactly which comparisons
were included in the similarity measure and which, if any, were redundant. We could reduce
the required transform window buffer size and eliminate redundancy while maintaining symmetry
without penalizing correlation accuracy, leading to more elegant transforms. In contrast, since the
rank transform causes the aggregation of the census vector before the correlation step, this type
of analysis no longer applies. We cannot simply rearrange the transform graph edges to define a
nearly equivalent transform graph requiring a smaller transform window buffer.

To illustrate, consider the 2-edge rank transform graph of Figure 5.10(a), which represents
the standard two-point sparse rank transform neighborhood. We might be tempted to translate
each edge so that they are centered about the center pixel, reducing the transform window buffer
requirements. The resulting graph is shown in Figure 5.10(b). Unfortunately, because the rank
transform combines the two edges into a single scalar, the two edges cancel each other out. One of
the edges will always represent a greater-than relationship while the other will always represent a
less-than-or-equal-to relationship. Therefore, the census transform will always be “10” or “01” and
the rank transform, which is the sum of the bits, will always be 1. Thus, the rank transform of any
pixel using the rank transform graph of Figure 5.10(b) will always be 1, making correlation with
this transform meaningless. Note that this is different from the census transform method, where
one of the edges is simply redundant and does not hurt correlation.
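The cancellation can be checked exhaustively with a short Python sketch. It assumes a strict comparison `xi` (1 when the second value is smaller than the first) and models the two opposed edges as the pairs (a, b) and (b, a); both choices are assumptions for illustration. Under this strict comparison a tie between the two pixels yields 0 rather than 1, so ties are reported separately.

```python
from itertools import product


def xi(first, second):
    """Assumed strict comparison: 1 if `second` is darker than `first`."""
    return 1 if second < first else 0


def canceling_rank(a, b):
    """Rank for two opposed edges joining the same pair of pixel values."""
    return xi(a, b) + xi(b, a)


def canceling_census(a, b):
    """The same two comparisons kept as separate bits instead of summed."""
    return (xi(a, b) << 1) | xi(b, a)


# All pairs of distinct 8-bit intensities.
rank_values = {canceling_rank(a, b) for a, b in product(range(256), repeat=2) if a != b}
census_values = {canceling_census(a, b) for a, b in product(range(256), repeat=2) if a != b}
tie_value = canceling_rank(7, 7)
```

For distinct intensities the rank takes a single value, so it carries no information, while the census vector still distinguishes the two orderings.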

Figure 5.10: Generalized rank transform cancellation. (a) Standard 2-edge graph. (b) Canceling 2-edge graph.

Now consider the graph of Figure 5.11(a), where the direction of one edge is switched.
In this case the edges do not cancel but represent the exact same relationship. As a result, the
correlation accuracy using this graph is no better than if a single edge were used. We can modify
the graph so that the two edges are slightly offset, as shown in Figure 5.11(b). Since most images
change fairly smoothly from pixel to pixel, these two edges are likely to cover very similar portions
of the image. As a result, this graph does not provide significant improvement over a single-edge
graph. A better choice is the original graph of Figure 5.10(a), which represents the rank of a pixel
among two of its neighbors along a horizontal line in the image.

Figure 5.11: Generalized rank transform redundancy. (a) Redundant 2-edge graph. (b) Nearly redundant 2-edge graph.

With the generalized rank transform, the various edges of the graph tend to cancel or combine, making useful edge selection more difficult. Nevertheless, we can carefully construct transform graphs that allow us to reduce the required size of the window buffer and increase the robustness of the transform in the presence of noise. A few such generalized rank transforms will be
discussed in the following section.

5.2.3 Examples of the Generalized Rank Transform

Figure 5.12 shows three rank transform graphs. Figure 5.12(a) represents a standard 3-point
sparse rank transform. Figures 5.12(b) and 5.12(c) represent two generalized rank transforms that
have been optimized in two ways. First, the area of the graph has been reduced so as to reduce
the window buffer size needed for a hardware implementation. Second, they avoid
making all comparisons with the same point, increasing robustness to image noise. Figure 5.12(c)
is particularly interesting because it includes the same edge twice, effectively weighting that edge
to be twice as important as the longer edge. Because it is purely horizontal on a single image row,
it also requires a minimum amount of image buffering, just five pixels, to implement. As we
will see, this graph performs very well on our benchmark image datasets.

Figure 5.12: Generalized rank transform examples. (a) Standard 3-edge graph. (b) Overlapping 3-edge graph. (c) Linear 3-edge graph.

A large number of other generalized rank transforms are possible, but creating transforms
that perform as well as the standard sparse rank transforms of Section 4.2 is difficult due to the
way the edges of the graph interact, as described in Section 5.2.2. Graphs with more than three or
four edges tend to perform best when the edges do not overlap, creating graphs more similar to the
standard sparse transforms. As a result, this section will focus on graphs having just three edges,
which leads to very efficient rank transforms where each pixel can be represented using a single
two-bit number.
Table 5.4 shows the correlation accuracy for the three graphs of Figure 5.12. The traditional
7 × 7 rank transform is also shown as a baseline for comparison. All three graphs tend to perform
nearly as well as the standard 7 × 7 rank transform, and generally as well as the 4-point sparse
transform (Table 4.7).

Table 5.4: Generalized Rank Transform (Figure 5.12), Correlation Accuracy Using 17 × 17 SAD

Dataset    Full 7 × 7   Standard 3-Edge   Overlapping 3-Edge   Linear 3-Edge
Tsukuba      90.80          91.59               92.72              92.32
Venus        96.79          96.64               96.45              96.80
Teddy        86.36          86.57               86.25              85.91
Cones        86.90          87.09               85.95              86.45
Average      90.21          90.47               90.34              90.37
Rocks        89.63          89.68               89.84              89.89
Baby         91.37          91.71               91.44              91.03
Wood         88.18          88.10               88.29              88.55
Cloth        86.10          85.80               85.28              85.30
Average      88.82          88.82               88.71              88.69

Table 5.5: Generalized Rank Transform (Figure 5.12), Correlation Accuracy on Images with 1% Noise Using 17 × 17 SAD

Dataset    Full 9 × 9   Standard 3-Edge   Overlapping 3-Edge   Linear 3-Edge
Tsukuba      87.26          85.84               88.92              88.22
Venus        89.50          87.47               88.28              89.81
Teddy        83.29          81.46               83.41              83.40
Cones        86.12          86.13               85.15              85.98
Average      86.54          85.22               86.44              86.85
Rocks        89.01          88.82               88.75              89.49
Baby         86.66          80.86               87.58              88.05
Wood         70.47          66.87               68.44              65.31
Cloth        85.05          83.45               84.47              84.73
Average      82.80          80.00               82.31              81.90

Table 5.5 shows the correlation accuracy when we add noise with a standard deviation equal
to 1% of the pixel range to the images. In the presence of noise, the graphs of Figures 5.12(b) and
5.12(c) show increased robustness because they do not rely on the value of a single pixel to compute
the rank, as Figure 5.12(a) does. They also perform well compared to the full 9 × 9 rank transform,
despite the significant reduction in the number of bits needed to represent the generalized rank
transforms—from seven bits to two.
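The bit counts quoted here follow from the range of the rank value: with n comparisons, the rank lies between 0 and n. A small Python check, assuming the full W × W rank transform compares the center pixel with each of the other W² − 1 pixels in the window (the function names are chosen for this example only):

```python
def rank_bits(num_comparisons):
    """Bits needed to store any rank value in the range 0..num_comparisons."""
    return max(1, num_comparisons.bit_length())


def full_rank_bits(window_size):
    """Bits for a full window_size x window_size rank transform."""
    return rank_bits(window_size * window_size - 1)


bits_full_9x9 = full_rank_bits(9)   # 80 comparisons, values 0..80
bits_full_7x7 = full_rank_bits(7)   # 48 comparisons, values 0..48
bits_3_edge = rank_bits(3)          # 3 edges, values 0..3
```

The narrower rank value matters because the SAD similarity modules operate on it directly for every pixel and every disparity level.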

5.3 Qualitative Performance Analysis

As discussed in Section 4.3.1, the high-quality Middlebury stereo test images may be
considered to be overly idealistic in terms of the quality of the images. As a result, it is useful to
qualitatively evaluate the performance of the generalized census and generalized rank algorithms
relative to other algorithms in order to validate our results. This section presents the
disparity maps that result from running the generalized census and generalized rank algorithms on
the same set of real-world test images that was used in Section 4.3.
Refer to Section 4.3.1 for an overview of the methodology used when qualitatively evaluating stereo algorithms by a comparison of the disparity images. For disparity images to be
compared, refer to the results of the traditional stereo methods in Figures 4.11–4.14 as well as the
results of the sparse census and sparse rank methods in Figures 4.15–4.18 and Figures 4.19–4.22,
respectively.
Here, Figures 5.13–5.16 show the results of the generalized census using the graphs of
Figure 5.8. Compared to the disparity images for the sparse census, shown in Section 4.3.4, the
results for the generalized census are very similar. However, the results for the 2-edge generalized
transform are clearly better than those for the 2-point sparse census. The 2-edge generalized census
even appears to outperform all four of the traditional methods shown in Section 4.3.3. The
generalized 2-edge transform used here is non-redundant and has edges in both the vertical and
horizontal directions. These features make the 2-edge generalized census more robust than the
2-point sparse census used in Section 4.3.4. This result also demonstrates how sensitive the
performance is to edge selection when a small number of edges is used.
Figures 5.17–5.20 show the results for the 3-edge generalized rank using the graphs of
Figure 5.12. The overlapping generalized graph of Figure 5.12(b) clearly outperforms the other
two graphs. This is not surprising since the standard graph of Figure 5.12(a) is in fact a sparse
neighborhood that does not require the generalized definition and the linear graph of Figure 5.12(c)
includes some redundancy. These results are also consistent with the quantitative results reported
in Section 5.2.3. The 3-edge overlapping generalized rank results here also appear to outperform
the results of the 3-point sparse rank shown in Section 4.3.5.

Figure 5.13: Disparity images for the generalized census stereo method on the Pole images. Panels: (a) 16-Edge, (b) 12-Edge, (c) 8-Edge, (d) 4-Edge, (e) 2-Edge, (f) 1-Edge.

Figure 5.14: Disparity images for the generalized census stereo method on the Hydrant images. Panels: (a) 16-Edge, (b) 12-Edge, (c) 8-Edge, (d) 4-Edge, (e) 2-Edge, (f) 1-Edge.

Figure 5.15: Disparity images for the generalized census stereo method on the Spillway images. Panels: (a) 16-Edge, (b) 12-Edge, (c) 8-Edge, (d) 4-Edge, (e) 2-Edge, (f) 1-Edge.

Figure 5.16: Disparity images for the generalized census stereo method on the Driveway images. Panels: (a) 16-Edge, (b) 12-Edge, (c) 8-Edge, (d) 4-Edge, (e) 2-Edge, (f) 1-Edge.

Figure 5.17: Disparity images for the generalized rank stereo method on the Pole images. Panels: (a) 3-Edge Standard, (b) 3-Edge Overlapping, (c) 3-Edge Linear.

5.4 Summary

This chapter has defined the generalized census and generalized rank transforms. These
new transforms provide correlation accuracy similar to the original census and rank stereo
methods, while providing additional flexibility in choosing the edges of the transform. This allows
greater symmetry to be achieved in the correlation window during the correlation step. The
transforms also tend to have more inherent robustness in the presence of image noise than the
original sparse transforms proposed in Chapter 4 due to reduced redundancy. Additionally, in some
cases, the generalized transforms allow a reduced transform window size to be used with minimal
impact on correlation accuracy, leading to reduced hardware resource requirements.

Figure 5.18: Disparity images for the generalized rank stereo method on the Hydrant images. Panels: (a) 3-Edge Standard, (b) 3-Edge Overlapping, (c) 3-Edge Linear.

Quantitative results showing the correlation accuracy on standard stereo image benchmarks
have been presented using the original images as well as noisy versions of the images. Optimizations have been described that increase robustness in the presence of image noise. Finally, disparity
images have been presented to confirm the quality of these algorithms under real-world conditions
through qualitative comparison.

Figure 5.19: Disparity images for the generalized rank stereo method on the Spillway images. Panels: (a) 3-Edge Standard, (b) 3-Edge Overlapping, (c) 3-Edge Linear.

Figure 5.20: Disparity images for the generalized rank stereo method on the Driveway images. Panels: (a) 3-Edge Standard, (b) 3-Edge Overlapping, (c) 3-Edge Linear.

CHAPTER 6. HARDWARE IMPLEMENTATION

In Chapters 4–5, the sparse and generalized forms of the rank and census transforms were
introduced. Sections 4.1.1 and 4.2.1 described qualitatively how reducing the number of points
in a transform neighborhood leads directly to a reduction in the amount of hardware resources required to implement the digital circuit that performs stereo correlation using these algorithms. This
chapter will discuss actual hardware implementations of stereo vision algorithms using these new
methods as well as the original rank and census stereo vision methods. This will allow the trade-off
between correlation accuracy and hardware resource requirements to be quantitatively described.
To put the hardware resource savings afforded by these algorithms into proper perspective, this
chapter will also show the resource requirements for the traditional SAD stereo method, which is
generally considered to be one of the simplest and lowest-cost stereo methods.
Section 6.1 provides an overview of existing hardware implementations of stereo vision
systems that have been described in the literature. Section 6.2 will describe the architecture of
the stereo correlation system that will be used to evaluate the hardware resource requirements of
the different algorithms discussed in this dissertation. This architecture, although very similar
to previous systems, introduces specific enhancements that lead to significant reductions in the
amount of memory required to implement the system. Section 6.3 will show the actual FPGA
resource requirements of each stereo method. Finally, Section 6.4 will summarize the results.

6.1 Previous Work

Perhaps the earliest and one of the most important research projects in real-time stereo
vision was completed by Faugeras et al. at INRIA in 1993 [57]. Their work describes in detail
how to build a complete stereo vision system, including camera calibration, rectification,
similarity measures, disparity search, and 3D reconstruction. In their work, they propose a variety of
similarity measures, which are essentially variations of NCC, ZNCC, and normalized SSD. They
then describe two working implementations of their stereo methods. The first implementation uses
four Motorola 96002 DSPs. The second uses the DEC PeRLe-1 [61], making this implementation
the first stereo vision system using configurable logic. They further describe how stereo vision can
be employed for robot navigation, focusing largely on the application of robotic planetary
exploration. Through their work, they exposed some of the most common stereo vision optimizations
used today.
Two years later, Kanade et al. described the CMU stereo vision machine [62], [63], now
one of the best-known stereo vision implementations. In this work, they propose the SSSD
(Sum of SSD) stereo correlation method, although they implemented a more computationally
efficient variation called SSAD (Sum of SAD). This extension to standard, two-camera similarity
measures allowed them to create a stereo vision machine, first using a six-camera and later a
five-camera implementation. Their real-time implementation was built using a combination of frame
grabbers, PLDs (the predecessors of the modern FPGA), ROMs, RAMs, and a C40 DSP system. Due to
the limited technology available at the time the CMU machine was built, their system was quite
large and complicated, involving a large number of custom circuit boards integrated into a Sun
workstation via the VMEbus. Fortunately, technological advances would allow the complexity of
stereo vision systems to be dramatically reduced.
In 1997, several FPGA-based stereo vision implementations were introduced. Woodfill,
co-author of the original paper introducing the rank and census transforms [5], and Von Herzen developed a hardware implementation of the census stereo method using the PARTS reconfigurable
computer [64]. This computer consisted of 16 Xilinx XC4025 FPGAs, connected in a partial torus
with sixteen 1-MB SRAM chips. That same year, Dunn and Corke described an alternative implementation, also based on the census method, implemented using two CLP boards, a VMEbus circuit board containing one Xilinx XC4003H, two XC4008E and four XC4013E FPGAs [65]–[67].
Porter and Bergmann later described a generic hardware architecture for area-based correlation on
FPGAs using a variety of similarity measures [68]. Their architecture is essentially the same as
that described by Dunn and Corke, and bears similarities to nearly every hardware implementation
described since.
Another interesting and frequently cited implementation was described the same year by
Konolige [69]. This was a small, embedded, stereo vision module that employed a programmable
DSP for image processing. The system employed a LoG→SAD→LRCC stereo method. Unfortunately,
because it used a programmable DSP, the performance it achieved, even on low-resolution
images, could hardly be called real-time.
In 2001, Arias-Estrada and Xicotencatl described one of the first real-time stereo vision
systems implemented on a single FPGA [70]. Their system used a single Xilinx XCV800 FPGA
to implement a simple SAD correlation. Their implementation consumed only 46% of the device,
although it had a very limited disparity search range of 16 pixels, used a relatively small correlation
window of 7 × 7, and did not employ any preprocessing.
Following this work, one of the most complete descriptions of an FPGA-based stereo vision
system would be given by Miyajima and Maruyama [71]. Their implementation used an
XC2V6000 FPGA on a PCI board (the ADM-XRC-11 by Alpha Data) installed in a general-purpose
computer. The stereo algorithm employed was the LoG→SAD→LRCC combination,
making it the first detailed description of an LRCC implementation on an FPGA.
All of these stereo vision systems use a very similar architecture for stereo correlation. The
most significant difference between the architectures of these systems, other than the similarity
measure and preprocessing used, is the way in which the summing over the correlation windows
is computed. All of the similarity metrics discussed in this work involve summing over a W × W
window of values to obtain the similarity measure for a pixel. Since each new window overlaps
the previous, the previous sums can be reused to compute the sum of the new window. Reusing
the data from previous similarity measure computations can effectively reduce the complexity of
stereo correlation from $O(M^2 W^2 d)$ to $O(M^2 d)$, where $M$ is the image width and height,
$W$ is the window width and height, and $d$ is the number of disparity levels. Such
window summing optimizations are discussed in greater detail in Section A.7.
The first implementation to take advantage of such a window summing optimization was
described by Faugeras et al. [57], which used the column window summing optimization of
Figure A.11. Using this method, it is possible to compute the sum for each window, regardless of
window size, with just four arithmetic operations. The disadvantage is that we must maintain an
M-entry buffer and an MW-entry buffer for each disparity to be considered. Since stereo
correlation is generally parallelized by creating separate similarity computation units for each
disparity level, these buffers add significantly to the hardware costs. Additionally, the method
described requires several accesses to these buffers for each window sum computation, including the
fetch of the previous column sum, the fetch of the value to subtract from the column sum, the store
of the new column sum, and the fetch of the column sum for the trailing column of the window,
although this latter sum can be stored locally in a small FIFO delay buffer. The implementations
that use this summing optimization, as described, typically have a relatively large number of
distinct external memories, allowing for the independent access to the required buffers in memory
by the several similarity modules (e.g., [64]).
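The column summing idea can be modeled in software as follows. This Python sketch is an illustration written from the description above, not the circuit of [57]: `diff` is an assumed 2D list of per-pixel difference values for one disparity level, `col_sum` plays the role of the M-entry buffer, and reading `diff[y - w][x]` stands in for the MW-entry buffer. Once the buffers are primed, each window sum costs four additions or subtractions.

```python
def window_sums_column_method(diff, w):
    """Sum of each w x w window of `diff`, indexed by its bottom-right corner."""
    height, width = len(diff), len(diff[0])
    col_sum = [0] * width                       # M-entry buffer of column sums
    out = [[None] * width for _ in range(height)]
    for y in range(height):
        window = 0
        for x in range(width):
            # Operations 1 and 2: slide this column's sum down by one row.
            col_sum[x] += diff[y][x]
            if y >= w:
                col_sum[x] -= diff[y - w][x]    # value held in the MW-entry buffer
            # Operations 3 and 4: slide the window sum right by one column.
            window += col_sum[x]
            if x >= w:
                window -= col_sum[x - w]        # trailing column of the window
            if y >= w - 1 and x >= w - 1:
                out[y][x] = window              # window lies fully inside the image
    return out


# Synthetic difference values, used only to exercise the function.
diff = [[(3 * x + 5 * y) % 7 for x in range(6)] for y in range(5)]
sums = window_sums_column_method(diff, 3)
```

The inner loop does the same amount of work for any window size `w`, which is where the reduction from $O(M^2 W^2 d)$ to $O(M^2 d)$ comes from.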
Dunn and Corke [65], and later Porter and Bergmann [68], described a variation of the
window summing optimization, based on the row summing method of Figure A.12. In their
version, they compute row sums for the last W pixel differences for the bottom row of the window
and the row just above the top of the window. The pixel differences needed for these sums are
stored locally in two relatively small, W-entry FIFO delay buffers. They also maintain an M-entry
buffer of previous window sums. Thus, each window sum can be computed by taking the sum
of the window for the adjacent pixel in the previous row, subtracting the row sum for the top of
the window and adding the row sum for the bottom of the new window. As a result, their method
requires only a single read from and a single write to the M-entry window sum buffer to compute
each new window sum. The disadvantage of this method is that we must input two rows of the image
simultaneously, which requires an MW-entry buffer or additional reads, and we must compute the
difference measure for two pixels, at the top and bottom of the window, instead of just one.
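A software model of this row summing variation might look like the following Python sketch. It is written from the description above rather than from [65] or [68]: `diff` is an assumed 2D list of per-pixel differences for one disparity level, the two `deque` objects stand in for the W-entry FIFO delay buffers, and `prev_window` stands in for the M-entry buffer of window sums.

```python
from collections import deque


def window_sums_row_method(diff, w):
    """Sum of each w x w window of `diff`, indexed by its bottom-right corner."""
    height, width = len(diff), len(diff[0])
    prev_window = [0] * width                   # M-entry buffer of window sums
    out = [[None] * width for _ in range(height)]
    for y in range(height):
        bottom_fifo = deque([0] * w, maxlen=w)  # last W differences, bottom row
        top_fifo = deque([0] * w, maxlen=w)     # last W differences, row above window
        bottom_sum = top_sum = 0
        for x in range(width):
            # Two rows are consumed at once: the new bottom row and the row
            # that has just left the top of the window.
            new_bottom = diff[y][x]
            new_top = diff[y - w][x] if y >= w else 0
            bottom_sum += new_bottom - bottom_fifo[0]
            top_sum += new_top - top_fifo[0]
            bottom_fifo.append(new_bottom)
            top_fifo.append(new_top)
            # One read and one write of the window sum buffer per pixel.
            prev_window[x] += bottom_sum - top_sum
            if y >= w - 1 and x >= w - 1:
                out[y][x] = prev_window[x]
    return out


# Synthetic difference values, used only to exercise the function.
diff = [[(3 * x + 5 * y) % 7 for x in range(6)] for y in range(5)]
sums = window_sums_row_method(diff, 3)
```

The two reads `diff[y][x]` and `diff[y - w][x]` correspond to the two pixel differences that must be produced on each step, which is the cost noted in the text.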
Miyajima and Maruyama [71] proposed another variation, based on the column summing
method of Figure A.11, that eliminates the need for a large buffer for each disparity level. Instead
of delivering the image pixels row by row, they propose delivering the window pixels column by
column, one pixel at a time. A column module then uses an iterative serial adder to form the column
sum, which is then fed into the window module that maintains the sum of the previous W columns.
Since the similarity for each window column must be computed directly, the implementation has
a larger computational complexity of $O(M^2 W d)$. However, their implementation significantly
reduced the memory requirements from that of Dunn and Corke, since the M-entry buffer is not
required and only one W-entry buffer is needed to buffer the column sums. The disadvantage of
their implementation is that its throughput is reduced by a factor of W. This is a consequence of
the serialization of the window column pixels needed to compute the column sums. As a result,
they achieved less than real-time performance.

Since all of these implementations are architecturally quite similar and could be used with
nearly any of the similarity measures, an important question is then which preprocessing steps
and similarity measures require the smallest amount of hardware resources. Porter and Bergmann,
mentioned previously, compared the resource requirements of stereo implementations using
various similarity measures, providing some initial answers to this question. They found that the
Rank→SAD method required the least hardware, followed by simple SAD, Census→SHD, SSD,
then NCC [68]. However, the ordering here depends somewhat on the specific parameters applied
to each method.
The question was further investigated by Perez and Cabestaing, who included additional
optimizations such as the SMW and MSW in their analysis [72]. Unfortunately, they make the
oversimplifying assumption of using 1D correlation to approximate the 2D correlation windows
of standard, area-based correlation. This prevents them from taking into account the window summing optimizations employed by most stereo vision implementations. It also does not accurately
capture the way the census vector scales with census transform window size. Additionally, their
description suggests a misinterpretation of the census stereo method, using just the Hamming distance for the similarity measure rather than the sum of Hamming distances. As a result, their
findings indicate, incorrectly, that a census stereo implementation requires dramatically fewer resources than all other methods. If the Hamming distances are not summed over a window, the
stereo accuracy of the method is quite poor. Sadly, other researchers seem to have also made this
mistake, since the requirement for summing the Hamming distances is not very clear in the original
description of the census method [5].
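The distinction between a single Hamming distance and the sum of Hamming distances (SHD) can be made concrete with a short Python sketch. The census vectors below are synthetic integers, and the convention that a left pixel at column x matches the right pixel at column x − d is an assumption made for this example.

```python
def hamming(a, b):
    """Number of differing bits between two census vectors stored as ints."""
    return bin(a ^ b).count("1")


def shd(left_census, right_census, x, y, d, w):
    """Sum of Hamming distances over a w x w window centered at (x, y)."""
    r = w // 2
    return sum(
        hamming(left_census[j][i], right_census[j][i - d])
        for j in range(y - r, y + r + 1)
        for i in range(x - r, x + r + 1)
    )


# Synthetic census images: the left image is the right image shifted by one pixel.
right_census = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
left_census = [[0, 1, 2, 3, 4], [0, 6, 7, 8, 9], [0, 11, 12, 13, 14]]

shd_at_true_disparity = shd(left_census, right_census, x=2, y=1, d=1, w=3)
shd_at_wrong_disparity = shd(left_census, right_census, x=2, y=1, d=0, w=3)
single_pixel_distance = shd(left_census, right_census, x=2, y=1, d=1, w=1)
```

Calling `shd` with `w=1` reduces to the plain Hamming distance of one pixel pair, which is the interpretation the text identifies as a mistake; the window accumulation is what the correlation window hardware has to provide.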
Beginning in 2003, we see the emergence of FPGA-based, real-time stereo vision
implementations that do not use standard area-based correlation techniques. For example, several papers
came out of the University of Toronto [73]–[76]. These papers describe various implementations
of the local-weighted phase correlation stereo method using FPGAs. They do not provide a
comparison of their algorithm to other methods, making it difficult to determine the relative quality of
the stereo correlation. However, their implementation does require significant hardware resources,
making the method inappropriate for most resource-limited systems. Díaz et al. later proposed an
FPGA implementation of a stereo vision algorithm based on phase measurement [77]–[79]. They
achieve high performance using relatively few resources, but the quality of their disparity maps is
worse than conventional area-based correlation methods.
This summarizes the most important contributions to custom hardware implementations of
stereo vision systems. In addition to these, there are a number of published works that describe
specific implementations of a given stereo vision system. Unfortunately, most descriptions are
not sufficiently detailed to understand the key aspects of the implementation, and many of these
papers do not add to the contributions already mentioned. However, they do serve as examples
of implementation. For example, Jia et al. described a trinocular stereo implementation that uses
a LoG→SSAD stereo method on a single Xilinx XC2V2000 FPGA [80]. Gil et al. described
a feature-based stereo implementation that uses the census transform and SHD correlation [81].
Yariyama et al. described a stereo system based on SAD that uses iterative refinement to find the
best match [82]. Naoulou et al. described the implementation of their stereo system based on
the census method [83]. Cuadrado proposed an alternative architecture for stereo correlation that
uses the SAD similarity measure but has significantly higher on-chip memory requirements than
most other implementations [84]. Ambrosch et al. also described a stereo implementation based on
SAD [85].
The main difference between most of these later implementations is the level of technological advancement that was available when the system was built. Those that were implemented more
recently were able to take advantage of improved FPGA technology. Many have sought to compare
the performance of their stereo systems to previously published results by measuring throughput
using various figures of merit, such as disparities per second (i.e., the total number of disparities
evaluated or searched per second). However, since most of these systems are architecturally quite
similar, the main factor allowing them to achieve this level of performance is likely the process
technology used to fabricate the FPGA (e.g., 45 nm vs. 90 nm). This affects the maximum on-chip
clock rates as well as the logic capacity of the FPGA, which in turn dictates to what extent one can
parallelize the disparity search. Beyond this, the most influential factors are the low-level design
details, such as the extent of pipelining and data widths chosen, which are almost never described.
Most high-performance, real-time stereo vision systems that use local stereo methods, such
as those described in this section, employ a stereo correlation architecture similar to that of
Figure 6.1. In this architecture, correlation begins by applying any needed preprocessing on the input
images. This would include, for example, the rank and census transforms, LoG, or other filters.
Next, one image pixel stream is delayed relative to the other so that, on each clock cycle, a pixel
from the original copy of one image and a corresponding pixel from a shifted version of the other
image are input into each similarity module. Thus, each similarity module computes the
similarity measure for a pair of pixels at a specific disparity level. In other words, Similarity Module 0
computes the similarity for all pixel pairs having a disparity of 0, Similarity Module 1 computes
the similarity for all pixel pairs having a disparity of 1, and so on, for a total of d disparity levels.
Since most similarity measures require summation over a window of pixels, this is handled in each
similarity module. Finally, each select-best module outputs the disparity having the best similarity
score, choosing between the score for its disparity level and the best from all previous disparity
levels. The output of the select-best module d-1 is the resulting disparity map.
Usually, this architecture is configured so that one pixel from each image can be input into
the system on each clock cycle. Therefore, d disparities are computed on each clock cycle. In this
case, the number of disparities per second is obtained by multiplying d by the clock frequency.
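
As a behavioral illustration of the data flow described above, the following Python sketch models the architecture for a single scanline. It is a software model under simplifying assumptions, not the hardware design itself: the similarity measure is a plain absolute difference with no window summation, lower scores are treated as better, ties keep the lower disparity, and the function name `best_disparities` is hypothetical.

```python
def best_disparities(left, right, d):
    """Return (disparity, score) for each pixel of one scanline.

    left, right: sequences of preprocessed pixel values for the same row.
    d: number of disparity levels searched (0 .. d-1), assumed >= 1.
    """
    results = []
    for x in range(len(left)):
        best = None  # (disparity, score) carried along the select-best chain
        for k in range(min(d, x + 1)):
            # Similarity module k pairs left[x] with right[x - k].
            score = abs(left[x] - right[x - k])
            # Select-best k keeps the earlier winner unless strictly beaten.
            if best is None or score < best[1]:
                best = (k, score)
        results.append(best)
    return results


# The left row is the right row shifted by one pixel, so disparity 1 wins
# wherever a pixel at x - 1 exists.
print(best_disparities([10, 20, 30, 40], [20, 30, 40, 50], d=3))
```

Each iteration of the inner loop plays the role of one similarity module together with its select-best stage. In hardware, all d of these operate concurrently on every clock cycle.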

[Figure 6.1 block diagram: the left and right pixel streams each pass through a preprocessing block into a pixel distribution network of Z^-1 delay elements, which feeds Similarity Modules 0 through (d-1). The similarity scores are reduced by the select-best modules to the disparity and similarity for the best match.]

Figure 6.1: General stereo correlation architecture.

6.2 Hardware Architecture

This section will describe the architecture that will be used to evaluate the hardware resource
requirements of the methods described in Chapters 4–5. The architecture described here is
similar to that described in [65] and uses the same general architecture as Figure 6.1. However,
several architectural changes are proposed here that increase potential throughput and significantly
reduce the amount of memory required.

6.2.1 Pipelining

The architecture shown in Figure 6.1 has a large critical path through the select-best modules
that limits the maximum clock rate. As described, this path includes $\lceil \log_2 d \rceil$ select-best
modules, each of which includes a comparison and a multiplexer. With a large disparity search
range (d), this delay can be excessive. Since each LUT in most FPGA architectures is typically
paired with a flip-flop, pipeline registers can be added here effectively without using any additional
logic elements.
The pixel distribution network can also be a hindrance to achieving high clock rates. Because both the disparity search range and the amount of logic associated with each disparity level
can be large, the locations of the similarity modules can be physically distant on the chip after
final placement, leading to long delays. We can cut these long paths and separate the distribution
network delays from the input logic of the similarity modules by pipelining the pixel distribution
network as well.
The proposed fully-pipelined architecture is shown in Figure 6.2. This architecture allows
a new pixel from each image to be input on each clock cycle and allows for d disparities to be
computed on each clock cycle while ensuring that the clock rate is maximized. Of course, the
similarity modules also must be pipelined internally to maximize throughput.
Note that the tree structure of the select-best modules has been abandoned in this architecture. The linear structure proposed here requires the same number of select-best modules as
the tree structure yet requires significantly fewer registers to correctly synchronize the data for
comparison (d rather than 2d).

In cases where the pixel distribution network is not the critical path, the additional pipelining in the pixel distribution network can be removed (as in Figure 6.1) and the linear select-best
structure can be replaced with a tree structure having registers on the output of each select-best
module. This would reduce the number of flip-flops required. However, for the build results shown
later in this chapter, the architecture of Figure 6.2 will be assumed.

[Figure 6.2 block diagram: the left and right pixel streams each pass through a preprocessing block into a pipelined pixel distribution network (Z^-1 registers on one stream, Z^-2 registers on the other), which feeds Similarity Modules 0 through (d-1). A linear chain of registered select-best modules, Select Best 1 through Select Best (d-1), produces the disparity and similarity for the best match.]

Figure 6.2: Proposed, fully-pipelined correlation architecture.

6.2.2 Memory Architecture

Most stereo vision implementations, including the one described here, use memory to
buffer computations that can be reused later. This dramatically decreases the required number
of computations and the amount of associated logic for the stereo implementation at the expense
of additional memory. This section describes how memory will be used by the preprocessing
and similarity module blocks. Several optimizations are also described that reduce the memory
requirements compared to previous works.

Similarity Module
The architecture described in [65] uses a window summation optimization in which small
delay buffers are used to compute the sum of the top and bottom rows of the correlation window.
These row sums are then used to compute the window sums, as shown in Figure A.12. The architecture for a similarity module based on this window summation method is shown in Figure 6.3.

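[Figure 6.3 block diagram: the pixel pairs IL(x,y), IR(x-d,y) and IL(x,y-W), IR(x-d,y-W) feed two Pixel Difference blocks, each followed by a Sum W block that forms a window row sum. The two row sums are added to and subtracted from the previous window sum, held in the Image Width Delay memory, to produce the Window Sum.]

Figure 6.3: Similarity module based on [65].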

This architecture requires a small memory for each Sum W module that is wide enough to
hold a pixel difference measure and has a depth equal to W, where W is the width of the correlation
window. The architecture also requires a relatively large memory (labeled Image Width Delay in
the figure) that is wide enough to hold a full window sum and has a depth equal to the width of the
image.
The maximum pixel difference output value with this architecture is n, where n is the number
of edges in the census or rank graph. Therefore, the required width for each pixel difference
is $\lceil \log_2 (n + 1) \rceil$ bits. Also, the maximum value of the window sum is $nW^2$, where W is the width
and height of the correlation window. Therefore, the required width for each window sum value
is $\lceil \log_2 (nW^2 + 1) \rceil$ bits. As a result, using this architecture, the total number of memory bits
required for d similarity modules is

$$ d \left[ 2W \lceil \log_2 (n + 1) \rceil + M \lceil \log_2 (nW^2 + 1) \rceil \right], \tag{6.1} $$

where M is the width of the image.
An alternative to the window summation method of [65] is to compute a window column
sum and then use column sums to compute the window sum, as shown in Figure A.11. Although
computationally equivalent, instead of requiring an M-deep memory of previous window sums,
this method requires an M-deep memory of previous column sums, which are typically several
bits narrower. Additionally, only one W -deep buffer is required for each module instead of two,
although its width is a wider column sum instead of a pixel difference.
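
The column-sum method can be sketched in software as follows. This is a minimal illustration, assuming a 2D list `diff` of per-pixel difference values for one disparity level; the function name `window_sums` is hypothetical. The list `col_sum` stands in for the M-deep memory of column sums, and the subtraction of `col_sum[x - W]` stands in for the single W-deep buffer.

```python
def window_sums(diff, W):
    """Compute W x W window sums of `diff` using running column sums.

    Returns a 2D list in which entry [y][x] is the sum over rows
    y-W+1..y and columns x-W+1..x, or None where the window does not fit.
    """
    height, width = len(diff), len(diff[0])
    col_sum = [0] * width  # one running column sum per image column
    out = [[None] * width for _ in range(height)]
    for y in range(height):
        win = 0
        for x in range(width):
            # Column sum: add the new bottom pixel, drop the pixel at y - W.
            col_sum[x] += diff[y][x]
            if y >= W:
                col_sum[x] -= diff[y - W][x]
            # Window sum: add the newest column sum, drop the one at x - W.
            win += col_sum[x]
            if x >= W:
                win -= col_sum[x - W]
            if y >= W - 1 and x >= W - 1:
                out[y][x] = win
    return out
```

Each output costs two additions and two subtractions regardless of W, which is the computational saving that the buffering provides.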
The architecture for the similarity module using this method is shown in Figure 6.4. In this
case, the maximum value for the column sum is equal to $nW$. Therefore, the required width for
each column sum is $\lceil \log_2 (nW + 1) \rceil$ bits. As a result, the total number of memory bits required for d
of the proposed similarity modules is

$$ d \, (W + M) \lceil \log_2 (nW + 1) \rceil. \tag{6.2} $$

[Figure 6.4 block diagram: the pixel pairs IL(x,y), IR(x-d,y) and IL(x,y-W), IR(x-d,y-W) feed two Pixel Difference blocks. Their outputs are added to and subtracted from the previous column sum, held in the Image Width Delay memory, to form the Window Column Sum, and a single Sum W block then produces the Window Sum.]

Figure 6.4: Proposed similarity module architecture.

Figure 6.5 shows the total amount of memory required for d similarity modules using the
proposed method and that of [65] for the generalized census and rank with different numbers of
graph edges (n). This figure assumes a correlation window size (W ) of 13, a disparity search (d) of
64, and an image width (M) of 640. These values are typical of the parameters used to achieve the
highest correlation accuracy using the test image datasets and algorithms in the previous chapters.
As can be seen in the figure, the original method requires between 97% and 30% more memory
than the proposed method.
Note that the similarity module memory requirements for the rank methods are identical
to those of the census methods (Equation 6.2) since they both require the same width for the pixel
difference.
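
Equations 6.1 and 6.2 can be evaluated directly to check the end points quoted above. The sketch below is illustrative only. The function names are hypothetical, and the upper end of the edge range (n = 48, the full 7 × 7 census) is an assumption about what the figure covers. Python's `int.bit_length` is used because it equals $\lceil \log_2 (v + 1) \rceil$ for any non-negative integer v.

```python
def bits_for(max_value):
    """Bits needed to hold values 0..max_value, i.e. ceil(log2(max_value + 1))."""
    return max_value.bit_length()


def previous_similarity_bits(d, W, M, n):
    """Equation 6.1: two W-deep row-sum buffers plus an M-deep window-sum memory."""
    return d * (2 * W * bits_for(n) + M * bits_for(n * W * W))


def proposed_similarity_bits(d, W, M, n):
    """Equation 6.2: one W-deep and one M-deep memory of column sums."""
    return d * (W + M) * bits_for(n * W)


for n in (1, 48):
    prev = previous_similarity_bits(64, 13, 640, n)
    prop = proposed_similarity_bits(64, 13, 640, n)
    print(n, prev, prop, f"{100 * (prev / prop - 1):.0f}% more")
```

With W = 13, d = 64, and M = 640, this gives roughly 97% more memory for the previous method at n = 1 and roughly 30% more at n = 48.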

Preprocessing and Transform Placement
The similarity-module architecture of Figure 6.4 requires that we provide top and bottom
pixels for the correlation window (i.e., pixels $(x, y)$ and $(x, y - W)$ from each image) in order to
efficiently compute the window sum. This requires a relatively large memory that is $WM$ words
deep, where W is the correlation window size and M is the image width. Therefore, the width of
this memory can have a significant impact on the overall memory required by the stereo system.

[Figure 6.5 plot: memory (kbit, 0 to 600) versus number of graph edges (0 to 45), with one curve for the Proposed architecture and one for the Previous architecture.]

Figure 6.5: Memory requirements for the proposed similarity module
architecture and the previous architecture described in [65].

We can conserve memory for the generalized census by controlling at what point we perform the census transform. We can perform the transform at the very beginning, before the buffering needed to obtain the top and bottom pixels, or we can perform the transform after the buffering.
These two options are shown in Figure 6.6. Notice that if we do the top/bottom buffering first then
we must perform the census transform on four image streams (left-top, left-bottom, right-top, and
right-bottom) rather than just two (left image and right image).
Unlike some other preprocessing operations, the census transformed pixel can be much
larger than the original pixel. For example, the original 7 × 7 census transform creates a vector
that is 48 bits wide, six times wider than the original 8-bit input pixels. In this case, it makes sense
to buffer the untransformed 8-bit pixel first rather than buffer the 48-bit transformed pixel. Even
though buffering first requires that the census transform be computed twice as many times, the
amount of logic required to compute the census transform can be small compared to the amount of
memory required to buffer a large census vector.
On the other hand, the generalized census vector may be as small as or smaller than the original pixel size. In this case it makes sense to perform the transform first to reduce the width of the
top/bottom buffer memory.

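[Figure 6.6 block diagrams: (a) Transform first: the pixel stream passes through one Transform block and then the Buffer Top & Bottom block, which outputs T(x,y-W) and T(x,y). (b) Buffer first: the pixel stream passes through the Buffer Top & Bottom block and then two Transform blocks, which output T(x,y-W) and T(x,y).]

Figure 6.6: Preprocessing architecture options. (a) Compute the transform first.
(b) Buffer to obtain the top and bottom window pixels first, requiring twice as many
transforms.
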
If we perform the census transform first, as in Figure 6.6(a), we require a memory that
is $M(N - 1)$ words deep and p bits wide, where N is the width and height of the census transform
window and p is the number of bits per pixel. Therefore, the census transform requires
$pM(N - 1)$ memory bits. The top/bottom buffer requires a memory that is $MW$ words deep and n bits
wide, for a total of $nMW$ memory bits. Since there are two preprocessing blocks, one for the left
image and one for the right image, the total number of memory bits required for the preprocessing
blocks is

$$ 2 \left[ pM(N - 1) + nMW \right]. \tag{6.3} $$

If we perform the buffering first, as in Figure 6.6(b), the top/bottom buffer requires $pMW$ memory
bits and each transform again requires $pM(N - 1)$ memory bits. Since each preprocessing block now
contains two transforms, the total number of memory bits required for the preprocessing blocks in
this case is

$$ 2 \left[ pMW + 2pM(N - 1) \right]. \tag{6.4} $$

The total amount of memory required for a generalized census stereo system is the sum
of Equation 6.2 and either Equation 6.3 or Equation 6.4. Figure 6.7 shows the total amount of
memory required for the proposed correlation architecture, including the census transform, top
and bottom pixel buffering, and the similarity modules. As before, this figure assumes W = 13,
d = 64, and M = 640. Additionally, this figure assumes a 7 × 7 census transform window (N = 7)
and 8-bit pixels (p = 8). In the figure, we see that, for this configuration, computing the transform
first is optimal for transform graphs having less than 12 edges and that performing the top/bottom
buffering first is optimal for graphs having 12 or more edges.
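
The crossover point can be checked by evaluating Equations 6.3 and 6.4 directly, since the similarity-module memory of Equation 6.2 is the same for both placements. The following is a small sketch with hypothetical function names, using the parameter values stated above.

```python
def transform_first_bits(p, M, N, n, W):
    """Equation 6.3: transform line buffers plus an n-bit-wide top/bottom buffer."""
    return 2 * (p * M * (N - 1) + n * M * W)


def buffer_first_bits(p, M, N, W):
    """Equation 6.4: p-bit-wide top/bottom buffer plus two transforms per image."""
    return 2 * (p * M * W + 2 * p * M * (N - 1))


p, M, N, W = 8, 640, 7, 13
# Smallest edge count at which buffering first needs strictly less memory.
crossover = next(
    n for n in range(1, 49)
    if buffer_first_bits(p, M, N, W) < transform_first_bits(p, M, N, n, W)
)
print(crossover)
```

For this configuration the buffer-first cost is constant at 256,000 bits, while the transform-first cost grows by 16,640 bits per edge, so the two curves cross between n = 11 and n = 12.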

[Figure 6.7 plot: memory (kbit, 0 to 1400) versus number of graph edges (0 to 45), with one curve for Transform First and one for Buffer First.]

Figure 6.7: Memory requirements for the proposed architecture
using the generalized census.

For the proposed generalized rank transforms, the transformed pixel is always smaller than
the original 8-bit pixel, so it is optimal to perform the rank transform first in order to minimize
the amount of memory required. In this case, the rank transform requires $pM(N - 1)$ memory
bits, as with the census, and the top/bottom buffering requires $MW \lceil \log_2 (n + 1) \rceil$ memory
bits. In general, the total number of memory bits required for the generalized rank
preprocessing blocks is

$$ 2 \left[ pM(N - 1) + MW \lceil \log_2 (n + 1) \rceil \right]. \tag{6.5} $$

The total amount of memory required for a generalized rank stereo system is given by
the sum of Equations 6.2 and 6.5. This is depicted in Figure 6.8. This figure assumes the same
parameters as Figure 6.7, except that the correlation window size (W ) is 17. This change is appropriate since the rank method achieves a higher correlation accuracy on the test image datasets
using this size. Notice that for most sparse configurations, the generalized census actually requires
less memory than the generalized rank.
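
The comparison between the two families can be reproduced from Equations 6.2 through 6.5. The sketch below is illustrative and rests on several assumptions: a 7 × 7 transform window (N = 7) for both methods, W = 13 for the census and W = 17 for the rank, the cheaper of the two transform placements for the census, and hypothetical function names.

```python
def bits_for(max_value):
    """ceil(log2(max_value + 1)) for a non-negative integer."""
    return max_value.bit_length()


def similarity_bits(n, W, d=64, M=640):
    """Equation 6.2."""
    return d * (W + M) * bits_for(n * W)


def rank_total_bits(n, W=17, N=7, p=8, M=640, d=64):
    """Equation 6.5 (transform first) plus Equation 6.2."""
    preprocessing = 2 * (p * M * (N - 1) + M * W * bits_for(n))
    return preprocessing + similarity_bits(n, W, d, M)


def census_total_bits(n, W=13, N=7, p=8, M=640, d=64):
    """Cheaper of Equations 6.3 and 6.4, plus Equation 6.2."""
    transform_first = 2 * (p * M * (N - 1) + n * M * W)
    buffer_first = 2 * (p * M * W + 2 * p * M * (N - 1))
    return min(transform_first, buffer_first) + similarity_bits(n, W, d, M)


for n in (1, 4, 16):
    print(n, census_total_bits(n), rank_total_bits(n))
```

Under these assumptions the census total is smaller at n = 1 and n = 4, while the rank total is smaller at n = 16, which is consistent with the observation about sparse configurations.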

[Figure 6.8 plot: memory (kbit, 0 to 1400) versus number of graph edges (0 to 45).]

Figure 6.8: Memory requirements for the proposed architecture
using the generalized rank.

Memory Sharing
Although the total number of memory bits required for the stereo implementation is very
useful in comparing the requirements of different implementations, the sizes of the memories on
FPGAs are fixed, supporting only a small number of configurations and usually having depths
that are a power of two. This often leads to wasted memory since the FPGA cannot efficiently
implement the exact memory size needed.

The depth of most of the memories required for a stereo system is a function of the image
width, which is determined by the camera or application. For example, if the input image width is
512 pixels then most of the memory in the stereo system will need to be at least 512 entries deep.
Since this is a power of two, an FPGA may be able to implement this depth exactly, resulting in less
waste. However, if the image width is 640, a standard image resolution, then the memories will
likely need a depth of 1,024, leading to a utilization of 62.5%. By reducing the image width down
to the nearest power of two, we can take advantage of this otherwise wasted memory by increasing
each memory’s configured width, thus reducing the total number of memory blocks required for
the system.
Similarly, it is inefficient to use memory widths that are not supported by the memory
blocks on the underlying FPGA architecture. For example, suppose that the column-sum buffer
(i.e., the image width delay memory) in the similarity module of Figure 6.4 has a width of 10 bits
(this is the column-sum width for the original 7 × 7 census and rank when combined with 13 × 13
or 17 × 17 correlation). The nearest memory width that is large enough and is supported by Xilinx
FPGAs is 18 bits. This means that about 44% of the memory width is wasted. Since there are
generally many similarity modules, this waste is quite significant.
Fortunately, in this case there is a solution that dramatically increases memory utilization
and therefore conserves memory resources on the FPGA. Most implementations of the proposed
stereo architecture can be designed, with no additional hardware, so that reads and writes to the
image-width-delay memories are synchronized for all disparity levels. In this case, we can share
physical memories between similarity modules, since they will be guaranteed to have the same
read/write access patterns. Using the proposed fully-pipelined architecture of Figure 6.2, a sufficient condition to guarantee this synchronization is that the data introduction interval be restricted to
1 (i.e., exactly one pair of pixels is input on each clock cycle). For the architecture of Figure 6.1,
there is no such requirement.
Following our previous example, suppose our stereo implementation needs to support 64
disparity levels and an image width of 640 pixels, and we again need a 10-bit column-sum buffer.
Without sharing, we require 64 block memories, each of which is 18 bits wide and 1,024 entries deep.
However, if we share the memories across similarity modules then we require $\lceil 64 \cdot 10/18 \rceil = 36$
block memories, a savings of nearly 44%.

This example actually represents the smallest amount of memory saved using this technique
among all the configurations described in Section 6.3 since 10 bits is the largest column-sum buffer
width needed and Xilinx memory blocks with a depth of 1,024 words can only have a width of 18
bits. At the other extreme, suppose we are using a 1-point or 1-edge census transform. In this case,
assuming a correlation window size of 13, we require a 4-bit column-sum buffer, which a Xilinx
FPGA can implement using a single memory block configured to be 4 bits wide. Therefore, in this
situation we would again need 64 memory blocks. However, if we use memory sharing then we
can configure the memories to be 18 bits wide, resulting in $\lceil 64 \cdot 4/18 \rceil = 15$ memory blocks, or a
savings of about 77%.
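
The two examples above follow from a single ceiling division. The sketch below assumes, as in the text, that one block memory per similarity module is needed without sharing and that a shared block is configured 18 bits wide and 1,024 entries deep. The function names are hypothetical.

```python
def shared_blocks(d, sum_width, block_width=18):
    """Blocks needed when the column sums of all d modules are packed side by side."""
    total_width = d * sum_width
    return (total_width + block_width - 1) // block_width  # ceiling division


def sharing_savings(d, sum_width, block_width=18):
    """Fraction of blocks saved relative to one block per similarity module."""
    return 1 - shared_blocks(d, sum_width, block_width) / d


for width in (10, 4):
    print(width, shared_blocks(64, width), f"{100 * sharing_savings(64, width):.1f}%")
```

For d = 64 this gives 36 blocks (43.8% saved) for a 10-bit column sum and 15 blocks (76.6% saved) for a 4-bit column sum.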

Reduced Transform Window
Section 5.1.2 discussed how one of the advantages of the generalized census transform is
that it often allows us to reduce the size of the transform window while maintaining the same
number of edges by overlapping them in ways that were not permitted with the sparse or the
original census transform. With the stereo architecture defined, we can determine how much this
change affects the overall memory requirements of the stereo system. Referring to Equations 6.3
and 6.4, we can see that the amount of memory required for the transforms increases linearly as a
function of the transform window size, N, except at the point where performing the transform after
buffering becomes optimal.
Figure 6.9 shows the total amount of memory required for the correlation system if we fix
all parameters other than the transform window size. In this case, we use the same parameters
as before (M = 640, p = 8, W = 13, and d = 64) but fix the number of edges to n = 16. As a
point of reference, the 16-point sparse census (N = 7) requires 576.5 kbit, whereas the 16-edge
generalized graph (N = 5) requires 536.5 kbit, a savings of about 7%. In contrast, if we instead
tried to conserve memory by reducing the number of points in the transform, without changing the
size of the transform window, then we would have to reduce the number of points from n = 16 to
n = 11 before having any effect since the critical memory widths remain the same down to n = 12
(see Figure 6.7).
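
The two reference values can be recomputed from Equations 6.2 through 6.4 by sweeping the transform window size. This is a sketch under stated assumptions: the cheaper of the two transform placements is chosen for each N, 1 kbit is taken to be 1,024 bits (which reproduces the figures quoted above), and the function names are hypothetical.

```python
def bits_for(max_value):
    """ceil(log2(max_value + 1)) for a non-negative integer."""
    return max_value.bit_length()


def total_census_bits(N, n=16, p=8, M=640, W=13, d=64):
    """Total correlation memory for an n-edge census with an N x N window."""
    transform_first = 2 * (p * M * (N - 1) + n * M * W)       # Equation 6.3
    buffer_first = 2 * (p * M * W + 2 * p * M * (N - 1))      # Equation 6.4
    similarity = d * (W + M) * bits_for(n * W)                # Equation 6.2
    return min(transform_first, buffer_first) + similarity


for N in (3, 5, 7, 9, 11, 13):
    print(N, total_census_bits(N) / 1024, "kbit")
```

With n = 16, buffering first is the cheaper placement for every window size in this range, so the total grows by a constant 20 kbit for each increase of one in N.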

[Figure 6.9 plot: memory (kbit, 0 to 800) versus transform size (3 to 13).]

Figure 6.9: Total stereo correlation memory as a function
of transform window size.

6.3 Resource Requirements

This section shows the actual synthesis results for the stereo correlation system proposed
and described in Section 6.2. To allow for easy comparison between the different methods, a
fully parameterizable stereo correlation system was designed in VHDL that supports any SAD,
generalized rank, or generalized census stereo configuration. Since the original census, original
rank, sparse census, and sparse rank are subsets of the generalized methods, any of the rank and
census methods can be implemented using this design. SAD is used as a point of comparison, since
it is very common and is generally considered to be the simplest and least resource intensive of
the local methods. All implementations use the same parameters that were used in Chapters 4–5,
except that the image width (M) is assumed to be 640 and the disparity range (d) is fixed at 64. All
results were generated using Xilinx ISE 12.4 targeting the Xilinx Virtex-6 LXT FPGA device. The
default optimization settings were used (i.e., optimize for speed using the normal effort level).
Table 6.1 shows the actual resource requirements for all the stereo methods discussed in
Chapters 4–5. The most important column in this table is the number of LUTs (lookup tables),
which corresponds to the amount of logic required to implement the method. Next to this, the table
also shows the minimum memory requirement for the configuration.
For reference, the table also shows the number of 1-bit registers (FFs), the number of
18-kbit block memories (RAMs), and the maximum operating frequency reported by synthesis,
although these numbers can vary significantly depending on the details of the implementation. For
example, the number of registers corresponds to the amount of pipelining used more than it does
to the minimum logic requirements of the architecture. Similarly, the actual maximum operating
frequency is highly dependent on the extent of pipelining as well as the final placement, which is
heavily influenced by outside logic. In the examples shown here, the transform is always performed
first, resulting in an inflated number of RAMs for the full, 16-point, and 16-edge census variants.
Finally, the average accuracy on the test image data sets (Rocks, Baby, Wood, and Cloth)
is also shown for easy comparison of correlation accuracy.
The results in Table 6.1 show the benefit of the sparse and generalized transforms. For
example, the 2-edge generalized census has been shown to perform nearly as well as the original
census and much better than SAD. Yet it requires 88% less logic and 61% less memory than the
full census. SAD is considered one of the simplest stereo methods, yet the 2-edge census requires
61% less logic and 59% less memory than SAD while delivering 4.4% higher correlation accuracy
on the test images. For increased robustness, larger census graphs can be used, such as the 8-point
census, which still requires 74% less logic than the full census and 19% less logic than SAD while
delivering the same correlation accuracy as the full census and 4.9% better correlation accuracy
than SAD on the test images.
Although it is generally assumed that the rank requires less logic than the census, we see
that for small numbers of edges, the census actually requires less logic and memory, while still
being slightly more accurate than the original rank on the test dataset. This is a result of the rank
method requiring a larger 17 × 17 correlation window to achieve maximum correlation accuracy.
If we reduce the correlation window of the rank methods to match the 13 × 13 size that was used
for the census then the memory requirements become similar to that of the census for the most
sparse transforms at the expense of a slight reduction in correlation accuracy for our test dataset.
The results also show the benefit of the bit-optimized rank (i.e., the 15-point, 7-point, and
3-point sparse rank). Each of these shows a considerable drop in the resource requirements relative
to the 16-point, 8-point, and 4-point versions, respectively. For example, the 7-point sparse rank
requires essentially the same amount of logic and memory as the 4-point transform. Based on this,
there is no clear reason to ever use a 4-point transform rather than a 7-point since a 4-point offers
no resource savings and will generally have reduced correlation accuracy.

Table 6.1: Hardware Resource Requirements for Stereo Correlation
Methods. The census and SAD methods use a 13 × 13 correlation
window and the rank method uses a 17 × 17 correlation window.

| Method                  | LUTs   | Memory (kbits) | FFs    | RAMs | Freq. (MHz) | Average Accuracy |
|-------------------------|--------|----------------|--------|------|-------------|------------------|
| SAD                     | 7,806  | 620            | 9,831  | 59   | 487         | 85.72            |
| Census, 7x7             | 24,787 | 658            | 29,860 | 138  | 298         | 89.88            |
| Rank, 7x7               | 7,527  | 598            | 8,876  | 54   | 310         | 88.82            |
| Sparse Census, 16-point | 9,897  | 577            | 12,743 | 67   | 464         | 90.02            |
| Sparse Census, 12-point | 8,166  | 537            | 10,232 | 57   | 486         | 89.98            |
| Sparse Census, 8-point  | 6,346  | 456            | 8,005  | 45   | 506         | 89.88            |
| Sparse Census, 4-point  | 4,200  | 350            | 5,462  | 34   | 510         | 89.96            |
| Sparse Census, 2-point  | 2,961  | 237            | 3,831  | 22   | 518         | 89.01            |
| Sparse Census, 1-point  | 2,677  | 180            | 2,877  | 17   | 540         | 87.58            |
| Sparse Rank, 16-point   | 6,107  | 536            | 7,572  | 48   | 418         | 88.94            |
| Sparse Rank, 15-point   | 5,797  | 474            | 6,741  | 43   | 491         | 88.97            |
| Sparse Rank, 12-point   | 5,350  | 454            | 6,412  | 41   | 498         | 88.92            |
| Sparse Rank, 8-point    | 5,222  | 454            | 6,294  | 41   | 502         | 88.97            |
| Sparse Rank, 7-point    | 4,356  | 391            | 5,398  | 35   | 506         | 88.98            |
| Sparse Rank, 4-point    | 4,342  | 391            | 5,346  | 35   | 506         | 88.54            |
| Sparse Rank, 3-point    | 3,682  | 329            | 4,389  | 30   | 510         | 89.09            |
| Sparse Rank, 2-point    | 3,587  | 289            | 4,219  | 26   | 510         | 87.20            |
| Sparse Rank, 1-point    | 2,638  | 227            | 3,134  | 20   | 518         | 88.11            |
| Gen. Census, 16-edge    | 9,874  | 537            | 12,644 | 65   | 464         | 89.65            |
| Gen. Census, 12-edge    | 7,771  | 537            | 10,261 | 57   | 486         | 89.75            |
| Gen. Census, 8-edge     | 6,393  | 456            | 8,101  | 45   | 506         | 89.66            |
| Gen. Census, 4-edge     | 4,137  | 330            | 5,398  | 32   | 510         | 89.69            |
| Gen. Census, 2-edge     | 3,013  | 257            | 3,903  | 24   | 518         | 89.49            |
| Gen. Census, 1-edge     | 2,650  | 180            | 2,845  | 17   | 540         | 87.51            |
| Gen. Rank, Std. 3-edge  | 3,678  | 319            | 4,344  | 30   | 510         | 88.82            |
| Gen. Rank, Ovr. 3-edge  | 3,667  | 309            | 4,309  | 28   | 510         | 88.71            |
| Gen. Rank, Lin. 3-edge  | 3,573  | 289            | 4,219  | 26   | 510         | 88.69            |

The estimated clock rates for this architecture are quite high but do decrease as the transform size increases. For the largest transforms, the critical logic path is the population count
computation (part of the Hamming distance), which could be further pipelined to increase clock
rate. This would be most important for the original 7 × 7 census and rank, which suffer the most
from this critical path length. For sparser methods, the critical path becomes the select-best cores,
which consist of a comparison and multiplexer. This block could also be further pipelined for
potentially increased throughput. For reference, if we assume 500-MHz operation, the design results in a stereo performance of 32 billion disparities per second and a frame rate of about 1,627
frames per second at VGA resolution (640 × 480) or 381 frames per second at SXGA resolution
(1280 × 1024).
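
These throughput figures follow from simple arithmetic, sketched below. The calculation assumes one pixel pair is accepted on every clock cycle and ignores any blanking intervals or pipeline fill time. The function names are hypothetical.

```python
def disparities_per_second(d, clock_hz):
    """d disparity levels are evaluated on every clock cycle."""
    return d * clock_hz


def frames_per_second(clock_hz, width, height):
    """One pixel pair per clock, so a frame takes width * height cycles."""
    return clock_hz / (width * height)


CLOCK_HZ = 500_000_000  # assumed 500-MHz operation
print(disparities_per_second(64, CLOCK_HZ))               # disparities per second
print(int(frames_per_second(CLOCK_HZ, 640, 480)))         # VGA frames per second
print(int(frames_per_second(CLOCK_HZ, 1280, 1024)))       # SXGA frames per second
```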
These results show that there are a number of useful trade-offs that can be made between
transform window size, transform graph, and correlation window size, allowing for very high-performance and high-accuracy stereo systems to be implemented, even in resource-constrained
systems. Furthermore, some of the sparse and generalized examples have significant resource
and/or accuracy advantages over more commonly used methods such as SAD. Since SAD is much
simpler than most other stereo methods, the resource savings will also be significant when compared to most other commonly used stereo algorithms, such as those described in Appendix B.

6.4 Summary

This chapter has introduced a high-performance hardware architecture for implementing
the sparse and generalized stereo methods, as well as traditional local methods. Several hardware optimizations have been introduced that improve upon the architectures presented in previous
works. For example, the proposed similarity module resulted in between 23% and 49% savings in
memory compared to the similarity-module architecture described in previous works. The concept
of transform placement for optimal memory usage with the census methods was also introduced,
resulting in up to 47% total memory savings compared to the naive configuration used in previous
works. Finally, the concept of memory sharing between similarity modules was introduced.
For the configurations described in Table 6.1, this memory sharing leads to savings between 44%
and 77% in the number of 18-kbit memory blocks required to implement the array of similarity
modules in a Xilinx FPGA.
Most importantly, this chapter also reported the actual resource requirements for the stereo
methods described in the previous chapters. These results quantify the potential resource savings that can be achieved using the proposed algorithms, demonstrating savings as high as 89%
compared to the original census transform. In fact, the savings are so significant that most of the
proposed algorithms require even less hardware to implement than the SAD stereo method, which
is traditionally considered to be the simplest. Yet the accuracy is always better than SAD on the test
images, reaching as high as a 4.9% improvement over SAD for the largest implementation that is still
smaller than SAD. This is a key finding since none of the local methods described in Section 3.3
offer that much improvement and all of them require more hardware resources than SAD.

CHAPTER 7. THE MULTI-BIT CENSUS TRANSFORM

This chapter will introduce and analyze the multi-bit census transform. This transform is
intended to provide a greater level of pixel discrimination than that which is offered by the original
census transform. Although not known at the time this research was performed, a related work can
be found in [86].
Section 7.1 will provide some additional background on the motivation for a multi-bit census transform. Section 7.2 will formalize the definition of the multi-bit census. Section 7.3 will
qualitatively describe how a hardware implementation of the multi-bit census differs from the
original census. Section 7.4 will provide results for correlation accuracy on benchmark images.
Section 7.5 will describe the trade-off between correlation accuracy and implementation costs.
Finally, Section 7.6 will provide additional summary.

7.1 Motivation

In Section 2.4, the definition of the original census transform was given (Equation 2.3)
and in Section 5.1 a more generalized version of the census transform was defined (Equation 5.1).
Both of these can be defined in terms of the same function, $\xi(p, p')$ (Equation 2.2), to indicate the
relationship between the intensity of two pixels.
The original definition of this function is somewhat arbitrary. It is possible to redefine
the function to provide a higher degree of discrimination between different pixels. In fact, one
possible disadvantage of the census transform is its inability to discriminate between situations
where $I(p) > I(p')$, but they are very nearly equal, and situations where $I(p) \gg I(p')$. If we
replace $\xi$ with a function that provides a more descriptive representation of the difference between
two pixels, then we can achieve a greater level of discrimination in the census transform.

7.2 Definition

To begin, let us rewrite Equation 2.2 as

$$ \xi(p, p') = \begin{cases} 1 & \text{if } I(p) - I(p') > t, \\ 0 & \text{otherwise.} \end{cases} \tag{7.1} $$

Written this way, we see that $\xi(p, p')$ has a value that is based on whether or not the intensity
difference between $p$ and $p'$ is greater than some threshold $t$, where traditionally $t = 0$. Let us now
define an ordered set of thresholds, $T$, with cardinality $|T|$ and elements $T_i$, where $0 \le i < |T|$. We
can then define a non-binary, or multiple-bit, version of $\xi$, called $\xi_M$:

$$ \xi_M(p, p') = \begin{cases} 0 & \text{if } I(p) - I(p') \le T_0, \\ 1 & \text{if } I(p) - I(p') > T_0 \text{ and } I(p) - I(p') \le T_1, \\ \;\vdots & \\ |T| - 1 & \text{if } I(p) - I(p') > T_{|T|-2} \text{ and } I(p) - I(p') \le T_{|T|-1}, \\ |T| & \text{if } I(p) - I(p') > T_{|T|-1}. \end{cases} \tag{7.2} $$

There is no reason why $\xi_M$ must be assigned integer values in the range $[0, |T|]$. Instead, the
mapping between threshold regions and values could be arbitrary. However, to achieve an efficient
encoding in binary and a linear weighting, it makes sense to at least use sequential integer values.
Furthermore, it is logical that $|T|$ be odd and that the elements of $T$ be symmetric about 0, since
$I(p) > I(p')$ and $I(p) < I(p')$ are equally likely in general, but this again is not a requirement.
With these assumptions we can define the n-bit census transform to be a multi-bit census transform
where $|T| = 2^n - 1$. Thus each pixel of the n-bit census-transformed image is n bits.
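
Because the thresholds are ordered, the value of $\xi_M$ is simply the number of thresholds that the intensity difference strictly exceeds. The sketch below illustrates this with Python's standard `bisect_left`. The function name `xi_m` and the example threshold values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def xi_m(i_p, i_q, thresholds):
    """Multi-bit comparison of Equation 7.2.

    i_p, i_q: intensities I(p) and I(p').
    thresholds: the ordered threshold set T as an ascending list.
    Returns the number of thresholds strictly below I(p) - I(p').
    """
    return bisect_left(thresholds, i_p - i_q)


# A 2-bit census uses |T| = 2**2 - 1 = 3 thresholds (example values only).
T = [-16, 0, 16]
print([xi_m(diff, 0, T) for diff in (-20, -16, -5, 0, 5, 16, 17)])

# With T = [0] the function reduces to the original 1-bit comparison.
print(xi_m(10, 9, [0]), xi_m(9, 9, [0]), xi_m(8, 9, [0]))
```

A hardware implementation would instead evaluate the comparisons in parallel and encode the result, but the software search yields the same mapping.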
With the original, sparse, and generalized census transforms, dissimilarity between two
transformed pixels was measured using the Hamming distance, which is equal to the sum of the
non-zero bits (i.e., population count) of the XOR of two census-transformed pixels. Noting that the
XOR of two single-bit numbers A and B is equal to $|A - B|$, we can replace the single-bit value of
$\xi \,\mathrm{XOR}\, \xi'$ in the original census method with the multi-bit result $|\xi_M - \xi_M'|$. As with the Hamming
distance, summing these results for the chosen census neighborhood then achieves a measure of
dissimilarity for a pair of multi-bit-census-transformed pixels. Aggregation can then be achieved,
as before, by summing the dissimilarity measures over a window.
Using the definition of ξM in Equation 7.2 and the revised definition for the dissimilarity
measure just discussed, we see that the original census transform is equivalent to the multi-bit
census transform for the case where T in Equation 7.2 is the set (0).
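As an illustration of Equation 7.2 and the revised dissimilarity measure, the following Python sketch computes ξ_M for a list of intensity differences and sums the absolute differences of the results. The function names and the sample intensity differences are hypothetical and are not taken from the implementation evaluated in this work; the binary search is simply one convenient way to evaluate the thresholds in software.

```python
from bisect import bisect_left


def xi_m(diff, thresholds):
    """Multi-bit comparison value for diff = I(p) - I(p').

    `thresholds` is the ordered set T. bisect_left counts the thresholds that
    are strictly less than diff, which gives 0 when diff <= T_0 and |T| when
    diff > T_{|T|-1}, as in Equation 7.2.
    """
    return bisect_left(thresholds, diff)


def multibit_census(diffs, thresholds):
    """Apply xi_m to the intensity difference of every edge of the census graph."""
    return [xi_m(d, thresholds) for d in diffs]


def dissimilarity(a, b):
    """Sum of |xi_M - xi_M'| over the census neighborhood."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical intensity differences for a 4-edge graph in two images.
left_diffs = [5, -3, 0, 1]
right_diffs = [2, -1, 4, -1]

# With T = (0) the values are single bits and the measure is the Hamming distance.
one_bit = dissimilarity(multibit_census(left_diffs, (0,)),
                        multibit_census(right_diffs, (0,)))

# With T = (-2, 0, 2) each edge contributes a value between 0 and 3.
two_bit = dissimilarity(multibit_census(left_diffs, (-2, 0, 2)),
                        multibit_census(right_diffs, (-2, 0, 2)))
print(one_bit, two_bit)
```

For T = (0) the transformed vectors contain only the values 0 and 1, so the sum of absolute differences coincides with the population count of their XOR.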

7.3 Hardware Implementation of a Multi-bit Census Transform
The multi-bit census transform inherently maintains more information from each pixel
comparison, and this increased level of discrimination comes at the cost of additional hardware
resources. Immediately we can see that the n-bit census transform requires n times as many bits
to encode as the original 1-bit census transform when using the same census graph. Therefore we
would expect much of the correlation hardware employing the multi-bit census in a stereo vision
implementation to require roughly n times the amount of hardware resources.
The multi-bit census transform computation itself requires additional hardware. Each edge
in the multi-bit census transform of a pixel requires |T| comparisons to be implemented, followed
by logic to convert the results of these comparisons into a binary encoding of ξ_M. In the
1-bit census (i.e., the traditional census transform) this logic is very simple, requiring only a single
comparison, the result of which is the 1-bit binary encoding. For a 2-bit census transform
(|T| = 3), this logic is a 3-input, 2-output binary function. In general, the n-bit census transform will
require 2^n − 1 comparisons per transform graph edge followed by the logic to encode the result of
the thresholds.
We can optimize this process by choosing thresholds that are evenly spaced multiples of 2^k.
In this case, we can simply compute I(p) − I(p′), drop the lower k bits, then saturate the result to an
n-bit integer. This effectively converts an O(2^n) calculation into O(1). Performing this computation
once can be dramatically simpler than performing 2^n − 1 comparisons and encoding the outcome.
An example for the census transform of a 4-bit image where n = 2 and k = 1 (i.e., T = (−2, 0, 2))
is shown in Table 7.1.
Table 7.1: 2-bit Census Transform Thresholding and Encoding for T = (−2, 0, 2)

I(p) − I(p′)    With k Bits Dropped    Saturated (ξ_M)
     ⋮                   ⋮                   -2
   1010                 101                  -2
   1100                 110                  -2
   1101                 110                  -2
   1110                 111                  -1
   1111                 111                  -1
   0000                 000                   0
   0001                 000                   0
   0010                 001                   1
   0011                 001                   1
   0100                 010                   1
     ⋮                   ⋮                    1

Fortunately, because the additional comparisons required for the multi-bit census transform
can be done in parallel, a multi-bit implementation in hardware does not necessarily add any
computation time. Depending on the implementation, a small increase in latency is likely, but
throughput can be maintained at the same rate as the traditional census transform.
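The shift-and-saturate optimization described in this section can be sketched in a few lines of Python. This is a software illustration only, with a hypothetical function name, and it assumes the signed n-bit output range used in Table 7.1 rather than the range 0 to |T| of Equation 7.2. Because the dissimilarity measure uses absolute differences, a constant offset in the encoding does not change the result.

```python
def shift_saturate(diff, n, k):
    """Drop the k low-order bits of diff, then saturate to a signed n-bit value."""
    # Python's >> on integers is an arithmetic shift, i.e. floor division by 2**k.
    shifted = diff >> k
    lowest = -(1 << (n - 1))       # e.g. -2 for n = 2
    highest = (1 << (n - 1)) - 1   # e.g. +1 for n = 2
    return max(lowest, min(highest, shifted))


# Reproduce the rows of Table 7.1 (n = 2, k = 1, T = (-2, 0, 2)).
differences = [-6, -4, -3, -2, -1, 0, 1, 2, 3, 4]
encoded = [shift_saturate(d, n=2, k=1) for d in differences]
print(encoded)
```

Note that the arithmetic shift rounds toward negative infinity, so in this sketch a difference exactly equal to a threshold falls into the upper region, as it does in Table 7.1.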

7.4 Multi-bit Census Transform Correlation Accuracy
There are a large number of variables that affect the correlation accuracy of the multi-bit
census transform. These variables include the version of the census transform window (i.e.,
traditional census, sparse census, or generalized census), the neighborhood/graph being employed,
the size of the correlation window, and, of course, the set T that defines the multi-bit transform.
In order to evaluate the performance of the multi-bit census, 18 different configurations of T were
evaluated with 12 different census windows and 5 different correlation window sizes, for a total
of 1,080 different configurations run on all 8 evaluation image pairs.
Three tables are presented which show the results for various multi-bit census configurations.
Within each table are the results for 7 different sets T. Each table was computed using
a different census graph and correlation window size. This effectively presents the data along
two different axes, the first being the base census configuration and the second being the set T.
To reduce the amount of data needed to highlight how accuracy varies with the parameters, only
the results for the generalized census are shown. The trends described below are similar for the
traditional census and sparse census.

Table 7.2: Multi-bit Census Transform Correlation Accuracy Comparison. All methods use the
4-edge generalized census transform of Figure 5.8(d) with a 9 × 9 correlation window.

Dataset   Gen. Census   (-1,0,1)   (-2,0,2)   (-4,0,4)   (-2,-1,0,1,2)   (-4,-2,0,2,4)   (-6,-3,0,3,6)
Tsukuba      88.79        89.98      90.71      90.40        90.68           90.88           90.82
Venus        95.29        96.03      95.73      95.11        95.97           95.45           95.00
Teddy        86.95        87.25      87.05      86.75        87.21           86.85           86.57
Cones        88.02        88.06      88.01      87.84        88.04           87.93           87.70
Average      88.76        90.33      90.37      90.02        90.48           90.28           90.02
Rocks        91.42        91.47      91.47      91.45        91.48           91.47           91.37
Baby         91.55        91.92      91.89      91.59        91.96           91.70           91.50
Wood         87.35        89.13      89.06      88.32        89.49           89.20           88.68
Cloth        87.23        87.41      87.46      87.43        87.47           87.47           87.42
Average      89.39        89.98      89.97      89.70        90.10           89.96           89.74
Beginning with Table 7.2, we see the correlation results for several sets T when combined
with the 4-edge generalized census transform of Figure 5.8(d) and a 9 × 9 correlation window. This
data illustrates how the correlation accuracy can be improved using the multi-bit census transform.
At the other end of the spectrum, Table 7.4 shows the correlation results for the same sets T when
combined with the 16-edge generalized census transform of Figure 5.8(a) and a 13 × 13 correlation window. This example illustrates how, in other cases, the multi-bit census offers little or no
improvement in correlation accuracy. Table 7.3 shows the results of an intermediate configuration
using the 8-edge generalized census transform of Figure 5.8(c) with an 11 × 11 correlation window. Data for cases where |T | > 5 are not shown because the correlation accuracy did not improve
beyond that achieved with |T | = 5.
The multi-bit census offers the most improvement over the traditional single-bit census in
cases where the census window is especially sparse and/or the correlation window size is too
small. In other words, the multi-bit census can help compensate for a smaller number of census
comparisons by providing greater discrimination between pixels. For example, the 4-edge census
transform with 9 × 9 correlation window (Table 7.2) uses a relatively sparse transform graph and
a correlation window that is too small to maximize correlation accuracy. As a result, the single-bit
census transform does not perform as well using these parameters. In this case, the multi-bit
version offers a small improvement in correlation accuracy. At the other end of the spectrum,
the 16-edge transform with 13 × 13 correlation window already uses much more information to
generate the census vectors. In this case, the multi-bit census offers little or no improvement.

Table 7.3: Multi-bit Census Transform Correlation Accuracy Comparison. All methods use the
8-edge generalized census transform of Figure 5.8(c) with an 11 × 11 correlation window.

Dataset   Gen. Census   (-1,0,1)   (-2,0,2)   (-4,0,4)   (-2,-1,0,1,2)   (-4,-2,0,2,4)   (-6,-3,0,3,6)
Tsukuba      91.64        92.00      92.11      92.03        92.19           92.16           92.08
Venus        96.80        96.75      96.45      96.06        96.56           96.11           95.85
Teddy        87.17        87.17      86.95      86.56        87.03           86.64           86.35
Cones        87.34        87.38      87.34      87.15        87.36           87.21           87.01
Average      90.74        90.83      90.71      90.45        90.79           90.53           90.33
Rocks        90.97        90.98      90.99      90.93        91.00           90.95           90.88
Baby         91.88        91.99      91.93      91.66        91.97           91.72           91.38
Wood         89.49        89.72      89.71      89.51        89.71           89.61           89.41
Cloth        87.03        87.08      87.06      86.98        87.08           87.02           86.98
Average      89.84        89.94      89.92      89.77        89.94           89.83           89.66

Table 7.4: Multi-bit Census Transform Correlation Accuracy Comparison. All methods use the
16-edge generalized census transform of Figure 5.8(a) with a 13 × 13 correlation window.

Dataset   Gen. Census   (-1,0,1)   (-2,0,2)   (-4,0,4)   (-2,-1,0,1,2)   (-4,-2,0,2,4)   (-6,-3,0,3,6)
Tsukuba      92.28        92.55      92.61      92.45        92.70           92.51           92.50
Venus        96.87        96.73      96.42      95.96        96.52           96.02           95.75
Teddy        86.91        86.77      86.52      86.08        86.58           86.16           85.84
Cones        87.01        87.00      86.95      86.73        86.97           86.80           86.60
Average      90.77        90.76      90.62      90.30        90.69           90.37           90.17
Rocks        90.58        90.60      90.58      90.52        90.60           90.53           90.47
Baby         91.86        91.96      91.89      91.55        91.92           91.64           91.24
Wood         89.45        89.45      89.33      89.14        89.37           89.18           89.02
Cloth        86.71        86.72      86.70      86.65        86.71           86.69           86.62
Average      89.65        89.69      89.63      89.46        89.65           89.51           89.34
The benefit of the multi-bit census transform also diminishes as the spacing between the
elements of T increases and, therefore, as the maximum magnitude of the elements of T increases.
This is shown in more detail in Figures 7.1 and 7.2, which show the correlation accuracy for
various threshold sets where |T| = 3 and |T| = 5, respectively, using a 4-edge census transform
and a 9 × 9 correlation window. Note that the accuracies presented at x = 0 correspond to the
original 1-bit census (i.e., T = (0)). From these figures we see that the correlation accuracy
increases slightly for small magnitude thresholds but decreases as the threshold magnitudes
increase. Although not shown, if we were to continue these plots for even larger values of x, we
would see that the correlation accuracy reaches a minimum then gradually approaches the accuracy
of the 1-bit census.

Figure 7.1: Correlation accuracy versus threshold set for |T| = 3. Each value
along the x axis corresponds to the set T = (−x, 0, x) using the 4-edge generalized census
transform of Figure 5.8(d) with a 9 × 9 correlation window. [Plot of Correct Disparities (%)
versus x, for x from 0 to 10, with curves for Tsukuba, Venus, Teddy, Cones, and Average.]

Figure 7.2: Correlation accuracy versus threshold set for |T| = 5. Each value
along the x axis corresponds to the set T = (−2x, −x, 0, x, 2x) using the 4-edge
generalized census transform of Figure 5.8(d) with a 9 × 9 correlation window. [Plot of Correct
Disparities (%) versus x, for x from 0 to 10, with curves for Tsukuba, Venus, Teddy, Cones, and
Average.]

7.5 Cost Versus Benefit
Section 7.4 showed that, in some cases, the multi-bit census transform does offer a small
benefit in correlation accuracy. It is therefore important to consider how the improvement in
correlation accuracy provided by the multi-bit census stereo method compares to the amount of
additional hardware resources required for its implementation.
In general, because the n-bit census transform requires n times the number of bits to be
represented as the original 1-bit census, a stereo vision correlation system based on the multi-bit
census transform will require roughly n times the amount of hardware resources to implement as
a system using the 1-bit census with the same transform graph and correlation window. Similarly,
choosing a census transform graph that uses n times the number of edges will result in a stereo
correlation system that uses roughly n times the amount of hardware, assuming all other parameters
remain the same.
Although the resource requirements could be similar, the n-bit census transform
will likely require slightly more hardware to implement than the 1-bit census with a census graph
having n times as many edges. This is because the n-bit census transform, in general, requires
2^n − 1 comparisons for each edge in the census graph followed by a logic tree to encode the result
of the threshold comparisons, whereas the 1-bit census only requires 1 comparison per edge. Even
if we limit the elements of T to be evenly spaced multiples of 2^k and apply the optimization
suggested in Section 7.3, we still must saturate the numeric result for each edge, which adds to the
hardware costs. Additionally, during the correlation step, the multi-bit census must compute the
absolute difference to compare transformed pixels instead of using the simpler XOR operation.

Table 7.5: Hardware Resource Benefit for the Multi-bit Census

Method                          Average Accuracy
4-edge, 2-bit (-2,0,2)               90.17
8-edge, 1-bit                        90.11
4-edge, 3-bit (-2,-1,0,1,2)          90.29
12-edge, 1-bit                       90.18
The data in Section 7.4 suggest that transforms where |T | = 5 provide slightly better results
in some cases than where |T | = 3. However, |T | = 5 requires a 3-bit census encoding instead of
the 2 bits required for |T | = 3. As a result, we would expect the hardware resource requirements
to be at least 50% greater for |T | = 5. Given the very small increase in accuracy, the increase in
hardware resource requirements may not be cost effective for some applications.
Table 7.5 compares the average correlation accuracy for the multi-bit implementation of
the 4-edge generalized census with a 9 × 9 correlation window (i.e., Table 7.2) against the correlation accuracy of the original 8-edge and 12-edge generalized census transforms using the same
correlation window size. These data suggest that the multi-bit census offers very close to the same
correlation accuracy as a single-bit census that uses a similar amount of hardware. Therefore, in
terms of hardware cost, the multi-bit census offers at best only a very small improvement over the
1-bit census.
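The pairing of configurations in Table 7.5 can be checked with a rough count of census vector bits and threshold comparisons per transformed pixel. The sketch below is only a first-order proxy under stated assumptions: the function name is hypothetical, each edge is assumed to need |T| comparisons and enough bits to encode the values 0 through |T|, and the encoding, saturation, and aggregation logic discussed above is ignored.

```python
def census_cost(edges, num_thresholds):
    """Return (bits per transformed pixel, comparisons per transformed pixel)."""
    # Bits needed to encode the largest value of xi_M, which is |T|.
    bits_per_edge = num_thresholds.bit_length()
    return edges * bits_per_edge, edges * num_thresholds


configurations = {
    "4-edge, 2-bit (-2,0,2)": census_cost(4, 3),
    "8-edge, 1-bit": census_cost(8, 1),
    "4-edge, 3-bit (-2,-1,0,1,2)": census_cost(4, 5),
    "12-edge, 1-bit": census_cost(12, 1),
}
for name, (bits, comparisons) in configurations.items():
    print(f"{name}: {bits} bits, {comparisons} comparisons")
```

Under these assumptions each multi-bit configuration produces the same number of census bits as the 1-bit configuration it is paired with, but requires more comparisons, which is consistent with the expectation that the multi-bit version costs slightly more.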

7.6 Summary
This chapter introduced the multi-bit census transform. This transform is an extension of
the sparse and generalized census transforms described previously that achieves better correlation
accuracy for some stereo configurations at the expense of increased resource requirements.
Specifically, it is most beneficial in situations where the correlation window size is restricted and
the transform graph is very sparse. However, the improvement in correlation accuracy is not
significantly better than using a less sparse transform that requires a similar amount of hardware.
As a result, there does not appear to be a clear benefit to using the multi-bit census transform for
stereo vision applications in resource-constrained systems.


CHAPTER 8.

A ROTATION-INVARIANT CENSUS TRANSFORM

The census transform has found useful application in stereo vision due to its relative immunity to errors caused by disparity discontinuities and other image defects. The transform has
also proven useful in many other machine-vision applications, such as face detection [87], hand
posture classification [88], optical flow estimation [89], and others.
One disadvantage of the original census transform that limits its application is the fact that
it does not tolerate image rotation. In general, using the original census transform to match image
locations that could be rotated requires the reference image to be rotated and the transform to be
computed for each rotation under consideration. This requires many image rotations and many
transforms to be computed as well as many transformed images to be stored.
In this chapter, a rotation-invariant census transform is proposed. Based on the generalized
transform, the rotation-invariant transform only needs to be computed once, rather than once for
each rotation to be considered. Subsequent rotations can then be achieved using trivial bit-shifting
operations.

8.1 Definition
In the bit vector that results from the census transform of a pixel, each bit corresponds to
a comparison between two pixels. As a result, each bit has a spatial association with locations in
the census transform window. We can therefore create a rotation-invariant census transform by
designing a census graph that is identical when rotated by any multiple of some minimum rotation
factor, θ, where θ is an angle that evenly divides 360°. With such a graph, the census transform
can effectively be rotated simply by rotating the census vector using a simple bit-shift operation.
Note that bit shifting does not actually change the data bits, but simply changes their location.
As a result, a census transform that meets this criterion only needs to be computed once because it
already contains the transform for 360/θ discrete rotations.


Consider for example the 24-edge census graph of Figure 8.1(a). In this figure, each graph
edge and the associated census bit are labeled, showing the relationship between the bit vector
positions and spatial positions within the transform. Figure 8.1(b) shows the result of the census
transform after having been bit-shifted using a right bit rotation. Note that this bit-shifting
operation is equivalent to rotating the image by the same number of rotation factors then computing
the census transform. However, it is dramatically more efficient computationally, being essentially
free in a hardware implementation. In this example, each shift by one bit is equivalent to a rotation
of 15°.
The transform is not limited to 360/θ edges, although this is the most obvious case.
Figure 8.2 shows a different example where there are two edges per slice. In this case, shifting the
census vector by two bits is equivalent to a rotation of 45°. Note that shifting by one bit is not
meaningful with this transform, since this effectively swaps the roles of each bit between
representing a spoke in the graph and representing an outer edge. In other words, this fundamentally
changes the orientation of the edges relative to each other, rather than simply rotating the graph by
θ about a single point. Thus, for a transform having n edges, we must shift the census bit vector
by nθ/360 bit positions to achieve a rotation of θ.
In general, a rotation-invariant graph can be created with any number of bits that is a multiple of 360/θ . The graphs of Figures 8.1–8.2 serve as illustrative examples. Many other graphs
are possible, limited only by the radius and the granularity of θ that can reasonably be achieved in
the discretized image.
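The relationship between a rotation angle and the corresponding shift of the census vector can be sketched as follows. The function names are hypothetical, the census vector is represented as a Python sequence rather than a hardware bit vector, and the direction of rotation that a right shift represents depends on how the graph edges are labeled.

```python
def shift_for_angle(angle_deg, n_edges, edges_per_slice=1):
    """Number of bit positions, n * angle / 360, for a rotation of angle_deg."""
    slices = n_edges // edges_per_slice
    theta = 360 / slices                 # minimum rotation factor in degrees
    steps = angle_deg / theta
    if steps != round(steps):
        raise ValueError("angle must be a multiple of the rotation factor")
    return int(round(steps)) * edges_per_slice


def rotate_right(vector, shift):
    """Circularly shift a census vector to the right by `shift` positions."""
    shift %= len(vector)
    return vector[-shift:] + vector[:-shift]


# 24-edge graph of Figure 8.1: a 60 degree rotation is a shift of 4 positions.
labels = "ABCDEFGHIJKLMNOPQRSTUVWX"
print(rotate_right(labels, shift_for_angle(60, 24)))

# 16-edge graph of Figure 8.2 (two edges per slice): 45 degrees is a shift of 2.
print(shift_for_angle(45, 16, edges_per_slice=2))
```

In hardware the same operation amounts to selecting a different wiring of the existing bits, which is why it is essentially free.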

8.2 Graph Point Selection
Because the image has a discrete set of pixel positions, it is logical to adjust the locations
of the endpoints of the graph edges to the nearest discrete pixel locations. As a result, there will
be some deviation from the ideal graph locations in terms of the rotation factor and radius, and
the actual census transform implementation will only approximate the rotations of the ideal graph.
Note that it is also possible to interpolate between pixels to estimate the pixel values at the ideal
endpoint locations, but this introduces a significant amount of complexity to the transform and will
not be explored here.

Figure 8.1: Rotation-invariant census transform graph. (a) Original graph, with census bit
vector A B C D E F G H I J K L M N O P Q R S T U V W X. (b) Graph rotated by 60°, with census
bit vector U V W X A B C D E F G H I J K L M N O P Q R S T.

Figure 8.2: Alternative rotation-invariant census graph, with census bit vector
A B C D E F G H I J K L M N O P.

In order to ensure the rotation invariance of the census, we must choose points for the graph
that closely approximate a circle. One obvious discretized census graph would be one that makes
use of the traditional equation for a circle:
\[
x^2 + y^2 = r^2.
\tag{8.1}
\]
This equation is called the implicit form. Using this equation, we can solve for discrete
values of x and y that most closely fall along the circle, such as would be done using a traditional
circle drawing algorithm in computer graphics. This results in a circle of discrete points like that
of Figure 8.3(a), which shows the circle for r = 6 and θ = 360°/32 = 11.25°.
The disadvantage of choosing the points in this way is that the distance between the discrete
pixel locations and the ideal locations tends to increase as the radius increases. Figure 8.4 shows the
average distance between the discretized point and the ideal point when using the implicit equation
with different values of r. If we wish to minimize the deviation between our discrete point selection
and the ideal circle of points, then the parametric equations for a circle are more appropriate:
\[
\begin{aligned}
x &= r\cos(\phi), \\
y &= r\sin(\phi).
\end{aligned}
\tag{8.2}
\]
Using these equations, we choose the number of edges, n, the radius, r, and let θ = 360°/n.
We then plug in multiples of θ for φ and solve for the (x, y) coordinates that most closely
approximate the ideal points of the transform graph. Figure 8.3(b) shows such a graph for r = 6.
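Point selection with the parametric equations can be sketched as follows. The function name is hypothetical, and rounding each coordinate independently to the nearest integer is an assumption about how the closest pixel is chosen.

```python
import math


def circle_points(r, n):
    """Discrete graph endpoints for n evenly spaced angles on a circle of radius r."""
    theta = 2 * math.pi / n   # rotation factor in radians (360/n degrees)
    points = []
    for i in range(n):
        x = r * math.cos(i * theta)   # ideal x from Equation 8.2
        y = r * math.sin(i * theta)   # ideal y from Equation 8.2
        points.append((round(x), round(y)))
    return points


# The configuration of Figure 8.3(b): r = 6 with n = 36 (theta = 10 degrees).
points = circle_points(6, 36)
print(points[:5])
```

Because every point is derived from a multiple of θ, the i-th point of the list corresponds to the i-th position of the census vector, which keeps the bit-shift rotation of Section 8.1 aligned with the geometry.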

Figure 8.3: Discrete circular graphs for r = 6. (a) Graph obtained using the implicit
equation. In this case n = 32 and θ = 11.25°. (b) Graph obtained using the parametric
equations. In this case n = 36 and θ = 10°. The dots along the circle mark the ideal
locations.

Less obvious is the selection of the number of points, n. One possibility is to simply choose
the number of points that most closely leads to the value of θ needed for the application. When the
smallest θ possible is needed, we can choose a value of n that guarantees there will be no gaps in
the circle, such as the circumference, n = 2πr. Doing so results in some redundancy in the graph
since two consecutive points along the circle may fall within the same pixel.

Table 8.1: Number of Graph Points that Minimizes Deviation for a Circular Graph

  r     n   |    r     n   |    r     n   |    r     n
  2    12   |    8    48   |   14    84   |   20   120
  3    16   |    9    52   |   15    92   |   21   128
  4    24   |   10    60   |   16    96   |   22   136
  5    28   |   11    68   |   17   104   |   23   140
  6    36   |   12    72   |   18   108   |   24   148
  7    40   |   13    80   |   19   116   |   25   152

In an attempt to maximize the accuracy of the census transform, we can choose the value n
near 2πr that yields the smallest average deviation between the selected points and the ideal graph.
To this end, a program was written that finds the average deviation for different values of r and n.
Table 8.1 shows the values of n within 30% of 2πr that yield the minimum average distance
between the discrete point locations and the ideal locations.
Notice that the optimal number of points is always a multiple of four. Using a multiple of
four guarantees that 90° rotations are identical to the original graph. In general the number of points
that minimizes deviation is roughly linear in r and can be approximated by n = 4⌊2πr/4⌋. Figure 8.4
shows the average deviation using the parametric equations and using the optimal number of graph
points to minimize the deviation. As shown, this method results in a lower deviation from the ideal
graph than using the implicit equation.
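A search of this kind can be sketched as follows. This is not the program used to produce Table 8.1: the function names, the use of Euclidean distance, the rounding of each coordinate to the nearest integer, and the choice of the smallest n in the event of a tie are all assumptions, so the output is not guaranteed to reproduce the table.

```python
import math


def average_deviation(r, n):
    """Mean distance between the ideal circle points and their nearest pixels."""
    total = 0.0
    for i in range(n):
        phi = 2 * math.pi * i / n
        x, y = r * math.cos(phi), r * math.sin(phi)
        total += math.hypot(x - round(x), y - round(y))
    return total / n


def best_n(r, tolerance=0.30):
    """Value of n within `tolerance` of 2*pi*r with the smallest average deviation."""
    circumference = 2 * math.pi * r
    lowest = math.ceil((1 - tolerance) * circumference)
    highest = math.floor((1 + tolerance) * circumference)
    # min() keeps the first (smallest) n when several values tie.
    return min(range(lowest, highest + 1), key=lambda n: average_deviation(r, n))


for r in range(2, 8):
    n = best_n(r)
    print(r, n, round(average_deviation(r, n), 3))
```

The approximation n = 4⌊2πr/4⌋ given above can be evaluated with `4 * math.floor(2 * math.pi * r / 4)` for comparison with the result of the search.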

Figure 8.4: Average deviation from ideal graph using the implicit and parametric equations.
[Plot of Average Deviation versus Radius, with curves for Implicit and Parametric.]

8.3 Accuracy of the Rotation-Invariant Census Transform
In order for the rotation-invariant census transform to be useful, the transformed pixels of
the rotated image must closely match the transformed pixels of the unrotated image. To test this,
we begin by computing the rotation-invariant census transform of a pixel within an image. We then
choose a random rotation, φ, between 0° and 360° and then shift the census transform vector by
the amount that most closely represents the angle φ. Finally, we compute the census transform of
the image after rotating it by φ about the pixel using bicubic interpolation. The shifted version of
the census transform of the original image should match the census transform of the rotated image.
For this testing, we use the test images of Figure 2.5, randomly selecting 500 points from each
image and applying 8 random rotations to each. The circular graph having minimum deviation, as
described in Section 8.2, has been used.
Figure 8.5 shows the fraction of census bits that do not exactly match, on average, for
each radius. Note that a completely random image would result in an average of 0.5 (50%) being
incorrect. For all images and all radii, the fraction of incorrect census bits is shown to be less than
10%. The data show that the average error decreases as the radius increases, which is consistent
with the fact that the rotation factor θ is smaller for a larger radius.
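The comparison step of this test can be sketched as follows. The sketch covers only the selection of the nearest shift and the fraction of mismatched bits, not the image rotation or the transform itself. The function names and the two census vectors are hypothetical, and a graph with one edge per slice is assumed.

```python
def nearest_shift(phi_deg, n):
    """Shift that most closely represents a rotation of phi_deg for an n-edge graph."""
    return round(phi_deg * n / 360) % n


def mismatch_fraction(a, b):
    """Fraction of census bits that differ between two equal-length vectors."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical 8-bit census vector of a pixel in the original image.
original = [1, 0, 0, 1, 1, 0, 1, 0]
# Hypothetical census vector of the same pixel after rotating the image by 100 degrees.
observed = [1, 0, 1, 0, 0, 1, 0, 0]

shift = nearest_shift(100, len(original))
shifted = original[-shift:] + original[:-shift]
print(shift, mismatch_fraction(shifted, observed))
```

Averaging this fraction over many points and rotations gives the quantity plotted in Figure 8.5.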

Figure 8.5: Average fraction of census bits that are incorrect. [Plot of Fraction of Incorrect
Census Bits versus Radius, with curves for Tsukuba, Venus, Teddy, Cones, and Average.]

8.4 Uniqueness of the Rotation-Invariant Census Transform
One obvious application of the rotation-invariant census transform is the matching of pixels.
Therefore, an important metric is the uniqueness of the census vector. In other words, we need
to answer the question of how many other rotations of a census vector match the census vector
for the correct rotation. Figure 8.6 shows the average number of census vector rotations that have
the same or lower Hamming distance than the correctly rotated census. This data was obtained
by selecting 500 random points from each image and applying four random rotations around each
point. As we can see, the correct census rotation is not unique, and there will generally be several
matches. As a result, this rotation-invariant census transform, by itself, cannot generally be used
to uniquely identify pixels.
This should come as no surprise, since the same is true of the original census transform. For
this reason, the original census transform of a pixel is not typically used by itself to identify a pixel
match. Instead, we aggregate the results from several census transformed pixels near the point of
comparison. For example, in the census stereo method, we summed the Hamming distances over a
window to obtain an aggregate error for the region around the pixel under consideration. The same
is possible with the rotation-invariant census transform, although in this case the locations of the
pixels to be summed must also be rotated.
Another factor in the lack of uniqueness of the rotation-invariant census transform vector
is that adjacent rotations tend to be similar. In other words, even when the match is not unique,
the other matches with the same or lower Hamming distance tend to be adjacent rotations. This is
demonstrated in Figure 8.7, which shows the average angle by which the matches deviate from the
correct rotation. Deviation decreases as the radius increases because the minimum rotation factor
also decreases as the radius increases.
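The uniqueness measure described in this section can be sketched as follows. The function names and the example vectors are hypothetical, and the count in this sketch includes the correct rotation itself, which is an assumption about how the matches are tallied.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def count_matches(reference, observed, correct_shift):
    """Rotations of `reference` that match `observed` at least as well as the correct one."""
    n = len(reference)
    rotations = [reference[-s:] + reference[:-s] for s in range(n)]
    distances = [hamming(rotation, observed) for rotation in rotations]
    return sum(d <= distances[correct_shift] for d in distances)


# Hypothetical census vector and the vector observed after a rotation equal to
# a shift of 2, with one bit corrupted.
reference = [1, 1, 0, 0, 1, 0, 0, 0]
observed = [0, 0, 1, 1, 0, 0, 1, 1]
print(count_matches(reference, observed, correct_shift=2))
```

In this example a shift of 6 matches the observed vector as well as the correct shift of 2 does, which illustrates why a single transformed pixel is not sufficient to identify a rotation uniquely.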

Figure 8.6: Rotation-invariant census transform uniqueness. This graph shows the average number
of census vector rotations that have a Hamming distance that is less than or equal to that of the
correct match. [Plot of Number of Matches versus Radius, with curves for Tsukuba, Venus, Teddy,
Cones, and Average.]

Figure 8.7: Average deviation from the correct match. [Plot of Average Deviation in Degrees
versus Radius, with curves for Tsukuba, Venus, Teddy, Cones, and Average.]

8.5 A Rotation-Invariant Rank Transform
Similar to the census transform, a rotation-invariant rank transform can also be defined.
Recalling that the rank transform is equivalent to the population count of the census transform
vector, the rotation-invariant rank transform is simply the population count of the rotation-invariant
census transform. Therefore, the same graph examples that were shown for the rotation-invariant
census transform apply equally to the rank.
The ideal rotation-invariant rank transform does have the nice property that it is truly invariant to rotation. This is unlike the census transform, which must be bit-shifted to obtain the
different rotations. However, this feature comes at the price of reduced discrimination between
pixels, since many unique census transform vectors have the same population count.
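Both properties can be seen in a short sketch, in which the function name and the example vectors are hypothetical. A circular shift only moves bits, so the population count is the same for every rotation, while distinct census vectors can share the same count.

```python
def rank_from_census(bits):
    """Rank transform value as the population count of a census vector."""
    return sum(bits)


vector = [1, 0, 0, 1, 1, 0, 1, 0]

# The rank of every circular shift of the census vector.
ranks = {rank_from_census(vector[-s:] + vector[:-s]) for s in range(len(vector))}
print(ranks)

# A different census vector with the same rank, showing the loss of discrimination.
other = [1, 1, 1, 1, 0, 0, 0, 0]
print(rank_from_census(other) == rank_from_census(vector))
```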

8.6 Summary
This chapter has proposed a rotation-invariant census transform that can be used in situations
where image rotation is required. Using this new transform, the reference image only needs
to be transformed once and different rotations can be obtained by bit-shifting the resulting census
transform vector. This results in dramatically reduced computation compared to the original census
transform, which requires the image to be rotated and the census transform to be computed for
each rotation under consideration. The discriminatory properties of the rotation-invariant census
transform have also been presented, showing that, on average, the census transform of a rotated
image closely matches the result obtained with the rotation-invariant census transform.


CHAPTER 9.

CONCLUSIONS AND FUTURE WORK

Stereo vision is a difficult and computationally demanding problem with many useful applications. FPGAs have been shown to be well-suited for the implementation of a specific class of
stereo algorithms known as local methods. Of the local methods, the census and rank methods have
been shown to deliver superior correlation accuracy. At the same time, these algorithms are particularly well-suited to FPGA implementation due to the large number of bit-level manipulations
required, which are less efficiently executed on most other computing platforms, such as DSPs and
general-purpose software-programmable processors.
This dissertation has extended the previous work done on the rank and census stereo methods by introducing a variety of sparse census transforms that dramatically reduce the computational
complexity of the census stereo method while preserving its correlation accuracy. The characteristics of the best transforms have been described and it has been shown that these transforms can
be optimized for the characteristics of the images. This dissertation has also proposed and analyzed the sparse rank transform, which also has reduced computational complexity compared to
the original rank transform yet offers similar correlation accuracy.
This dissertation then introduced the generalized census and rank transforms. These new
transforms are supersets of the original census and rank as well as the sparse census and rank. The
generalized transforms allow for a greater level of flexibility in selecting the optimal transform
while making it easier to achieve greater symmetry and reduced redundancy in the aggregation
step of the stereo method. Several examples of the generalized census and rank transforms have
been evaluated, resulting in correlation accuracy similar to that of the original transforms while
minimizing the hardware resource requirements of the stereo system.
A hardware architecture has been described that introduces several key optimizations intended to minimize the resource requirements of the stereo system while maximizing throughput.
The resource requirements for the proposed sparse census, sparse rank, generalized census, and
generalized rank algorithms have been shown using this architecture. The resource requirements
for the original census, original rank, and SAD stereo methods have also been shown. These results
confirm that dramatic hardware resource savings are possible, with minimal impact on correlation
accuracy.
This dissertation has also proposed the multi-bit census transform. This is a new transform
that can be used in sparse or generalized forms where the pixel comparison function has been replaced to provide a greater level of pixel discrimination. It has been shown that the multi-bit census
transform results in improved correlation accuracy in situations where the correlation window is
small and the transform graph is sparse. However, little benefit was seen in situations where the
transform window is sufficiently dense and the correlation window is sufficiently large.
A rotation-invariant census transform was also proposed. This transform can potentially be
used in applications where images may be rotated relative to each other. It has been shown that
the proposed rotation-invariant census transform provides results similar to the census transform
of the rotated image.

9.1 Future Work
There are several avenues for future research that could be explored in relation to the work
of this dissertation. It has been shown in this work that the sparse and generalized transforms can
be optimized for the characteristics of the image, such as the level of noise present. The level
of noise in the images from a camera can change as lighting changes and camera settings adjust.
Furthermore, in some applications, the characteristics of the scene may change from time to time.
Therefore, the optimal transform under these circumstances would be one that adjusts dynamically
to adapt to the characteristics of the images. To accomplish this, methods for evaluating the amount
of noise in the image and determining the characteristics of the scene would be necessary, in
addition to methods for choosing the optimal transform based on those characteristics. Since the
amount of logic required to implement the proposed census and rank transforms is small relative to
the rest of the stereo system, a dynamically adapting transform could in theory be developed with
a relatively small impact on the size of the overall correlation system.
Finding the best transform to use for a specific application is an optimization problem.
Efficient methods for finding the optimal transform for a specific set of images or general image
characteristics could also be explored. Furthermore, the ability to determine the optimal transform
under a specific set of constraints (e.g., maximum transform size, maximum graph edges, and/or
maximum correlation window size) would also be useful. This would allow for maximum correlation accuracy to be achieved while keeping the resource requirements within the limits demanded
by the application.
This work has focused primarily on the stereo correlation system. Many real-world implementations will also employ a number of post-processing steps to remove incorrect matches
from the resulting disparity image. Although many existing post-processing methods can be applied without changes to the proposed stereo methods, some are dependent on the specific stereo
method being used. The effectiveness of these post-processing methods in the context of the sparse
and generalized transforms has not been explored, nor has the impact on overall resource requirements been evaluated. Post-processing methods designed specifically for the sparse and generalized methods could also be developed and analyzed.
This work has focused on the stereo vision problem. Fundamentally, the sparse and generalized transforms are tools for evaluating the similarity of points in two or more images. This
concept is used in a variety of machine vision problems, including image registration, optical flow,
feature tracking, pattern matching, and so on. The usefulness of the generalized, sparse, and multi-bit transforms could also be explored in these other applications.
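
As an illustration of how such transforms measure the similarity of two image points, the following minimal Python sketch computes a census-style bit vector for every pixel and compares two vectors by Hamming distance. It is not the hardware architecture described in this dissertation. The function names, the corner-only SPARSE_3X3 pattern, and the wraparound border handling are assumptions chosen for brevity, and the sparse variant here simply evaluates a subset of the neighbors in the window.

```python
import numpy as np

# Neighbor offsets (dy, dx) relative to the center pixel.
DENSE_3X3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
             if (dy, dx) != (0, 0)]
# Hypothetical sparse pattern: only the four corners of the 3x3 window.
SPARSE_3X3 = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def census_transform(image, offsets):
    """Return one integer bit vector per pixel.

    Bit k (most significant first) is 1 where the neighbor at offsets[k]
    is darker than the center pixel. Borders wrap around (np.roll), which
    is a simplification made for this sketch.
    """
    image = np.asarray(image, dtype=np.int32)
    codes = np.zeros(image.shape, dtype=np.uint64)
    for dy, dx in offsets:
        # neighbor[y, x] == image[y + dy, x + dx], with wraparound
        neighbor = np.roll(image, shift=(-dy, -dx), axis=(0, 1))
        codes = (codes << np.uint64(1)) | (neighbor < image).astype(np.uint64)
    return codes


def hamming_distance(a, b):
    """Number of differing bits between two census codes."""
    return bin(int(a) ^ int(b)).count("1")
```

Because each bit depends only on the ordering of two intensities, applying the same positive gain and offset to an image leaves its codes unchanged. A matching cost for any of the applications listed above could then be formed by summing hamming_distance over a correlation window, under the same assumptions.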


REFERENCES

[1] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International Journal of Computer Vision, vol. 47, no. 1–3, pp.
7–42, 2002. 2, 21, 38, 68, 187, 188
[2] R. Zabih, Individuating Unknown Objects by Combining Motion and Stereo, August 1994,
PhD dissertation, Stanford University. 2, 41, 45
[3] W. J. Dally, U. J. Kapasi, B. Khailany, J. H. Ahn, and A. Das, “Stream processors:
Programmability with efficiency,” Queue, vol. 2, no. 1, pp. 52–62, 2004. 5, 12
[4] S. Mendis, S. Kemeny, R. Gee, B. Pain, C. Staller, Q. Kim, and E. Fossum, “CMOS active
pixel image sensors for highly integrated imaging systems,” IEEE Journal of Solid-State
Circuits, vol. 32, no. 2, pp. 187–197, February 1997. 6
[5] R. Zabih and J. Woodfill, “Non-parametric local transforms for computing visual correspondence,” in Proceedings of the third European conference on Computer Vision (ECCV).
Secaucus, NJ, USA: Springer-Verlag New York, Inc., 1994, pp. 151–158. 8, 25, 38, 112,
115
[6] G. van der Wal, M. Hansen, and M. Piacentino, “The Acadia vision processor,” Proceedings
of the Fifth IEEE International Workshop on Computer Architectures for Machine Perception, pp. 31–40, 2000. 8
[7] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, “GPU
computing,” Proceedings of the IEEE, vol. 96, no. 5, pp. 879–899, May 2008. 10, 11
[8] M. Gokhale, J. Cohen, A. Yoo, W. M. Miller, A. Jacob, C. Ulmer, and R. Pearce, “Hardware
technologies for high-performance data-intensive computing,” Computer, vol. 41, no. 4, pp.
60–68, April 2008. 11
[9] J. Chase, B. Nelson, J. Bodily, Z. Wei, and D. J. Lee, “FPGA and GPU architectures for
real-time optical flow calculations: A comparison study,” in IEEE Symposium on Field-Programmable Custom Computing Machines, Palo Alto, CA, USA, April 2008. 11
[10] J. Lu, G. Lafruit, and F. Catthoor, “Fast variable center-biased windowing for high-speed
stereo on programmable graphics hardware,” IEEE International Conference on Image Processing (ICIP), vol. 6, pp. VI–568–VI–571, October 2007. 11, 232
[11] A. Brunton, C. Shu, and G. Roth, “Belief propagation on the GPU for stereo vision,” The
3rd Canadian Conference on Computer and Robot Vision, pp. 76–76, 7–9 June 2006. 11

[12] R. Yang and M. Pollefeys, “Multi-resolution real-time stereo on commodity graphics hardware,” Proceedings of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, vol. 1, pp. I–211–I–217, 18–20 June 2003. 11
[13] J. Ohmer, F. Maire, and R. Brown, “Real-time tracking with non-rigid geometric templates
using the GPU,” International Conference on Computer Graphics, Imaging and Visualisation, pp. 200–206, 26–28 July 2006. 11
[14] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, “Introduction to the cell multiprocessor,” IBM J. Res. Dev., vol. 49, no. 4/5, pp. 589–604, 2005.
11
[15] J. Kurzak and A. Buttari, “Introduction to programming high performance applications on
the CELL Broadband Engine,” 15th Annual IEEE Symposium on High-Performance Interconnects (HOTI), August 22–24, 2007. 11
[16] B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles,
A. Chang, and S. Rixner, “Imagine: Media processing with streams,” IEEE Micro, vol. 21,
no. 2, pp. 35–46, 2001. 12
[17] J. H. Ahn, W. Dally, B. Khailany, U. Kapasi, and A. Das, “Evaluating the Imagine stream
architecture,” Proceedings of the 31st Annual International Symposium on Computer Architecture, pp. 14–25, June 19–23, 2004. 12
[18] M. Butts, A. Jones, and P. Wasson, “A structural object programming model, architecture, chip and tools for reconfigurable computing,” 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 55–64, April 23–25, 2007. 12
[19] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C.
Miao, J. Brown, and A. Agarwal, “On-chip interconnection architecture of the Tile processor,” IEEE Micro, vol. 27, no. 5, pp. 15–31, September–October, 2007. 12
[20] D. Helgemo, “Digital signal processing at 1 GHz in a field-programmable object array,” Proceedings of the IEEE International Systems-on-Chip (SOC) Conference, pp. 57–60, September 17–20, 2003. 13
[21] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin, and B. Hutchings, “A reconfigurable
arithmetic array for multimedia applications,” in Proceedings of the 1999 ACM/SIGDA seventh international symposium on Field programmable gate arrays. New York, NY, USA:
ACM, 1999, pp. 135–143. 13
[22] T. Stansfield, “Using multiplexers for control and data in D-Fabrix.” in Field Programmable
Logic and Applications (FPL), ser. Lecture Notes in Computer Science, P. Y. K. Cheung,
G. A. Constantinides, and J. T. de Sousa, Eds., vol. 2778. Springer, 2003, pp. 416–425. 13
[23] P. Master, “Reconfigurable hardware and software architectural constructs for the enablement of resilient computing systems,” International Conference on Application-specific Systems, Architectures and Processors, pp. 50–55, Sept. 2006. 13

[24] S. Kelem, B. Box, S. Wasson, R. Plunkett, J. Hassoun, and C. Phillips, An Elemental Computing Architecture for SD Radio, Element CXI, Inc., 2007, white paper. 13
[25] J. M. Arnold, “Software configurable processors,” IEEE International Conference on
Application-specific Systems, Architectures and Processors (ASAP), pp. 45–49, September
2006. 13
[26] R. Razdan and M. Smith, “A high-performance microarchitecture with hardware-programmable functional units,” Proceedings of the 27th Annual International Symposium on Microarchitecture, pp. 172–180, November 30–December 2, 1994. 13
[27] R. Wittig and P. Chow, “OneChip: an FPGA processor with reconfigurable logic,” Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), pp.
126–135, April 17–19, 1996. 13
[28] R. Gonzalez, “Xtensa: a configurable and extensible processor,” Micro, IEEE, vol. 20, no. 2,
pp. 60–70, March/April 2000. 13
[29] P. Chow, S. O. Seo, J. Rose, K. Chung, G. Paez-Monzon, and I. Rahardja, “The design of
an SRAM-based field-programmable gate array—Part I: Architecture,” IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, vol. 7, no. 2, June 1999. 16
[30] ——, “The design of a SRAM-based field-programmable gate array—Part II: Circuit design
and layout,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7,
no. 3, pp. 321–330, September 1999. 16
[31] M. Herbordt, T. VanCourt, Y. Gu, B. Sukhwani, A. Conti, J. Model, and D. DiSabello,
“Achieving high performance with FPGA-based computing,” Computer, vol. 40, no. 3, pp.
50–57, March 2007. 16
[32] P. Graham and B. Nelson, “Genetic algorithms in software and in hardware-a performance
analysis of workstation and custom computing machine implementations,” Proceedings of
the IEEE Symposium on FPGAs for Custom Computing Machines, pp. 216–225, 17–19
April 1996. 17
[33] ——, “FPGA-based sonar processing,” in Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays (FPGA). New York, NY, USA:
ACM, 1998, pp. 201–208. 17
[34] W. MacLean, “An evaluation of the suitability of FPGAs for embedded vision systems,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 3,
20-26 June 2005. 18
[35] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 203–215,
February 2007. 18
[36] M. Brown, D. Burschka, and G. Hager, “Advances in computational stereo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 993–1008, Aug.
2003. 21, 210

[37] L. D. Stefano, M. Marchionni, S. Mattoccia, and G. Neri, “A fast area-based stereo matching algorithm,” in 15th IAPR/CIPRS International Conference on Vision Interface, Calgary,
Canada, May 27–29 2002. 22, 23
[38] W. van der Mark and D. Gavrila, “Real-time dense stereo for intelligent vehicles,” IEEE
Transactions on Intelligent Transportation Systems, vol. 7, no. 1, pp. 38–50, March 2006.
22, 29, 241
[39] L. Di Stefano, M. Marchionni, and S. Mattoccia, “Real-time dense stereo on a personal computer,” International Workshop on Computer Architectures for Machine Perception, May
2003. 22
[40] P. Fua, “A parallel stereo algorithm that produces dense depth maps and preserves image
features,” Machine Vision and Applications, vol. 6, pp. 34–49, 1993. 22
[41] S. Gautama, S. Lacroix, and M. Devy, “Evaluation of stereo matching algorithms for occupant detection,” Proceedings of the International Workshop on Recognition, Analysis, and
Tracking of Faces and Gestures in Real-Time Systems, pp. 177–184, 1999. 22, 26, 241
[42] J. Banks and P. Corke, “Quantitative evaluation of matching methods and validity measures
for stereo vision,” International Journal of Robotics Research, vol. 20, no. 7, pp. 512–532,
July 2001. 22, 192, 207, 209, 210, 240, 241
[43] T. Kanade and M. Okutomi, “A stereo matching algorithm with an adaptive window: Theory
and experiment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16,
no. 9, pp. 920–932, 1994. 23, 222, 342
[44] A. Fusiello, V. Roberto, and E. Trucco, “Symmetric stereo with multiple windowing,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 14, no. 8, pp.
1053–1066, December 2000. 23, 38, 222, 225, 351
[45] H. Hirschmüller, “Improvements in real-time correlation-based stereo vision,” Proceedings
of the IEEE Workshop on Stereo and Multi-Baseline Vision, pp. 141–148, 2001. 24, 38, 68,
222, 227, 229, 241, 347
[46] H. Hirschmüller and D. Scharstein, “Evaluation of cost functions for stereo matching,” IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8, June 2007. 25,
207
[47] D. Bhat and S. Nayar, “Ordinal measures for image correspondence,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 20, no. 4, pp. 415–423, April 1998. 25, 26
[48] S. Scherer, P. Werth, and A. Pinz, “The discriminatory power of ordinal measures - towards
a new coefficient,” IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, vol. 1, 1999. 26
[49] J. Banks, M. Bennamoun, K. Kubik, and P. Corke, “A constraint to improve the reliability
of stereo matching using the rank transform,” Proceedings of the International Conference
on Acoustics, Speech, and Signal Processing, vol. 6, pp. 3321–3324, March 1999. 27

[50] J. Banks and M. Bennamoun, “Reliability analysis of the rank transform for stereo matching,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 31, no. 6, pp.
870–880, December 2001. 27
[51] http://vision.middlebury.edu/stereo. 28, 65
[52] D. Scharstein and R. Szeliski, “High-accuracy stereo depth maps using structured light.” in
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1,
Madison, WI, USA, June 2003, pp. 195–202. 28
[53] M. F. Tappen and W. T. Freeman, “Comparison of graph cuts with belief propagation for
stereo, using identical MRF parameters,” in IEEE International Conference on Computer
Vision (ICCV), Nice, France, October 2003. 35
[54] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph
cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp.
1222–1239, Nov 2001. 36
[55] J. Sun, N.-N. Zheng, and H.-Y. Shum, “Stereo matching using belief propagation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 787–800,
July 2003. 36
[56] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient belief propagation for early vision,”
International Journal of Computer Vision, vol. 2006, October 2006. 36
[57] O. Faugeras, B. Hotz, H. Mathieu, T. Viéville, Z. Zhang, P. Fua, E. Théron, L. Moll,
G. Berry, J. Vuillemin, P. Bertin, and C. Proy, Real time correlation-based stereo: algorithm, implementations and applications, Unité de recherche INRIA Sophia-Antipolis, August 1993, research Report 2013. 40, 111, 113
[58] B. Cyganek, “Matching of the multi-channel images with improved nonparametric transformations and weighted binary distance measures,” in 11th International Workshop on Combinatorial Image Analysis (IWCIA), Berlin, Germany, December 2006. 41
[59] C. Zinner, M. Humenberger, K. Ambrosch, and W. Kubinger, “An optimized software-based
implementation of a census-based stereo matching algorithm,” in Proceedings of the 4th International Symposium on Advances in Visual Computing, ser. ISVC ’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 216–227. 41
[60] A. Fusiello and L. Irsara, “Quasi-euclidean uncalibrated epipolar rectification,” in International Conference on Pattern Recognition (ICPR), Tampa, FL, December 2008. 65
[61] P. Bertin, D. Roncin, and J. Vuillemin, “Introduction to programmable active memories,”
pp. 301–309, 1989. 112
[62] T. Kanade, H. Kano, S. Kimura, A. Yoshida, and K. Oda, “Development of a video-rate
stereo machine,” Proceedings of the International Conference on Intelligent Robots and
Systems, vol. 3, pp. 95–100, August 1995. 112, 221, 248

[63] T. Kanade, A. Yoshida, K. Oda, H. Kano, and M. Tanaka, “A stereo machine for video-rate
dense depth mapping and its new applications,” Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, pp. 196–202, June 1996. 112
[64] J. Woodfill and B. Von Herzen, “Real-time stereo vision on the PARTS reconfigurable computer,” Proceedings of the 5th Annual IEEE Symposium on FPGAs for Custom Computing
Machines, pp. 201–210, April 1997. 112, 114
[65] P. Dunn and P. Corke, “Real-time stereopsis using FPGAs,” Field Programmable Logic and
Applications, vol. 1304/1997, pp. 400–409, April 1997. 112, 114, 118, 119, 120, 121, 122
[66] P. Corke and P. Dunn, “Real-time stereopsis using FPGAs,” Proceedings of IEEE Region 10
Annual Conference on Speech and Image Technologies for Computing and Telecommunications, vol. 1, pp. 235–238, December 1997. 112
[67] P. I. Corke, P. A. Dunn, and J. E. Banks, “Frame-rate stereopsis using non-parametric
transforms and programmable logic,” Proceedings of the IEEE International Conference
on Robotics and Automation, vol. 3, pp. 1928–1933, 1999. 112
[68] R. B. Porter and N. W. Bergmann, “A generic implementation framework for FPGA based
stereo matching,” Proceedings of IEEE Conference on Speech and Image Technologies for
Computing and Telecommunications, vol. 2, pp. 461–464, December 1997. 112, 114, 115
[69] K. Konolige, “Small vision system: Hardware and implementation,” in Eighth International
Symposium on Robotics Research, Japan, October 1997. 112, 260
[70] M. Arias-Estrada and J. M. Xicotencatl, “Multiple stereo matching using an extended architecture,” Field-Programmable Logic and Applications, vol. 2147/2001, pp. 203–212, 2001.
113
[71] Y. Miyajima and T. Maruyama, “A real-time stereo vision system with FPGA,” Field-Programmable Logic and Applications, vol. 2778/2003, pp. 448–457, 2003. 113, 114, 182,
253
[72] M. Perez and F. Cabestaing, “A comparison of hardware resources required by real-time
stereo dense algorithms,” IEEE International Workshop on Computer Architectures for Machine Perception, pp. 299–306, May 2003. 115
[73] A. Darabiha, J. Rose, and W. J. MacLean, “Video-rate stereo depth measurement on programmable hardware,” Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, vol. 1, pp. I–203–I–210, June 2003. 115, 248
[74] D. K. Masrani and W. J. MacLean, “Expanding disparity range in an FPGA stereo system
while keeping resource utilization low,” Computer Vision and Pattern Recognition, 2005
IEEE Computer Society Conference on, vol. 3, pp. 132–132, June 2005. 115
[75] A. Darabiha, J. MacLean, and J. Rose, “Reconfigurable hardware implementation of a
phase-correlation stereo algorithm,” Machine Vision and Applications, vol. 17, no. 2, pp.
116–132, 2006. 115

[76] D. K. Masrani and W. J. MacLean, “A real-time large disparity range stereo-system using
FPGAs,” IEEE International Conference on Computer Vision Systems, January 2006. 115,
247, 248
[77] J. Díaz, E. Ros, S. Mota, E. M. Ortigosa, and B. del Pino, “High performance stereo computation architecture,” International Conference on Field Programmable Logic and Applications, pp. 463–468, August 2005. 115
[78] J. Díaz, E. Ros, R. Carrillo, and A. Prieto, “Real-time system for high-image resolution
disparity estimation,” IEEE Transactions on Image Processing, vol. 16, no. 1, pp. 280–285,
January 2007. 115
[79] ——, “Fine grain pipeline systems for real-time motion and stereo-vision computation,”
International Journal of High Performance Systems Architecture, vol. 1, no. 1, pp. 60–68,
2007. 115
[80] Y. Jia, X. Zhang, M. Li, and L. An, “A miniature stereo vision machine (MSVM-III) for
dense disparity mapping,” Proceedings of the 17th International Conference on Pattern
Recognition (ICPR), vol. 1, pp. 728–731, 23-26 Aug. 2004. 116, 248, 261
[81] A. Gil, R. Gutiérrez, J. L. Alonso, and S. Fernández de Ávila, “Stereo calculation of significant points using a FPGA,” WSEAS Transactions on Information Science and Applications,
2004. 116
[82] M. Hariyama, N. Yokoyama, M. Kameyama, and Y. Kobayashi, “FPGA implementation of
a stereo matching processor based on window-parallel-and-pixel-parallel architecture,” 48th
Midwest Symposium on Circuits and Systems, pp. 1219–1222, August 2005. 116
[83] A. Naoulou, J. Boizard, J. Y. Fourniols, and M. Devy, “An alternative to sequential architectures to improve the processing time of passive stereovision algorithms,” International
Conference on Field Programmable Logic and Applications, pp. 1–4, August 2006. 116
[84] C. Cuadrado, A. Zuloaga, J. Martin, J. Lázaro, and J. Jiménez, “Real-time stereo vision
processing system in a FPGA,” 32nd Annual Conference of the IEEE Industrial Electronics
Society, pp. 3455–3460, November 2006. 116
[85] K. Ambrosch, M. Humenberger, W. Kubinger, and A. Steininger, “Hardware implementation of an SAD based stereo vision algorithm,” IEEE Conference on Computer Vision and
Pattern Recognition, pp. 1–6, June 2007. 116
[86] B. Cyganek, “Comparison of nonparametric transformations and bit vector matching for
stereo correlation,” in 10th International Workshop on Combinatorial Image Analysis (IWCIA), Auckland, New Zealand, December 2004. 133
[87] B. Fröba and A. Ernst, “Face detection with the modified census transform,” Proceedings of
the Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pp.
91–96, May 2004. 143

[88] A. Just, Y. Rodriguez, and S. Marcel, “Hand posture classification and recognition using the
modified census transform,” in International Conference on Automatic Face and Gesture
Recognition, April 2006. 143
[89] S. Jin, D. Kim, D. D. Nguyen, and J. W. Jeon, “Pipelined hardware architecture for high-speed optical flow estimation using FPGA,” in Field-Programmable Custom Computing
Machines (FCCM), May 2010. 143
[90] M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision.
Thomson-Engineering, 2007. 169, 247
[91] Y. Ma, S. Soatto, J. Kosecka, and S. S. Sastry, An Invitation to 3-D Vision: From Images to
Geometric Models. Springer-Verlag, November 2003. 169
[92] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision. Upper Saddle
River, NJ, USA: Prentice Hall, March 1998. 169, 176, 247
[93] H. Hirschmüller, P. R. Innocent, and J. Garibaldi, “Real-time correlation-based stereo vision
with reduced border errors,” International Journal of Computer Vision, vol. 47, no. 1-3, pp.
229–246, 2002. 187, 241
[94] R. Szeliski and R. Zabih, “An experimental comparison of stereo algorithms,” in Proceedings of the International Workshop on Vision Algorithms: Theory and Practice. Springer-Verlag, 2000, pp. 1–19, LNCS Vol. 1883. 187
[95] C. L. Zitnick and T. Kanade, “A cooperative algorithm for stereo matching and occlusion
detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 7,
pp. 675–684, 2000. 187
[96] P. I. Corke, P. A. Dunn, and J. E. Banks, “Frame-rate stereopsis using non-parametric
transforms and programmable logic,” Proceedings of the IEEE International Conference
on Robotics and Automation, vol. 3, pp. 1928–1933, 1999. 192, 207
[97] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd ed. Upper Saddle River,
NJ, USA: Prentice-Hall, Inc., 2006. 192, 194
[98] ——, Digital Image Processing, 3rd ed. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.,
2006, pp. 714–715. 196
[99] K.-J. Yoon and I. S. Kweon, “Adaptive support-weight approach for correspondence search,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 650–
656, April 2006. 214, 232
[100] Extending the World’s Most Popular Processor Architecture, Intel Corporation, 2006, white
paper. 236
[101] Intel SSE4 Programming Reference, Intel Corporation, April 2007. 236
[102] C. Vancea and S. Nedevschi, “LUT-based image rectification module implemented in
FPGA,” IEEE International Conference on Intelligent Computer Communication and Processing, pp. 147–154, September 2007. 247

[103] N. Lawal and M. O’Nils, “Embedded FPGA memory requirements for real-time video processing applications,” 25th NORCHIP Conference, pp. 206–209, November 2005. 249
[104] D. L. Cardon, W. S. Fife, J. K. Archibald, and D. J. Lee, “Fast 3D reconstruction for small
autonomous robots,” in Proceedings of 31st Annual Conference of the IEEE Industrial Electronics Society (IECON), November 2005. 259, 272, 333
[105] J. D. Anderson, D. J. Lee, and J. K. Archibald, “FPGA implementation of vision algorithms
for small autonomous robots,” in Proceedings of the SPIE vol. 6006, Intelligent Robots and
Computer Vision XVII: Algorithms, Techniques, and Active Vision, October 2005. 259, 272
[106] Y. Nagaonkar, B. Call, S. Cluff, J. Archibald, and D. Lee, “Autonomous mobile robotic
system with onboard vision using configurable logic,” Industrial Electronics Society, 2005.
IECON 2005. 31st Annual Conference of IEEE, 6–10 November 2005. 259, 272
[107] S. Ettinger, M. Nechyba, P. Ifju, and M. Waszak, “Vision-guided flight stability and control for micro air vehicles,” IEEE/RSJ International Conference on Intelligent Robots and
Systems, vol. 3, pp. 2134–2140, 2002. 260
[108] F. Ruffier and N. Franceschini, “Visually guided micro-aerial vehicle: automatic take off,
terrain following, landing and wind reaction,” Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), vol. 3, pp. 2339–2346, April 26–May 1, 2004.
260
[109] J. Redding, T. McLain, R. Beard, and C. Taylor, “Vision-based target localization from a
fixed-wing miniature air vehicle,” American Control Conference, 14–16 June, 2006. 260
[110] S. Takezawa and G. Dissanayake, “Simultaneous localisation and mapping problems in indoor environments with stereovision,” 31st Annual Conference of the IEEE Industrial Electronics Society (IECON), 6-10 Nov. 2005. 260
[111] S. Mahlknecht, R. Oberhammer, and G. Novak, “A real-time image recognition system for
tiny autonomous mobile robots,” Proceedings of the 10th IEEE Real-Time and Embedded
Technology and Applications Symposium (RTAS), pp. 324–330, 25-28 May 2004. 261
[112] R. Wood, S. Avadhanula, E. Steltz, M. Seeman, J. Entwistle, A. Bachrach, G. Barrows,
S. Sanders, and R. Fearing, “Design, fabrication and initial results of a 2g autonomous
glider,” 31st Annual Conference of the IEEE Industrial Electronics Society (IECON), 6-10
Nov. 2005. 261
[113] H. Yamada, T. Tominaga, and M. Ichikawa, “An autonomous flying object navigated by
real-time optical flow and visual target detection,” Proceedings of the IEEE International
Conference on Field-Programmable Technology (FPT), pp. 222–227, 15-17 Dec. 2003. 261
[114] Y. Jia, M. Li, L. An, and X. Zhang, “Autonomous navigation of a miniature mobile robot
using real-time trinocular stereo machine,” IEEE International Conference on Robotics, Intelligent Systems and Signal Processing, vol. 1, pp. 417–421, 8-13 Oct. 2003. 261
[115] Y. Kim, S. Park, C. Chen, and H. Jeong, “Real-time architecture of stereo vision for robot
eye,” The 8th International Conference on Signal Processing, vol. 1, 16-20 2006. 261

[116] A. Kolar, T. Graba, A. Pinna, O. Romain, B. Granado, and T. Ea, “A digital processing architecture for 3D reconstruction,” Computer Architecture for Machine Perception and Sensing.
International Workshop on, pp. 172–176, 18–20 August 2006. 261
[117] C. Worth, M. Bajura, J. Flidr, and B. Schott, “On-demand linux for power-aware embedded
sensors,” in Proceedings of the Ottawa Linux Symposium, Ottawa, Ontario Canada, July
2004. 261
[118] PC/104 Consortium. http://www.pc104.org. 264
[119] R. Kulkarni, “Soft-processing IP avoids obsolescence challenges,” Xilinx Embedded Magazine, pp. 40–42, November 2006. 276
[120] A. Telikepalli, “Performance vs. power: Getting the best of both worlds,” Xcell Journal,
Third Quarter 2005. 279
[121] Stratix III Programmable Power, Altera Corporation, 2007, white paper. 280
[122] Virtex-4 Family Overview, Xilinx, Inc., September 2007. 281
[123] Virtex-4 Packaging and Pinout Specification, Xilinx, Inc., June 2007. 281
[124] System ACE CompactFlash Solution, Xilinx, Inc., 2002, datasheet DS080. 284
[125] Platform Flash In-System Programmable Configuration PROM, Xilinx, Incorporated, 2008,
datasheet DS123. 284
[126] “IEEE standard test access port and boundary-scan architecture,” IEEE Std 1149.1-1990, 21
May 1990. 285
[127] MT48H16M16LF, MT48H8M32LF — 256Mb: x16, x32 Mobile SDRAM, Micron Technology, Inc., 2008, datasheet Rev. G 2/08 EN. 288, 291
[128] CY7C1470V25, CY7C1472V25, CY7C1474V25 — 72-Mbit Pipelined SRAM with NoBL Architecture, Cypress Semiconductor Corporation, 2005, datasheet 38-05290 Rev. *G. 289,
294
[129] MT46V64M4, MT46V32M8, MT46V16M16 — 256Mb: x4, x8, x16 DDR SDRAM, Micron
Technology, Inc., 2007, datasheet Rev. A 4/07 EN. 291
[130] Stub Series Terminated Logic for 2.5V (SSTL 2), JEDEC, May 2002, standard JESD8-9B.
291
[131] Common Flash Interface (CFI), JEDEC, September 2003, standard JESD68.01. 295
[132] CY7C68014A EZ-USB FX2LP USB Microcontroller, Cypress Semiconductor Corporation,
2005, datasheet #38-08032 Rev. I. 298
[133] Virtex-4 FPGA PCB Designers Guide, Xilinx, Inc., June 2008. 303, 310
[134] LTC1778 Wide Operating Range, No RSENSE Step-Down Controller, Linear Technology
Corporation, 2001, datasheet. 306, 308

[135] H. Johnson, High-Speed Digital Design: A Handbook of Black Magic. Prentice Hall, 1993.
310, 313
[136] Design Guide for the Packaging of High Speed Electronic Circuits, IPC, 2003, IPC-2251.
314
[137] W. S. Fife and J. K. Archibald, “Reconfigurable on-board vision processing for small autonomous vehicles,” EURASIP Journal on Embedded Systems, vol. 2007, September 2006,
article ID 80141. 333
[138] B. B. Edwards, W. S. Fife, J. K. Archibald, D. J. Lee, and D. K. Wilde, “A design approach
for small vision-based autonomous vehicles,” in Proceedings of the SPIE vol. 6384, Intelligent Robots and Computer Vision XXIV: Algorithms, Techniques, and Active Vision, Boston,
MA, USA, October 2006. 334
[139] J. D. Anderson, D. J. Lee, Z. Y. Wei, and J. K. Archibald, “Semi-autonomous unmanned
ground vehicle control system,” in Proceedings of the SPIE vol. 6230, International Symposium on Defense and Security, Unmanned Systems Technology VIII, Orlando, FL, USA,
April 2006. 334
[140] B. B. Edwards, J. K. Archibald, W. S. Fife, and D. J. Lee, “A vision system for precision MAV targeted landing,” International Symposium on Computational Intelligence in
Robotics and Automation (CIRA), pp. 125–130, 20–23 June 2007. 334
[141] A. Dennis, J. Archibald, B. Edwards, and D. J. Lee, “On-board vision-based sense-and-avoid for small UAVs,” in Proceedings of the AIAA Conference on Guidance, Navigation
and Control, Honolulu, HI, USA, August 18–21 2008. 335
[142] S. G. Fowers, B. J. Tippetts, D. J. Lee, and J. K. Archibald, “Vision-guided autonomous
quad-rotor helicopter flight stabilization and control,” in AUVSI Unmanned Systems North
America, June 2008, San Diego, CA, USA, June 10–12 2008. 335
[143] K. Lillywhite, D. J. Lee, B. Tippetts, S. Fowers, A. Dennis, B. Nelson, and J. Archibald,
“An embedded vision system for an unmanned four-rotor helicopter,” in Proceedings of the
SPIE, vol. 6384, Intelligent Robots and Computer Vision XXIV: Algorithms, Techniques, and
Active Vision, October 2 2006. 335
[144] Z. Wei, D. J. Lee, B. Nelson, and M. Martineau, “A fast and accurate tensor-based optical
flow algorithm implemented in FPGA,” IEEE Workshop on Applications of Computer Vision
(WACV), February 2007. 335
[145] Z. Wei, D. J. Lee, and B. Nelson, “FPGA-based real-time optical flow algorithm design and
implementation,” Journal of Multimedia, vol. 2, no. 5, pp. 38–45, September 2007. 335

APPENDIX A. INTRODUCTION TO STEREO VISION

A variety of different machine vision algorithms could be implemented on an FPGA-based
computing platform. One of the most important machine vision techniques is stereo vision. Stereo
vision, also called stereoscopic vision or stereopsis, is a method in which two cameras, spaced
slightly apart, are used to view a scene. Based on the differences between the two images, taken
simultaneously from both cameras, it is usually possible to calculate distance to points in either
image, if enough information about the cameras and their relative placement is known. This allows,
to some extent, a stereo vision system to determine the three-dimensional (3D) geometry of a scene.
Stereo vision is actually a constrained case in the more general field of multiple view geometry. Many vision systems have been created that use two, three, or many more camera views
in order to reconstruct the 3D geometry of a scene. Additionally, it is not required that the images
be taken by different cameras. Instead, it is possible to use a single camera that moves through the
scene and takes pictures at different times from multiple positions. Conversely, it is also possible
for the objects in the scene to move and the camera to remain stationary. These latter cases, in
which the camera and objects in the scene move relative to one another, are commonly referred to
as structure from motion. In any case, the generation of a 3D model of the scene, called reconstruction, consists of identifying corresponding points between the images and then using knowledge
of camera characteristics, camera position, camera motion, and scene motion to estimate 3D geometry.
This section will summarize stereo vision (i.e., the correspondence problem between two
camera views) in order to facilitate understanding of the algorithms discussed in this dissertation.
Only a minimum of discussion will be provided for the unfamiliar reader. For a more in-depth
discussion of camera models and stereo vision, the reader is encouraged to consult one of many
textbooks on machine vision (e.g., [90]–[92]).

A.1 Camera Models and Projection

In order to understand stereo vision, we must first have a basic understanding of image
sensors and how they perceive the world. Perhaps the simplest camera model is called the pinhole
camera model. In a pinhole camera, light from a scene passes through a tiny hole onto the image
plane, which in an actual camera would be a photographic film or an image sensor. A pinhole
camera is illustrated in Figure A.1(a). For simplicity, we often avoid the inverted image and instead
think of the image plane as being in front of the focal point, as shown in Figure A.1(b).

Figure A.1: Pinhole camera. (a) Pinhole camera, showing the focal length, focal point, optical axis, and image plane. (b) Virtual image plane.

This results in the camera projection model that is illustrated in Figure A.2, which shows
only the side view of the camera model. In this example, the point P, at world coordinates (X, Y, Z)
with the origin at the focal point of the camera, is projected onto the image plane at the point p,
having image coordinates (x, y) with the image origin in the center of the image plane. Using
this camera model, we then have the following relationship between the world coordinates and the
image coordinates:

\[ x = f \frac{X}{Z}, \tag{A.1} \]

\[ y = f \frac{Y}{Z}. \tag{A.2} \]

Figure A.2: Pinhole camera projection model.

A.2 A Simple Stereo Configuration

In the canonical stereo vision configuration, two cameras are placed a known distance apart,
called the baseline, and are pointed such that the optical axis of each camera runs parallel to the
other.
Now suppose that a point P, with world coordinates (X, Y, Z), is within the field of view of
both the left and right cameras and projects onto the points pl and pr, respectively at coordinates
(xl, yl) and (xr, yr) in the left and right image planes. This scenario is shown in Figure A.3. The
points Cl and Cr are each called the center of projection for the respective camera, and are the
same as the focal points for an ideal pinhole camera.
Based on the similar triangles formed by the points Cl P Cr and pl P pr, with b denoting the
baseline and f the focal length, we have

\[ \frac{b - x_l + x_r}{Z - f} = \frac{b}{Z}. \tag{A.3} \]

Figure A.3: Canonical stereo configuration.

Solving for the distance Z, we obtain

\[ Z = f \frac{b}{x_l - x_r}. \tag{A.4} \]

The distance xl − xr in Equation A.4 is called the disparity for the points pl and pr and
represents the difference in horizontal position between the views of a point P as seen by the two
cameras. The fundamental goal of a typical stereo vision system is to determine the disparity for
every pixel in either the left or right camera view. This information can immediately be used to
determine relative distance between points in a scene or to estimate real world distance.
One common way to represent this disparity information is with images generated directly
from the disparity values. Such images are commonly called disparity maps or disparity images.
In such an image, each pixel’s intensity represents the disparity for that point in the view of a scene.
The left and right images as well as the disparity map for a common stereo vision dataset, known
as the Tsukuba images, are shown in Figure A.4. Due to the simple, inverse relationship between
disparity and distance (Equation A.4), these images are also commonly called depth maps.
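
The relationship between projection, disparity, and distance can be checked numerically. The following Python sketch projects a world point into both views using Equations A.1 and A.2 and then recovers its distance with Equation A.4. The function names, the numeric values, and the choice to place the world origin at the left camera's focal point (with the right camera displaced by +b along the X axis) are assumptions made for illustration only.

```python
def project(X, Y, Z, f):
    """Pinhole projection (Equations A.1 and A.2): world point -> image point."""
    return f * X / Z, f * Y / Z


def depth_from_disparity(xl, xr, f, b):
    """Distance Z from Equation A.4; returns None if the disparity is not positive."""
    disparity = xl - xr
    if disparity <= 0:
        return None  # point at infinity or an invalid match
    return f * b / disparity


# Assumed setup: origin at the left focal point, right camera shifted by +b along X.
f, b = 0.008, 0.12          # focal length and baseline in meters (example values)
X, Y, Z = 0.3, 0.1, 2.0     # world point expressed in the left camera frame

xl, yl = project(X, Y, Z, f)        # left image coordinates
xr, yr = project(X - b, Y, Z, f)    # the same point seen from the right camera
Z_est = depth_from_disparity(xl, xr, f, b)
```

Under these assumptions the recovered Z_est agrees with the original Z up to floating-point rounding, and the point falls on the same image row in both views, as expected for a canonical configuration.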

A.3 The Correspondence Problem

In order to determine the disparity for a point pl in the left image, we must find the point
pr in the right image that corresponds to the same point in the scene, or vice versa. Finding
corresponding points in the two images is often called the correspondence problem.

Figure A.4: The Tsukuba image dataset. (a) Left camera image. (b) Right camera image. (c) Disparity image.

Many similarity measures have been developed that quantify the similarity of two pixel
locations. Most of these measures involve a computation performed over the windows surrounding the two pixels being compared. Such window-based search algorithms are often called local
search techniques, since the disparity computations depend only on pixels within a small, finite
window. The terms area-based and template matching are also commonly used to describe such
methods. Additionally, many stereo implementations use only grayscale images (i.e., pixel intensities) and ignore color information, thus simplifying the comparison computation, although color
information can be used as additional matching criteria.
The simplest, area-based method for solving the correspondence problem is illustrated in
Figure A.5 and works as follows. First, we select either the left or right camera view as the reference
image, which is somewhat analogous to the dominant eye in the human vision system. The
other image will be referred to as the search image. We then consider each pixel in the reference
image, one by one. For each pixel in the reference image, we read the N × N window of pixels centered at the considered pixel. This small region of pixels is called the template window. We then
read an N × N window from the search image, called the candidate window. A similarity measure
is then used to compare the candidate window to the template window. Next, the candidate window is shifted by one pixel horizontally and the comparison with the template window is repeated.
This description assumes that the images have been rectified, as will be described in Section A.5.
This procedure is repeated for a specific number of search iterations, d, ensuring that a reasonable
portion of the search image is searched. After all search iterations have been completed, the pixel
in the search image whose candidate window has the best similarity value is chosen as the match.
For an M × M image, the O(M^2 N^2 d) computational complexity of this method is quite high,
although the complexity can sometimes be reduced, depending on the similarity metric, through
certain optimizations or at the expense of intermediate storage. Window sizes are typically between
5 × 5 and 21 × 21 depending on the similarity metric and image resolution. In practice, the value for
the disparity search range, d, is usually a small fraction of the image width (e.g., 1/10th) and does
not need to be very large. Assuming a calibrated, canonical stereo configuration, where objects
seen by the right camera view are guaranteed to appear further to the left than they appear in the
left camera view, we can start our candidate search at the same location as the template window
and search for d iterations, where d is typically from 16 to 128 depending on the resolution of the
images, the baseline of the stereo configuration, and the need to be able to see very close objects.
One of the most commonly used similarity measures is called SAD (Sum of Absolute Differences). The SAD measure is mathematically equivalent to the L1 distance, also called rectilinear,
city-block, or Manhattan distance. To compute the SAD similarity measure, the N × N windows
around the two pixels to be compared are read and the L1 distance between the two vectors is
computed. This is shown mathematically by the expression

\[ \mathrm{SAD}: \quad \sum_{i=-R}^{R} \sum_{j=-R}^{R} \left| I_1(x_1 + i, y_1 + j) - I_2(x_2 + i, y_2 + j) \right|. \tag{A.5} \]

This expression compares pixel (x1, y1) in image I1 to pixel (x2, y2) in image I2 using an N × N
window, where N equals 2R + 1. In the case of SAD, and many other similarity measures used for
stereo vision, the result is actually a measure of dissimilarity or error. However, this text will use
the term similarity generally to refer to both similarity and dissimilarity measures.

Figure A.5: Window-based stereo matching, showing the template window in the reference image (right) and the candidate window in the search image (left). This figure assumes the input images have
been rectified (Section A.5).

The mathematical simplicity and the lack of more complicated arithmetic operations, such
as multiplication, division, and so forth, make SAD the simplest similarity measure that gives
reasonably good results. For this reason, it is the most common similarity measure used in real-time applications. Several other similarity measures for comparing pixels will be discussed in
depth in Section B.3.
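
The window-based search and the SAD measure of Equation A.5 can be sketched in a few lines of Python. This is a minimal, unoptimized illustration rather than the implementation used in this work: the names sad and match_pixel, the direction parameter, and the synthetic test images are assumptions, and the caller is assumed to choose a pixel whose window lies fully inside both images vertically.

```python
import random


def sad(ref, search, x1, y1, x2, y2, R):
    """Equation A.5: sum of absolute differences over a (2R+1) x (2R+1) window."""
    total = 0
    for j in range(-R, R + 1):
        for i in range(-R, R + 1):
            total += abs(ref[y1 + j][x1 + i] - search[y2 + j][x2 + i])
    return total


def match_pixel(ref, search, x, y, R, d_max, direction=+1):
    """Return (disparity, cost) of the candidate window with the lowest SAD.

    Assumes rectified images, so the candidate window only moves along row y.
    direction is +1 if matches lie to the right in the search image, -1 if left.
    """
    width = len(search[0])
    best_d, best_cost = None, None
    for d in range(d_max):
        xc = x + direction * d
        if xc - R < 0 or xc + R >= width:
            break  # candidate window would leave the image
        cost = sad(ref, search, x, y, xc, y, R)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


# Synthetic rectified pair: random texture, true disparity of 3 pixels everywhere.
rng = random.Random(0)
H, W, shift = 9, 32, 3
left = [[rng.randrange(256) for _ in range(W)] for _ in range(H)]
# In the right view every point appears `shift` pixels further to the left.
right = [[left[y][min(x + shift, W - 1)] for x in range(W)] for y in range(H)]

# Right image as reference, left image as search image (as in Figure A.5).
d, cost = match_pixel(right, left, x=10, y=4, R=1, d_max=8, direction=+1)
```

With the right image as the reference, the matching window in the left image lies further to the right, so the search proceeds in the positive direction. Each call evaluates up to d_max windows of N × N pixels, which is the source of the O(M^2 N^2 d) cost when repeated for every pixel.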
A search method such as that described above results in a very dense disparity map, where
a disparity estimate is given for every pixel. Other methods, called feature-based methods, instead
select a sparse set of points in the reference image and find the corresponding points in the search
image. Typically, points are chosen in the reference image that have distinct features that make
them easier to identify in the other image. Also, since the set of points to match is much smaller,
feature-based methods generally run faster. The disadvantage of the feature-based methods is
that the disparity information is unknown for much of the image, possibly resulting in objects or
obstacles that are missed by the stereo vision system.
It is the search for each pixel in a dense stereo vision system that makes stereo vision such
a computationally intensive problem. Fortunately, this search also allows a tremendous amount of
parallelism to be exploited. For example, we can easily perform dozens of comparisons in parallel,
given the right hardware architecture. Such architectures are discussed in detail in Chapter 6 and
Appendix C.

A.4 Camera Calibration

Unfortunately, the pinhole camera model and canonical stereo vision configuration are
overly simplistic. Real cameras deviate from the ideal camera model due to the shape of the
lenses used and imperfect lens alignment. Additionally, a perfect canonical stereo vision alignment is very difficult to achieve in practice, and in some stereo vision systems the cameras may be
intentionally configured so that the optical axes of the cameras are not parallel.
In order to calculate the 3D coordinates of points in the scene, we must first accurately
determine the parameters of the stereo configuration, a process called camera calibration. Several
methods exist for determining these parameters for a stereo configuration. Given the parameters,
usually encoded in the form of various matrices, it is possible to estimate the real-world position
of a corresponding pair of image points for a general stereo vision system [92].

A.5 Epipolar Geometry and Rectification

In general, given a point in one image, the problem of finding the corresponding point in
the other image involves a 2D search (i.e., horizontal and vertical in the image). However, it is
possible to reduce this 2D search to a 1D search through a process known as rectification.
The geometry of a general stereo camera configuration, known as epipolar geometry, is
illustrated in Figure A.6. As can be seen from the figure, the rays from the cameras’ centers of
projection to a point P in the scene form a plane, called the epipolar plane. The intersection of the
epipolar plane with the image planes forms two lines, one in each image plane, called the epipolar
lines. Given the point pl in the left image, the corresponding point pr in the right image must
necessarily be located on the epipolar line in the right image plane.
Rectification is a transformation that warps the images so that the epipolar lines of each
image run parallel to the horizontal image axis. Once this transformation is applied, finding the
point (xr, yr) in the right image that corresponds to the point (xl, yl) in the left image is simply
a matter of performing a 1D search along row yl of the right image. Rectification dramatically
simplifies the correspondence search and its use is assumed in most stereo vision implementations.

Figure A.6: Epipolar geometry.

A.6 Problems with Stereo Matching

On the surface, the stereo correspondence problem seems relatively straightforward, but in
practice false matches are a common and serious problem. False matches occur for many reasons.
One of the most common reasons is that the corresponding point is not visible in the other image.
In any stereo vision configuration, there will be points that are within the field of view of one
camera but not the other, as demonstrated in Figure A.7. Many of these false matches can be
eliminated by not considering points near the edge of the image, but this method can eliminate
correct matches and cannot eliminate all the false matches due to this problem.
Another common reason is that a point visible in one camera may be occluded in the other
camera. This is generally the case along object boundaries where the disparity of the foreground
pixels is larger than the disparity of the background pixels, as illustrated in Figure A.8(a). In this
figure, a point is clearly visible in the right camera’s view but is occluded by a foreground object in
the left camera’s view. Occlusions such as this occur at every object boundary. A similar situation
can occur when a surface runs parallel to the line of sight of one of the cameras, as shown in
Figure A.8(b).
Matching near object boundaries is particularly troublesome, since the background seen
next to a foreground object’s boundary changes with each view due to parallax. This is illustrated
in Figure A.9. In this figure, the region of pixels around the same point on the house, as marked
by the square over the lower right edge of the house, appears differently in each view because the
tree in the background moves differently than the house in the foreground between the two camera
views.

Figure A.7: Field of view occlusion. (a) Left camera view. (b) Right camera view. The darkened regions represent those not actually visible
by the camera. Assuming a canonical stereo configuration, the left camera sees regions on the
left edge that cannot be seen in the right camera. Similarly, the right camera sees regions on
the right edge that cannot be seen in the left camera.

Figure A.8: Point occlusion. (a) The single point on the furthest block is clearly
visible by the right camera but is occluded in the left camera view. (b) What
looks like a single point in the left camera view may be several points in the
right camera view.

Figure A.9: Object border changes, caused by parallax effects of objects at
different distances, shown in the left and right camera views.

Object distortion, also called figural distortion, makes correspondence more difficult as well.
Most objects will look slightly different when seen from different viewpoints, as illustrated in
Figure A.10. The difference becomes more significant as the object comes closer to the stereo pair
of cameras.

Figure A.10: Object distortion, caused by being seen from different viewpoints. The same object
looks slightly different in the left and right camera views.

Additionally, many objects lack sufficient texture to identify corresponding points. This
may include dark regions of the scene that did not receive sufficient lighting or objects within the
scene that have uniform color and lighting. Points on these surfaces are impossible to accurately
identify in the corresponding image using area-based correlation. In practice, the best we can do is
determine the disparity for the edges of these objects and assume that the region within the edges
has the same disparity.
Correspondence is also made more difficult by the lighting of the scene. The positions of
light sources and objects in the scene are fixed relative to each other. As a result, the difference in
camera positions will cause the specular reflections of the lighting to appear at different locations
on the objects. This causes the disparity estimates for points around such specular reflections to be
inaccurate.
In addition to scene complexities, imperfections in the cameras themselves can also result
in false matches. Noise, which is essentially random in each camera, will cause corresponding
regions of the images to look slightly different. Vignetting, or the darkening of an image near the
edge of the lens, can make the corresponding regions of the images appear darker in one image.
Differences in pixel gain and bias can also affect the two images differently. The stereo image
processing must take these effects into account in order to allow for accurate correlation.
Clearly, stereo correspondence is an ill-posed problem with inherent ambiguities. The
false matches caused by the scenarios described above are the greatest challenges when creating
a robust stereo vision system. These false matches often result in 3D point estimates that are
horribly incorrect and, if too frequent, make the use of such a stereo vision system impractical.
As a result, it is better to discard a potentially false match rather than allow it to corrupt the scene
reconstruction. In other words, no disparity estimate is generally better than an incorrect disparity
estimate.
The ability to reduce and filter false matches is critical for any useful stereo vision system,
particularly those intended for use on autonomous vehicles. Many algorithms have been proposed
to reduce the number of false correspondences. Much of this chapter will be dedicated to the
analysis of such methods.

A.7 Window Summing Optimizations

The naive implementation of most area-based correlation methods has a computational
complexity of O(M^2 N^2 d), assuming the image size is M × M and the correlation window size
is N × N. However, since most similarity measures involve summing a simple function over the
correlation window, it is possible to reuse much of the computation for each similarity measure
evaluation.
For example, suppose we compute the similarity measure for the pixels at columns xl in
the left image and xr in the right image, both on the same pixel row. When we later wish to
compute the similarity measure for pixels xl + 1 and xr + 1, N − 1 columns of the correlation
window will be exactly the same as for the previous similarity measure computation and only one
column of the correlation window needs new computation. Furthermore, the new column reuses
N − 1 pixels from the same column of the previous row. If we reuse data in the computation of
both the correlation window sum and the new column sum then we can reduce the computational
complexity of correlation from O(M^2 N^2 d) to O(M^2 d). In other words, a larger correlation window
does not affect the amount of computation required. Figure A.11 illustrates how the window and
column summing optimizations work.
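
The combined window and column summing optimization can be illustrated in software. The Python sketch below sums a 2D array of per-pixel difference values (for one fixed disparity) over every N × N window, first directly and then with running column and window sums. The function names and the small test array are assumptions for illustration, and results are indexed by the top-left corner of each window rather than by its center.

```python
def window_sums_naive(diff, N):
    """Direct N x N sums: O(N^2) work per output pixel."""
    H, W = len(diff), len(diff[0])
    out = [[0] * (W - N + 1) for _ in range(H - N + 1)]
    for y in range(H - N + 1):
        for x in range(W - N + 1):
            out[y][x] = sum(diff[y + j][x + i] for j in range(N) for i in range(N))
    return out


def window_sums_fast(diff, N):
    """Same sums using running column sums and a running window sum."""
    H, W = len(diff), len(diff[0])
    out = [[0] * (W - N + 1) for _ in range(H - N + 1)]
    # One column sum per image column, initialized for the first band of N rows.
    col = [sum(diff[j][x] for j in range(N)) for x in range(W)]
    for y in range(H - N + 1):
        if y > 0:
            # Slide each column sum down one row: add the new pixel
            # difference and subtract the trailing one.
            for x in range(W):
                col[x] += diff[y + N - 1][x] - diff[y - 1][x]
        win = sum(col[:N])  # first window of this row (initialization only)
        out[y][0] = win
        for x in range(1, W - N + 1):
            # Slide the window right: add the new column sum and
            # subtract the trailing column sum.
            win += col[x + N - 1] - col[x - 1]
            out[y][x] = win
    return out


# Small deterministic array of difference values used to compare the two versions.
diff = [[(3 * x + 5 * y + x * y) % 11 for x in range(8)] for y in range(6)]
same = window_sums_fast(diff, 3) == window_sums_naive(diff, 3)
```

Apart from the initialization at the start of each row, the fast version performs a fixed number of additions and subtractions per output value regardless of N, which is the behavior described above. The col list plays the role of the buffer of previous column sums.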

Figure A.11: Window summing optimization using column sums. (a) Computing column sums:
new column sum = column sum from previous row + new pixel difference − trailing pixel difference.
This requires an M-word buffer to obtain the previous column sum and an MN-word
buffer to obtain the trailing pixel difference. (b) Computing window sums:
new window sum = previous window sum + new column sum − trailing column sum.
This requires a 1-word buffer to obtain the previous window sum and an N-word buffer to obtain the
trailing column sum.

Use of such summing optimizations leads to trade-offs between computational complexity and memory requirements. Consider the summing optimization of Figure A.11. This method
requires an M-word memory buffer to store the previous M column sums, an MN-word memory
buffer to store the previous MN pixel difference values, and will typically use an N-word memory
buffer to store the previous N column sums. The N-word column sum buffer avoids the need to
read the same column sum buffer twice for each new window sum computation. The width of each
of these memories must be sufficient to hold the pixel differences and column sums, respectively,
where the widths depend on the number of bits used to represent the pixels, the image preprocessing method, the similarity measure being employed, and the size of the correlation window.
It is not required to use both the window summing optimization and the column summing
optimization. For example, the stereo implementation of Miyajima and Maruyama [71] does not
use the column summing optimization. Instead, it computes each new column sum by reading the
new column of pixels for the next correlation window, one pixel at a time, and using an iterative
serial adder to compute the new column sum. This increases the computational complexity to
O(M^2 N d) but only requires N column sums to be buffered in memory. However, the serial nature
of their implementation greatly decreases the throughput they achieve.
Another variation of the window summing optimization is shown in Figure A.12. In this
version, we use row sums to compute new window sums instead of using columns. This variation
requires an N-word memory to store the previous pixel difference values, an M-word memory to
store the previous window sums, and an MN-word memory to store the previous row sums.
Note that in a highly parallelized hardware implementation, there is typically a separate
similarity module for computing the similarity measure for each disparity value. Each similarity module will require its own copy of each of these buffers, leading to a significant amount of
storage. It may be possible to share these memories, but access to the memories must be carefully controlled so as to avoid conflicts and to avoid exceeding the throughput capabilities of the
memories. In practice, many stereo implementations employ a compromise between using the full
window summing optimization and performing redundant computation so as to reduce the number
of independent memories and the amount of memory required for the implementation.
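
To give a rough sense of scale, the buffer sizes listed above can be added up for a particular configuration. The following Python sketch counts buffer words only. The image width, window size, and number of parallel similarity modules are assumed example values, and the differing bit widths of the individual buffers are ignored.

```python
def column_sum_buffer_words(M, N):
    """Words per similarity module for the column-sum scheme (Figure A.11)."""
    return M + M * N + N  # column sums + pixel differences + recent column sums


def row_sum_buffer_words(M, N):
    """Words per similarity module for the row-sum scheme (Figure A.12)."""
    return N + M + M * N  # pixel differences + window sums + row sums


# Assumed example: 640-pixel-wide image, 9 x 9 window, 64 parallel disparities.
M, N, d = 640, 9, 64
per_module = column_sum_buffer_words(M, N)
total_words = d * per_module  # each similarity module keeps its own copy
```

For these assumed values each module needs 6,409 words, or 410,176 words across all 64 modules. The two schemes require the same number of words and differ only in how those words are distributed among the buffers.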

Figure A.12: Window summing optimization using row sums. (a) Computing row sums:
new row sum = row sum from previous column + new pixel difference − trailing pixel difference.
This requires a 1-word buffer to obtain the previous row sum and an N-word buffer to obtain the
trailing pixel difference. (b) Computing window sums:
new window sum = window sum from previous row + new row sum − trailing row sum.
This requires an M-word buffer to obtain
the previous window sum and an MN-word buffer to obtain the trailing row sum.

APPENDIX B.

STEREO VISION METHODS

In this chapter, I compare the correlation accuracy of a variety of stereo correlation methods
and optimizations over a range of parameters using well-accepted image data sets and consistent
error metrics. Section 2.4 will discuss some of the existing work that has been done with stereo vision algorithms that are well-suited to hardware implementation. Section B.1 will discuss metrics
for evaluating the quality of an image disparity map. These metrics will be used to compare the
correlation accuracy of the stereo vision algorithms. The final three sections, B.2–B.4, represent
the core of this chapter and will discuss in detail the three essential components of a stereo correspondence implementation, including preprocessing, correlation, and post-processing. Various
techniques for each will be quantitatively compared.
To facilitate the gathering of data for use in the comparison of stereo methods, a stereo vision software framework has been written in C++ with the use of OpenCV for primitive image data
types and image file management. This framework allows the accuracy of a variety of area-based
stereo vision algorithms to be evaluated by running them on several common stereo image datasets.
Additionally, the various algorithm parameters can be easily adjusted to see how performance is
affected over a range of parameters before making the investment to implement an algorithm in
custom hardware. Correlation accuracy is computed by comparing the resulting image disparity
maps to accepted ground-truth data.

B.1 Quantitative Stereo Evaluation

B.1.1 Error Measures

A variety of metrics have been used to evaluate the quality of a disparity map in the presence
of reliable ground-truth data. Many of the earlier works on stereo vision, which generally used
random dot stereograms as test images, used the mean absolute error (MAE) to evaluate stereo
vision accuracy. In the context of stereo vision, this metric represents the average amount of error
in the disparity map and is defined mathematically as

\[ \mathrm{MAE}: \quad \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left| D(i,j) - T(i,j) \right|, \tag{B.1} \]

where D(i, j) is the disparity estimate for pixel (i, j) and T (i, j) is the ground-truth disparity map,
both M × N images.
Another metric is the mean squared error (MSE). The MSE uses the square of the error
instead of the absolute value, thus exaggerating larger errors. The MSE is defined mathematically
by the expression

\[ \mathrm{MSE}: \quad \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ D(i,j) - T(i,j) \right]^2. \tag{B.2} \]

A better alternative to MSE that has been more commonly used in evaluating correlation
accuracy is the root mean square (RMS) error. This is defined as the square root of the MSE, or

\[ \mathrm{RMS}: \quad \left( \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ D(i,j) - T(i,j) \right]^2 \right)^{1/2}. \tag{B.3} \]

By taking the square root of the MSE, we restore the error measure to the original units, while still
somewhat exaggerating large errors.
Of the three measures, the MAE has seen the most usage. This seems to be due to the
disadvantages of exaggerating large disparity errors through squaring. When a disparity estimate
error is larger than one, it tends to be very large, making the value of the squared error somewhat
deceptive. These large errors—when the absolute error is greater than one—are commonly called
gross errors. Additionally, most large errors occur at disparity discontinuities, where the stereo
vision system mistakenly assigns either the foreground or the background disparity to the pixel.
For practical purposes, both estimates are essentially correct and magnifying these large disparity
errors is counterproductive.
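
The effect of squaring on a single gross error can be seen with a small numerical example. The Python sketch below evaluates Equations B.1 through B.3 on a tiny disparity map. The function name and the sample values are assumptions chosen for illustration.

```python
import math


def error_measures(D, T):
    """Return (MAE, MSE, RMS) for two equally sized disparity maps."""
    diffs = [d - t for row_d, row_t in zip(D, T) for d, t in zip(row_d, row_t)]
    n = len(diffs)
    mae = sum(abs(e) for e in diffs) / n   # Equation B.1
    mse = sum(e * e for e in diffs) / n    # Equation B.2
    return mae, mse, math.sqrt(mse)        # RMS is Equation B.3


T = [[5, 5, 5, 5], [5, 5, 5, 5]]      # ground-truth disparities
D = [[5, 6, 5, 5], [5, 5, 5, 13]]     # one small error and one gross error of 8
mae, mse, rms = error_measures(D, T)
```

In this example a single gross error contributes 64 of the 65 units of total squared error, so the MSE and RMS values are dominated by one pixel while the MAE remains close to one.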

B.1.2 Pixel Categorizations

One of the problems with the error measures of Section B.1.1 is that they attempt to quantify
the amounts by which disparity estimates deviate from the ground truth. When using area-based
correlation, the amount of gross disparity error for a pixel is often not as much a function of the
correlation method as it is a function of the stereo image characteristics and random noise. In
other words, the amount of gross disparity error is not nearly as meaningful as just knowing that a
correlation mismatch occurred. For this reason, many newer error metrics attempt to classify
each disparity estimate as simply correct, incorrect, or some other classification (e.g., [1]).
Pixel categorizations make it easy to quantitatively compare stereo methods that employ
error filters, which attempt to invalidate grossly erroneous estimates. Popular pixel categorizations
for such stereo methods include the percentage of pixels that were assigned a correct disparity
estimate (which will be referred to as the percentage of correct pixels, or PCP), the percentage of
pixels that were erroneously assigned a disparity estimate (percentage of erroneous pixels, or PEP),
and the percentage of pixels for which the disparity estimates were invalidated by an error filter
(percentage of rejected pixels, or PRP) [93]. Defined in this way, these metrics have the following
property:

\[ \mathrm{PCP} + \mathrm{PEP} + \mathrm{PRP} = 100\%. \tag{B.4} \]

The definition of “correct” has also varied in the literature. For some authors, a correct
disparity estimate is defined as an exact match between the disparity estimate and the true disparity.
However, many authors define correct to be an error of one pixel worth of disparity or less [1],
[94], [95]. This definition attempts to separate gross errors from errors due mostly to rounding and
image quantization.

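With this alternate definition of correctness in mind, we define the PCP, PEP, and PRP
metrics by the expressions

\[ \mathrm{PCP}: \quad 100\% \cdot \frac{1}{MN} \sum_{[i,j] \in A} J\left( |D(i,j) - T(i,j)| \le \delta \right), \tag{B.5} \]

\[ \mathrm{PEP}: \quad 100\% \cdot \frac{1}{MN} \sum_{[i,j] \in A} J\left( |D(i,j) - T(i,j)| > \delta \right), \tag{B.6} \]

\[ \mathrm{PRP}: \quad 100\% \cdot \left( 1 - \frac{|A|}{MN} \right), \tag{B.7} \]
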

where A is the set of all coordinates for which a disparity estimate was assigned, J(B) is an indicator
function of event B (i.e., J(B) = 1 if B is true and J(B) = 0 otherwise), and δ is 1 if we only wish
to include gross errors or 0 if we wish to include all errors. Notice that, due to the relationship of
Equation B.4, only two of these metrics need to be calculated in order to determine the third.
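
As a minimal sketch of Equations B.5 through B.7, the following Python function classifies each pixel of a small disparity map. Representing a rejected pixel by None, the function name, and the sample data are assumptions made for this example.

```python
def pixel_categories(D, T, delta=1):
    """Return (PCP, PEP, PRP) in percent; None in D marks a rejected pixel."""
    total = correct = erroneous = 0
    for row_d, row_t in zip(D, T):
        for d, t in zip(row_d, row_t):
            total += 1
            if d is None:
                continue  # estimate invalidated by an error filter
            if abs(d - t) <= delta:
                correct += 1
            else:
                erroneous += 1
    pcp = 100.0 * correct / total                        # Equation B.5
    pep = 100.0 * erroneous / total                      # Equation B.6
    prp = 100.0 * (total - correct - erroneous) / total  # Equation B.7
    return pcp, pep, prp


T = [[5, 5, 5, 5], [5, 5, 5, 5]]            # ground-truth disparities
D = [[5, 6, None, 5], [5, 5, None, 13]]     # two rejected pixels, one gross error
pcp, pep, prp = pixel_categories(D, T)
```

For this data, five of the eight pixels are correct, one is erroneous, and two are rejected, so the three percentages sum to 100% as required by Equation B.4.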
Some works have also sought to subdivide the errors in order to identify the cause of the
incorrect disparity assignments. This is particularly useful when examining the characteristics of
a particular stereo method or attempting to improve its performance for certain image datasets.
Common error categorizations include [1]:
• Errors due to occlusion
• Errors due to object borders (i.e., depth discontinuities)
• Errors due to lack of texture
This work will not attempt to categorize errors in this way, but instead will rely primarily on the
total errors, without specific emphasis being placed on the precise nature of the errors.

B.1.3 Proposed Metrics

The principal disadvantage of the pixel categorization metrics described in Section B.1.2
is that the percentages for correct and erroneous pixels are taken from the set of all pixels. As a
result, a stereo method can produce a higher PCP than another method, but also have a higher PEP
and lower PRP. The higher PCP measure would suggest that the method is superior, but the higher
PEP suggests otherwise, and it becomes unclear which method is actually superior. Similarly,
a stereo method can have a relatively low PCP, but also a low PEP and a high PRP. The low PCP
value would suggest that this method is inferior when in reality the low PCP value is due to a
more aggressive error filter. A common pitfall when using these metrics is to blindly compare only
the PCP measure based on the assumption that this is the best measure of accuracy, without also
considering the relative magnitudes of the PEP and PRP metrics. This is dangerous because, for
real-world applications, it is often more important to minimize the number of gross errors than it
is to maximize the number of correct estimates.
To remedy this situation and simplify comparison between methods, a new metric is proposed for the percentage of correct disparity estimates, which will be taken only from the set of
pixels that were assigned a disparity estimate. This metric will be referred to as the percentage
correct of assigned disparities (PCA). The percentage of assigned disparities which are erroneous
then becomes 100% − PCA. Additionally, we define the disparity estimate density (DED), which
is the percentage of the total pixels that were assigned a disparity (i.e., not invalidated by an error
filter). These two metrics allow us to more easily compare two different stereo methods, either of
which may employ error filters. The first metric (PCA) is of primary importance and the second
metric (DED) is of secondary importance, depending somewhat on the application. These metrics
are defined by the following equations:
\[ \mathrm{PCA}: \quad 100\% \cdot \frac{1}{|A|} \sum_{[i,j] \in A} J\left( |D(i,j) - T(i,j)| \le \delta \right), \tag{B.8} \]

\[ \mathrm{DED}: \quad 100\% \cdot \frac{|A|}{MN}. \tag{B.9} \]

Since these new metrics are simply a consolidation of the metrics of Section B.1.2, we can easily
interchange them using the following relationships (with each metric expressed as a fraction rather than as a percentage):

PCP = PCA · DED, (B.10)

PEP = (1 − PCA) · DED, (B.11)

PRP = 1 − DED. (B.12)

It is also possible to modify the MAE and RMS error metrics (Equations B.1 and B.3) to
consider only those pixels that were assigned a disparity. This allows us to quantitatively compare
the amount of error in a stereo method employing error filters, if desired.
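
As a rough illustration of Equations B.8 through B.12, the following Python sketch computes PCA and DED for a small disparity map and converts them to the all-pixel percentages. It is only a sketch under stated assumptions: the function names are hypothetical, images are plain lists of rows, and a pixel rejected by an error filter is marked with None (a convention chosen here, not taken from the text).

```python
def disparity_metrics(D, T, delta=1.0):
    """Return (PCA, DED) in percent.

    D: estimated disparities as a list of rows; None marks a pixel that an
       error filter rejected (an assumed convention for this sketch).
    T: ground-truth disparities with the same shape as D.
    """
    total = 0      # M * N, all pixels
    assigned = 0   # |A|, pixels that received a disparity
    correct = 0    # assigned pixels within delta of the ground truth
    for d_row, t_row in zip(D, T):
        for d, t in zip(d_row, t_row):
            total += 1
            if d is None:
                continue
            assigned += 1
            if abs(d - t) <= delta:
                correct += 1
    pca = 100.0 * correct / assigned if assigned else 0.0
    ded = 100.0 * assigned / total
    return pca, ded


def to_pixel_percentages(pca, ded):
    """Convert (PCA, DED) to (PCP, PEP, PRP), all in percent."""
    pcp = pca * ded / 100.0
    pep = (100.0 - pca) * ded / 100.0
    prp = 100.0 - ded
    return pcp, pep, prp


D = [[5, 7, None, 3],
     [4, None, 9, 2]]
T = [[5, 5, 6, 3],
     [5, 4, 9, 6]]
pca, ded = disparity_metrics(D, T)
pcp, pep, prp = to_pixel_percentages(pca, ded)
```

In this toy example, six of the eight pixels are assigned a disparity and four of those six are within δ = 1 of the ground truth, so DED is 75% and PCA is about 66.7%, even though only 50% of all pixels are correct.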

B.1.4 Stereo Method Evaluation

Combined, the described metrics allow us to evaluate the accuracy of a disparity map in a
variety of ways. For most stereo vision method comparisons, this work will rely upon the PCA
metric (percentage correct of assigned disparities), where correct is defined as being within one of
the ground-truth disparity (i.e., δ = 1 in Equation B.8). However, other metrics, particularly DED
(Equation B.9), will be used where relevant in order to highlight various characteristics of a given
method.
At the time of this writing, a variety of image datasets are available for the evaluation and
comparison of different stereo methods. This work will focus on the four stereo image datasets
proposed by Scharstein and Szeliski for general stereo method evaluation. A sample camera image from each dataset was shown in Figure 2.5, on page 30, and the ground-truth disparity maps
were shown in Figure 2.6. These images exhibit a fairly diverse set of image characteristics and
represent, perhaps, the best benchmark suite currently available for the evaluation of stereo vision
methods.

B.2 Preprocessing Methods

Most area-based correlation methods do not perform well on the raw, unprocessed stereo
image pair. Many stereo implementations will apply some form of preprocessing in order to prepare the images for the correlation step. These preprocessing methods generally attempt to normalize the stereo image pair in some way to eliminate various undesirable image characteristics
that would otherwise cause incorrect matches.
SAD correlation, perhaps the most common and the only similarity measure described thus
far, will provide a baseline correspondence method for comparison. Section B.3 will introduce and
compare various other correlation methods. Care will also be taken to ensure that the use of SAD
does not bias the results in any way, although the results for other similarity measures will not be
shown in this section.
The effectiveness of SAD correlation, without any preprocessing or error filtering, is shown
in Figure B.1 for various correlation window sizes. This figure shows the percentage of pixels that
were correctly matched to within 1 of the correct pixel (i.e., Equation B.8 with δ = 1). The best
that SAD correlation, by itself, is able to achieve on average for the collective image datasets is
correct estimation of 85.7% of the pixel disparities when a 13 × 13 correlation window is used.
This correlation accuracy is the standard we wish to improve upon by preprocessing the input
stereo image pair.

Figure B.1: Correlation accuracy versus SAD window size. The plot shows correct disparities (%) against window size (3 to 21 pixels) for Tsukuba, Venus, Teddy, Cones, and their average.

B.2.1 Zero Mean Images

One of the most common causes of incorrect area-based stereo matching is radiometric
distortion. Radiometric distortion occurs when the relative brightnesses of the stereo images are mismatched, often due to differences in camera gain or bias. When one image is brighter than the
other, the error calculated is much higher than it would otherwise be. This sensitivity to relative
brightness is one of the key weaknesses of some area-based matching techniques.
A simple technique for alleviating the effects of such distortion is to create zero-mean versions of the input images. That is, the mean of each image is calculated then subtracted from
each pixel, resulting in a version of each image where the mean pixel value is zero. When used
with standard correlation techniques, such as SAD or SSD, this method is commonly referred to
as ZSAD or ZSSD. This effectively normalizes the brightnesses of the images. This kind of normalization is very important when the two images from the stereo pair are of different brightness,
possibly due to separate automatic exposure controls. This also suggests the need for a stereo
camera system to attempt to maintain the relative brightness of the two images, perhaps through a
common exposure control.
In reality, use of this technique ensures that the stereo system is invariant with respect to
intensity offset, but does not make the system invariant to camera gain or gamma variation. Nevertheless, it has proven to increase the robustness of area-based matching. This simple preprocessing
method has been studied in several works by Corke, Dunn, and Banks (e.g., [42], [96]) and will
not be discussed further here.
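
The offset invariance described above can be checked with a small sketch. The code below subtracts the global mean from each image and compares SAD scores before and after; the tiny test images, the offset of 25, and the gain of 2 are arbitrary values chosen for illustration, and the helper names are hypothetical.

```python
def zero_mean(image):
    """Subtract the global mean from every pixel of a list-of-rows image."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[p - mean for p in row] for row in image]


def sad(a, b):
    """Sum of absolute differences between two equally sized images."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


left = [[10, 20, 30], [40, 50, 60]]
brighter = [[p + 25 for p in row] for row in left]   # intensity offset only
gained = [[2 * p for p in row] for row in left]      # camera gain of 2

offset_raw = sad(left, brighter)                       # offset is penalized
offset_zm = sad(zero_mean(left), zero_mean(brighter))  # offset is removed
gain_zm = sad(zero_mean(left), zero_mean(gained))      # gain mismatch remains
```

Here the raw SAD between the image and its brightened copy is 150, the zero-mean SAD is 0, and the zero-mean SAD against the gain-scaled copy is still 90, which matches the observation that the technique handles intensity offset but not gain.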

B.2.2 Noise Filters

Another problem that causes incorrect stereo matching is image noise. The noise, being
random, is always different between the two cameras. Even fixed-pattern noise (FPN), in which
specific pixels always have the same incorrect bias or value, will be different between cameras.
Image noise is generally characterized as being either Gaussian noise or salt and pepper noise.
Salt and pepper noise is seen as sparse light and dark pixels throughout the images, such as
the one shown in Figure B.2. In practice, salt and pepper noise is not a significant problem in most
high-quality stereo camera systems. When this type of noise does occur, it is usually due to faulty
pixel sites in the image sensor, for which many modern sensors correct automatically, or flecks of
dust in the camera’s view. The classical method for dealing with salt and pepper noise is through
a median filter [97], which replaces each pixel’s value with the median pixel value of the pixel’s
neighborhood. The need for such a filter depends on the camera system being used and the stereo
method’s susceptibility to such noise.
Figure B.3 shows the effect of applying an N × N median filter to a stereo image pair in
the presence of salt and pepper noise. Each curve represents the average accuracy of correlation
when a given percentage of pixels in one image has been replaced by salt and pepper noise. A
median filter window size of N = 1 represents the case where no median filter is applied. As can
be seen from the figure, if the images have little or no salt and pepper noise then the application of
a median filter is actually detrimental to SAD correlation. This is due to the fact that the median
filter removes much of the image texture that allows the correlation method to uniquely identify a
pixel. About 1% image noise is required before the median filter has a positive effect. This is a
very large amount of salt and pepper noise, rarely found in modern digital cameras. An image with
1% salt and pepper noise is shown in Figure B.2. In no case is a median filter larger than 3 × 3
beneficial, after which point the accuracy decreases approximately linearly with median filter size.
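
The trade-off described above can be seen in a minimal N × N median filter. This is only a sketch: the function name is hypothetical, the image is a list of rows, and border pixels are simply copied unchanged, which is one of several possible border policies.

```python
from statistics import median


def median_filter(image, n=3):
    """Apply an n x n median filter; border pixels are left unchanged."""
    r = n // 2
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(r, h - r):
        for x in range(r, w - r):
            window = [image[y + j][x + i]
                      for j in range(-r, r + 1) for i in range(-r, r + 1)]
            out[y][x] = median(window)
    return out


image = [[10, 10, 10, 10, 10],
         [10, 12, 255, 12, 10],   # 255 is a "salt" pixel
         [10, 12, 14, 12, 10],    # 14 is genuine texture
         [10, 12, 0, 12, 10],     # 0 is a "pepper" pixel
         [10, 10, 10, 10, 10]]
filtered = median_filter(image)
```

The filter replaces both outliers with 12, but it also flattens the legitimate center value of 14 to 12, a small-scale example of the texture loss that hurts correlation on clean images. Calling `median_filter(image, 1)` returns the image unchanged, matching the N = 1 case in Figure B.3.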

Figure B.2: Image with 1% salt and pepper noise.

Figure B.3: Average correlation accuracy versus median filter size for 13 × 13 SAD and various levels of salt and pepper noise (no noise, 0.5%, 1.0%, 1.5%, and 2.0%). The median filter size of N = 1, for an N × N filter, indicates that no median filter was applied.

Gaussian noise is generally seen as noise that changes the value of each pixel from its true
value, usually by a small amount. It is called Gaussian because the amount of distortion usually
follows a normal distribution. In practice, most of the noise found in an image would be considered
Gaussian. Gaussian noise is especially prominent in indoor and other low-light situations where
random electrical noise becomes more significant. Simple smoothing filters, such as a Gaussian
filter or the simpler mean filter (i.e., window averaging filter), can be used to reduce the noise [97].
To test the effects of a Gaussian smoothing filter on images with Gaussian noise, Gaussian
noise, with a standard deviation that is a known percentage of the pixel range, was added to the
input stereo image pairs. Figure B.4 shows the relationship between the standard deviation (σ ) of
the Gaussian smoothing filter and the average correlation accuracy for various levels of normally
distributed noise. Unlike with salt and pepper noise, an image with a small but perceivable amount
of Gaussian noise (e.g., 1.5%) benefits slightly from a small amount of image smoothing (e.g.,
σ = 0.5). This can be seen from Figure B.4, which shows an increase in average correlation
accuracy for the noisier images and appropriate values of σ . As noise increases, so does the
amount of smoothing needed to counteract it. Benefits are lost if the standard deviation of the
smoothing filter is too aggressive for the amount of noise present in the images. A 3 × 3 mean filter,
for example, represents far too much smoothing and would have an adverse effect on correlation
accuracy for all but extreme cases of image noise.
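
One way to see why a 3 × 3 mean filter smooths so much more than a Gaussian with σ = 0.5 is to compare how much weight each leaves on the center pixel. The sketch below builds a normalized, sampled Gaussian kernel; truncating it to 3 × 3 and the function name are assumptions made for illustration.

```python
import math


def gaussian_kernel(sigma, size):
    """Sampled 2D Gaussian on a size x size grid, normalized to sum to 1."""
    r = size // 2
    raw = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            for x in range(-r, r + 1)] for y in range(-r, r + 1)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


kernel = gaussian_kernel(0.5, 3)
center_gaussian = kernel[1][1]   # weight kept on the original pixel
center_mean = 1.0 / 9.0          # a 3 x 3 mean filter weights all taps equally
```

Under these assumptions the Gaussian keeps roughly 0.62 of the weight on the center pixel, whereas the mean filter keeps only about 0.11, so the mean filter blurs fine texture far more aggressively.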
Unfortunately, the correlation accuracy improvement provided by Gaussian smoothing of
noisy images is usually small compared to the loss in accuracy due to the presence of the noise. For
example, the introduction of noise with a standard deviation equal to 2% of the pixel range reduces
the average accuracy from about 84% to about 77%. Smoothing of the noisy images increases
accuracy to just under 78%.

Figure B.4: Average correlation accuracy versus Gaussian filter standard deviation (σ) for 9 × 9 SAD and various levels of normally distributed noise (no noise and 1.0%, 1.5%, 2.0%, 2.5%, and 3.0% standard deviation).

The disadvantage of noise removal filters, such as median, Gaussian, and mean filters, is
that they not only remove noise, but they also remove the high-frequency texture that makes regions
of an image identifiable. As a result, the use of such filters often creates areas of insufficient texture,
decreasing discriminability and leading to additional mismatches.
There is also an important relationship between the amount of smoothing and the size of the
correlation window. Since smoothing decreases the amount of high-frequency texture with which
to uniquely identify a template window, we can counteract this effect somewhat by enlarging the
correlation window, increasing the amount of low-frequency content in the correlation window.

Since noise removal filters, such as Gaussian smoothing and median filters, have a negative
effect on noiseless images, the characteristic noise of the stereo camera system must be carefully
evaluated in order to determine how much noise removal, if any, is necessary. For typical full
frame-rate camera systems, a small amount of Gaussian noise will be present when the lighting is
sufficiently low, such as in indoor environments. In such situations, a small amount of smoothing
combined with a larger correlation window is beneficial.

B.2.3 Laplacian of Gaussian Filter

One of the disadvantages of area-based matching is the heavy reliance on pixel intensity
values. A more ideal preprocessing filter is one that converts the image to a form that represents
the image content more than just the pixel intensities, in addition to removing noise, thus also
making it more immune to radiometric distortion and vignetting. For this reason, the Laplacian
of Gaussian, or LoG, filter is the most commonly used preprocessing filter for area-based stereo
vision implementations. This preprocessing filter tends to give better results than the previously
discussed filters. Despite its common usage with stereo vision in the literature, justification for the
LoG parameters used is almost never given, leaving it unclear how to best apply the filter for stereo
correlation. This section will describe the effect of the LoG filter parameters.
The LoG filter is essentially a Gaussian smoothing filter followed by a Laplacian filter,
which is the sum of the second derivatives in the vertical and horizontal directions [98]. The
Gaussian smoothing filter is particularly important when combined with the Laplacian, since the
second derivative calculation is highly sensitive to noise. Because the convolution operation is
associative, we can combine the Gaussian and Laplacian filters into a single kernel and apply the
LoG filter using a single convolution. The center-weighted LoG function, with Gaussian standard
deviation σ , can be written as
\[ \mathrm{LoG}(x, y) = \nabla^2 G(x, y) = \frac{x^2 + y^2 - 2\sigma^2}{\sigma^4} \, e^{-\frac{x^2 + y^2}{2\sigma^2}}, \tag{B.13} \]

where ∇² is the Laplacian operator and G is the 2D Gaussian function. The LoG filter is also
commonly called the Mexican hat operator, due to the filter’s sombrero-like shape. The negated
LoG function is shown in Figure B.5. In practice, the LoG function may be negated, if desired,
since the sign is not usually critical.

Figure B.5: The LoG function. (a) LoG(x, y). (b) LoG(x, 0).

Several factors influence the effectiveness of the LoG filter. The most obvious is the value
of the σ parameter, which controls the spread of the LoG function. Figure B.6(a) shows the
relationship between 11 × 11 SAD correlation accuracy and the σ parameter. A window size of
11 is used because it generally leads to the best results with LoG on our datasets. As can be seen
from the figure, the optimal value for σ depends on the contents of the image. The optimal value
also depends somewhat on the correlation window size, a characteristic not shown in this figure.
Based on the average in Figure B.6(a), it would appear that the optimal value for σ is
between 1.0 and 1.1 for the collective datasets, giving an average correlation accuracy of about
87.3%. For comparison, the accuracy achieved by SAD alone without first applying the LoG
filter was at best 85.7%. The improvement is much more dramatic when the image pairs have a
significant brightness mismatch.
Interestingly, the correlation accuracy for the Cones image does not improve with increased
σ but instead does best with a smaller amount of smoothing, as shown in Figure B.6(a). This seems
to be due to the large number of disparity discontinuities in the Cones image. These discontinuities
cause more mismatches as the object boundaries become increasingly blurred. However, object
boundaries are the best place to have incorrect disparities, since stereo algorithms tend to pick
either the foreground disparity or background disparity, both of which could be considered correct
at the object boundary for practical purposes. The effect of controlling both the σ parameter and
correlation window size is shown in Figure B.7.
In the presence of noise, the effect of σ becomes increasingly important. Figure B.6(b)
shows the effect of σ on the stereo images when Gaussian noise, with a standard deviation equal
to 1% of the pixel range, has been added. For this data set, the optimal value of σ is between 1.3
and 1.4, a larger value due to the increased need to remove noise before computing the Laplacian
and performing correlation.
Another factor that influences the effectiveness of the LoG filter is the size of the kernel
used to implement it. The LoG is a bivariate function with infinite domain. For practical reasons,
we must choose a finite window to represent it. Fortunately, the LoG filter approaches zero fairly
rapidly relative to the value of σ , so a large kernel is not necessary to approximate it accurately.
As shown in Figure B.5(b), the zero crossings of the LoG function occur at σ√2, and the function
is essentially zero by 3σ√2. Therefore, with σ = 1, a kernel size of 9 × 9 is more than sufficient
to capture the LoG function for practical purposes.
Figure B.8 shows the relationship between the LoG kernel size and SAD correlation accuracy. As the figure shows, a kernel size of 3 × 3 is simply too small to reasonably capture the LoG
function with σ = 1. A 5 × 5 kernel performs much better and kernel sizes of 7 × 7 and larger
perform essentially as well as an infinite kernel.
In order to increase the efficiency of the LoG filter application, it is common practice to
approximate the LoG kernel using integers, thus avoiding higher precision floating-point or fixed-point arithmetic. Several LoG kernel approximations have been reported in the literature, in various
sizes. Several common approximations are shown in Figure B.9. A popular 7 × 7 LoG approximation was not found in the literature.
Many of these traditional kernels do not work very well as preprocessors for stereo correspondence. The 3 × 3 kernels are simply too small to capture a meaningful LoG approximation
and the σ of the 5 × 5 kernel is too small. Generation of new kernels is a simple matter of scaling
and rounding the exact N × N kernel to find one with a mean value of zero. A simple program can
be used to find them and values can be adjusted by hand to obtain zero mean or to achieve desired
coefficient characteristics (e.g., powers of two, small number of bits, etc.) if necessary.
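
A minimal sketch of such a program is given below. It samples Equation B.13 on an N × N grid, scales and rounds the samples, and then absorbs any rounding residue into the center coefficient so that the kernel sums to zero. The function names, the scale factors, and the choice to adjust only the center tap are assumptions made for illustration; they are not necessarily how the kernels in this work were derived.

```python
import math


def log_value(x, y, sigma):
    """LoG of Equation B.13 evaluated at (x, y)."""
    r2 = x * x + y * y
    return (r2 - 2.0 * sigma ** 2) / sigma ** 4 * math.exp(-r2 / (2.0 * sigma ** 2))


def integer_log_kernel(size, sigma, scale):
    """Scale and round the sampled LoG, then force a zero coefficient sum."""
    r = size // 2
    kernel = [[round(scale * log_value(x, y, sigma))
               for x in range(-r, r + 1)] for y in range(-r, r + 1)]
    # Absorb any rounding residue into the center tap so the sum is zero.
    kernel[r][r] -= sum(sum(row) for row in kernel)
    return kernel


kernel_5x5 = integer_log_kernel(5, 1.0, 7)    # scale of 7 is an assumed value
kernel_7x7 = integer_log_kernel(7, 1.0, 14)   # scale of 14 is an assumed value
```

With these assumed scale factors and σ = 1.0, the sketch produces a 5 × 5 kernel with a center value of −12 and a 7 × 7 kernel with a center value of −28, which appear to coincide with the custom kernels listed in Figure B.10(a) and B.10(c).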

Figure B.6: Correlation accuracy versus LoG σ. Correlation performed using LoG filter followed by 11 × 11 SAD on the original and noisy (1% standard deviation) versions of the stereo image pairs. (a) Original images. (b) Images with noise. Each panel plots correct disparities (%) against σ for Tsukuba, Venus, Teddy, Cones, and their average.

Figure B.7: Average correlation accuracy versus LoG σ and SAD window size. The peak of the surface is marked (σ = 1.0, SAD window size 11, 87.27% correct).

Figure B.8: Correlation accuracy versus LoG kernel size for 9 × 9 SAD and σ = 1.

(a) 3 × 3
 0   1   0
 1  -4   1
 0   1   0

(b) 3 × 3
 1   1   1
 1  -8   1
 1   1   1

(c) 3 × 3
-1   2  -1
 2  -4   2
-1   1  -1

(d) 5 × 5
 0   0  -1   0   0
 0  -1  -2  -1   0
-1  -2  16  -2  -1
 0  -1  -2  -1   0
 0   0  -1   0   0

(e) 9 × 9
 0   1   1    2    2    2   1   1   0
 1   2   4    5    5    5   4   2   1
 1   4   5    3    0    3   5   4   1
 2   5   3  -12  -24  -12   3   5   2
 2   5   0  -25  -40  -25   0   5   2
 2   5   3  -12  -24  -12   3   5   2
 1   4   5    3    0    3   5   4   1
 1   2   4    5    5    5   4   2   1
 0   1   1    2    2    2   1   1   0

Figure B.9: Common LoG kernel approximations.

Improved 5 × 5 and 7 × 7 kernels for σ ≈ 1.0 are proposed in Figure B.10. A 9 × 9 kernel
could also be generated, but kernels larger than 7 × 7 have an insignificant effect on the results of
our datasets. These kernels have a more appropriate size and σ value than those of Figure B.9.
Many other kernel approximations are possible.

(a) Custom 5 × 5
 1   2    2   2   1
 2   0   -4   0   2
 2  -4  -12  -4   2
 2   0   -4   0   2
 1   2    2   2   1

(b) Enhanced 5 × 5
 1   2    2   2   1
 2   0   -4   0   2
 2  -4  -16  -4   2
 2   0   -4   0   2
 1   2    2   2   1

(c) Custom 7 × 7
 0   0   1    1   1   0   0
 0   2   3    4   3   2   0
 1   3   0   -8   0   3   1
 1   4  -8  -28  -8   4   1
 1   3   0   -8   0   3   1
 0   2   3    4   3   2   0
 0   0   1    1   1   0   0

(d) Enhanced 7 × 7
 0   0   1    1   1   0   0
 0   2   2    4   2   2   0
 1   2   0   -8   0   2   1
 1   4  -8  -32  -8   4   1
 1   2   0   -8   0   2   1
 0   2   2    4   2   2   0
 0   0   1    1   1   0   0

Figure B.10: Improved LoG kernel approximations.

Table B.1: Correlation Accuracy for Various LoG Kernels

Kernel          Correct (%)   Description
Exact           87.27         Double-precision floating-point calculation
Fig. B.9(a)     82.24         Classic 3 × 3 approximation
Fig. B.9(b)     83.62         Classic 3 × 3 approximation
Fig. B.9(c)     70.13         Classic 3 × 3 approximation
Fig. B.9(d)     85.12         Classic 5 × 5 approximation
Fig. B.10(a)    87.11         Custom 5 × 5 (σ = 1.0)
Fig. B.10(b)    87.48         Enhanced 5 × 5 for power-of-2 arithmetic
Fig. B.10(c)    87.32         Custom 7 × 7 (σ = 1.0)
Fig. B.10(d)    87.63         Enhanced 7 × 7 for power-of-2 arithmetic
Fig. B.9(e)     87.09         Classic 9 × 9 approximation (σ = 1.4)

The kernels of Figures B.10(a) and B.10(c) are precisely scaled and rounded versions of the
LoG function for σ = 1.0. The kernels of Figures B.10(b) and B.10(d) are enhanced versions of
the other two kernels in the figure. As you can see, the coefficients have been modified to simplify
the arithmetic involved—allowing multiplications to be replaced with binary shifts. Such a change
of coefficients will modify the represented value of σ somewhat, but the optimal choice of σ is
already image dependent.
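
To make the shift-based arithmetic concrete, the sketch below evaluates one filter response using only shifts, negations, and additions, and compares it with ordinary multiplication. The coefficients are those listed for the enhanced 5 × 5 kernel of Figure B.10(b); the helper names and the test window are hypothetical, and Python integers merely stand in for the fixed-width values a hardware design would use.

```python
# Coefficients as listed for the enhanced 5 x 5 kernel of Figure B.10(b).
ENHANCED_5X5 = [[1, 2, 2, 2, 1],
                [2, 0, -4, 0, 2],
                [2, -4, -16, -4, 2],
                [2, 0, -4, 0, 2],
                [1, 2, 2, 2, 1]]


def shift_multiply(pixel, coeff):
    """Multiply an integer pixel by 0 or +/-2**k using a shift and a negation."""
    if coeff == 0:
        return 0
    magnitude = abs(coeff)
    if magnitude & (magnitude - 1):
        raise ValueError("coefficient is not a power of two")
    product = pixel << (magnitude.bit_length() - 1)
    return product if coeff > 0 else -product


def filter_response(window, kernel):
    """Sum of shifted pixels for one window position."""
    return sum(shift_multiply(p, c)
               for w_row, k_row in zip(window, kernel)
               for p, c in zip(w_row, k_row))


window = [[(3 * x + 5 * y) % 17 for x in range(5)] for y in range(5)]
by_shifts = filter_response(window, ENHANCED_5X5)
by_multiplies = sum(p * c
                    for w_row, k_row in zip(window, ENHANCED_5X5)
                    for p, c in zip(w_row, k_row))
```

Both computations give the same response, since every nonzero coefficient in this kernel is a power of two in magnitude.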
Despite the two enhanced kernels of Figure B.10 being computationally simpler than the
more exact LoG representations, these kernels actually perform marginally better on the datasets.
In fact, the 5 × 5 and 7 × 7 kernels of Figures B.10(b) and B.10(d) perform better than the ideal LoG
filter for all values of σ; the best ideal LoG filter achieved 87.29% accuracy with σ = 1.02 and
was computed using double-precision floating-point arithmetic. This indicates that other kernels
may exist that give even better performance, and suggests a possible focus for future research.
The average performance of all the kernel approximations for the proposed image datasets
is shown in Table B.1. The enhanced kernels perform at least as well as the ideal kernels in practice.

B.2.4 The Rank Transform

Two non-parametric transforms were introduced in Section 2.4, called the rank transform
(Equation 2.1) and the census transform (Equation 2.3). Due to the strong interdependence of
the census transform with the correlation method, the census transform will be discussed with
correlation methods in Section B.3. The rank transform, however, can be used with virtually any
area-based correlation method. The goal of these non-parametric transforms is to shift the matching
from reliance on image intensities to reliance on intensity ordering. Intensity ordering tends to be
much less affected by noise and other image distortions, making the rank transform a very effective
preprocessing technique for stereo vision.
One disadvantage of the rank transform is that it does not discriminate well between image
regions. For example, the rank transform of a 5 × 5 window is a single integer value in the range
of [0,24]. Additionally, the rank does not maintain any information about the order in which pixel
intensities appear, allowing for many very different pixel configurations to have the same rank
value. These characteristics make it impractical to uniquely identify a small window using its rank
transform alone.
However, the rank transform is not used by itself as a correlation method, but rather as
a preprocessing step for an area-based correlation method. Therefore, we can compensate for the
low discriminatory power of the rank transform by employing a sufficiently large window size with
the subsequent correlation method.
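
The following sketch illustrates both properties discussed above: the reliance on intensity ordering and the low discriminatory power. It assumes the common definition of the rank transform (the number of window pixels whose intensity is less than that of the center pixel), which is consistent with the [0, 24] range quoted for a 5 × 5 window. The function name and the choice to leave border pixels at zero are assumptions for illustration.

```python
def rank_transform(image, n=5):
    """Rank transform: count window pixels darker than the center pixel."""
    r = n // 2
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]   # border pixels are left at 0
    for y in range(r, h - r):
        for x in range(r, w - r):
            center = image[y][x]
            out[y][x] = sum(1
                            for j in range(-r, r + 1)
                            for i in range(-r, r + 1)
                            if image[y + j][x + i] < center)
    return out


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]                # very different layout
brighter = [[2 * p + 30 for p in row] for row in a]  # gain and offset change

rank_a = rank_transform(a, 3)[1][1]
rank_b = rank_transform(b, 3)[1][1]
rank_brighter = rank_transform(brighter, 3)[1][1]
```

All three center pixels receive a rank of 4. The gain and offset change leaves the rank untouched because the intensity ordering is preserved, while the mirrored window shows how two quite different pixel configurations can share the same rank value.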
Figure B.11 shows the average SAD correlation accuracy for the image datasets with various rank transform and SAD correlation window sizes. The highest average correlation accuracy
is achieved with a rank transform size of 7 and a SAD correlation window size of 17, for which the
correlation accuracy is 90.21%. This is a much larger SAD window size than the optimal size of
13 when SAD is used without any preprocessing (Figure B.1). However, the rank transform also
performs well with smaller SAD window sizes; 90.02% accuracy is achieved when the window is
reduced to 13.
One interesting characteristic of the rank transform is its sensitivity to the contents of the
stereo image pair. Figure B.12(a) shows the correlation accuracy for the four individual datasets
and a fixed SAD window size of 13 × 13. Notice that the optimal rank size varies by a surprising
amount depending on the image dataset. Much of the early research on the rank transform for
stereo correlation was based on just one or two image pairs, such as Tsukuba or a random dot
stereogram, making many of the previous results somewhat misleading for images in general. The
variability also makes finding the right rank window size more challenging. Despite this variability,
the rank transform is an excellent image preprocessor. For example, with a fixed rank transform
size of 9 × 9 and a 13 × 13 correlation window, which are optimal for none of the benchmark
datasets, the rank transform performs better than LoG as a preprocessor for all image datasets regardless of
the LoG filter’s σ value and correlation window size. This can be seen by comparing Figure B.6(a)
with Figure B.12(a).

Figure B.11: Average correlation accuracy versus rank transform size and SAD window size. The peak of the surface is marked (rank transform size 7, SAD window size 17, 90.21% correct).

Correlation with the rank transform, like all preprocessors, is not immune to the effects of
noise. Figure B.12(b) shows the performance of the rank transform when noise with a standard
deviation equal to 1% of the pixel range is added to the image. We can counteract this effect by
first applying a Gaussian smoothing filter before the rank transform. For example, first applying
a Gaussian filter with σ = 0.5 increases the peak accuracy for the noisy datasets from 85.5% to
86.3%.
Another significant advantage of the rank transform is that it reduces the data width required
to fully represent the processed image. This reduction in data width for typical rank transform sizes
makes the subsequent correlation method much less computationally expensive in a custom
hardware implementation, which can easily accommodate and benefit from arbitrary data widths.
The data widths required to represent rank-transformed images are shown in Table B.2.

Figure B.12: Correlation accuracy versus rank transform size using 13 × 13 SAD on original and noisy (1% standard deviation) image datasets. (a) Original images. (b) Images with noise.

Table B.2: Data Width Requirements for Rank-Transformed Images

Rank Transform Size   Output Range   Data Width (bits)
3 × 3                 [0, 8]         4
5 × 5                 [0, 24]        5
7 × 7                 [0, 48]        6
9 × 9                 [0, 80]        7
11 × 11               [0, 120]       7
13 × 13               [0, 168]       8
15 × 15               [0, 224]       8
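
The entries of Table B.2 follow directly from the window size: an N × N rank transform produces values from 0 to N² − 1, and the data width is the number of bits needed to hold that maximum. A short sketch (the function name is hypothetical) reproduces the table:

```python
def rank_data_width(n):
    """Return (maximum rank, bits needed) for an n x n rank transform."""
    max_rank = n * n - 1
    return max_rank, max_rank.bit_length()


widths = {n: rank_data_width(n) for n in (3, 5, 7, 9, 11, 13, 15)}
```

For example, `rank_data_width(7)` gives a maximum rank of 48 and a width of 6 bits, compared with the 8 bits of a typical grayscale input pixel.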

B.2.5 Preprocessing Combinations

Another preprocessing possibility is the combination of two or more preprocessing steps.
Several combinations are possible, including but not limited to:
• Median followed by LoG
• LoG followed by median
• Gaussian followed by rank
• Rank followed by Gaussian
• Rank followed by LoG
• LoG followed by rank
Combinations involving three or more preprocessing steps are also possible. Due to the large number of possible combinations, data comparing these methods will not be included here. Fortunately,
the effects can be summarized rather simply.
As a general rule, combining a superior preprocessing method (e.g., rank transform) with
an inferior one (e.g., LoG) will provide better results than the inferior preprocessor alone, but will
not perform as well as the superior preprocessor alone.
The exception to this rule is in the presence of specific image distortions for which the generally inferior preprocessor is especially well suited. For example, as discussed in Section B.2.4,
a stereo image pair with noise will benefit from a Gaussian smoothing filter prior to applying the
rank transform. In other words, it is best not to combine preprocessing methods unless there is a
specific need for an additional preprocessor due to the characteristics of the stereo camera system.

B.2.6 Summary of Preprocessing

Based on the results of this section, we can infer many useful properties of the preprocessing
filters studied. First, using zero mean images is very effective in situations where the brightness of one image of the stereo pair cannot be accurately controlled [42], [96]. However, for most image
pairs, this is not a significant problem and other preprocessing methods exist which outperform
zero mean images.
Filters to reduce the noise are useful during preprocessing only if there is a significant
amount of noise in the image. Otherwise the effect of the filters is to reduce high-frequency texture
useful for correlation matching. Median filters are helpful only in cases of extreme noise, which
generally do not occur with modern camera systems.
The Laplacian of Gaussian (LoG) filter is the most commonly used preprocessor for area-based stereo vision. It was shown to improve the results of correlation for general images. Other
works have also shown that the LoG is fairly immune to common image distortions [46]. In the
presence of image noise, it is sufficient to adjust the σ parameter to increase the smoothing integrated into the LoG. The most useful LoG filters can be effectively approximated using relatively
small kernels, such as 5 × 5 or 7 × 7, making it fairly computationally efficient. Interestingly, of
all the LoG filters tested, the best performance was achieved with a custom, 7 × 7 approximation
of the ideal LoG filter, suggesting that further improvements may be possible with other similar
kernels.
The rank transform provides even better results than the LoG for our datasets, and has the
same immunity to common image distortions. The accuracy of the correlation following the rank
transform is much more sensitive to image content, which makes the choice of an ideal rank size
more complicated. Additionally, the rank transform works slightly better when the subsequent
correlation uses a relatively large window size, such as 17 × 17, compared to 11 × 11 for the LoG.
This can make the rank transform implementation more resource intensive. However, the rank
transform does have the advantage that it can be computed using only additions and comparisons,
simplifying the implementation on many computing platforms. The rank transformed images also
usually require fewer bits to represent, simplifying some of the components needed for a custom
hardware implementation.
The average performance of the preprocessing methods for all four datasets is shown in
Table B.3. The parameters used to achieve the results are also shown. The LoG→SAD method
used the LoG kernel approximation of Figure B.10(d).

Table B.3: Preprocessing Method Summary

Method      Optimal Parameters                      Accuracy (%)
SAD Only    13 × 13 Window                          85.72
LoG→SAD     11 × 11 Window, 7 × 7 LoG, σ ≈ 1.0      87.63
Rank→SAD    17 × 17 Window, 7 × 7 Rank              90.21

It is important to note, as we saw with the data in this section, that the optimal parameters
are dependent on the images used. Therefore, these results should not be taken as ultimate truths,
but rather as guidelines in the selection of appropriate parameters for a given stereo vision system.

B.3 Correlation Methods

The correlation stage is the core of an area-based stereo vision system. It is in this step
that pixels are initially assigned disparity estimates, based on the similarity between windows in
the stereo image pairs. Many correlation methods and similarity metrics have been proposed in
the literature. In addition, researchers have continued to improve upon these methods and have
proposed many modifications and optimizations to improve the accuracy for the purpose of realtime, area-based, stereo correlation. These methods will be discussed and compared in detail in
this section.


B.3.1 Classical Similarity Measures

There are three area-based stereo correlation methods cited most frequently in the literature,
differentiated only by the similarity measure they use for stereo correlation. These are called
SAD (Sum of Absolute Differences), SSD (Sum of Squared Differences), and NCC (Normalized
Cross Correlation) [42]. These three similarity measures are defined by the following mathematical
equations:
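$$\mathrm{SAD}: \quad \sum_{(i,j) \in W} \left| I_1(x_1+i,\, y_1+j) - I_2(x_2+i,\, y_2+j) \right|, \tag{B.14}$$

$$\mathrm{SSD}: \quad \sum_{(i,j) \in W} \left[ I_1(x_1+i,\, y_1+j) - I_2(x_2+i,\, y_2+j) \right]^2, \tag{B.15}$$

$$\mathrm{NCC}: \quad \frac{\sum_{(i,j) \in W} I_1(x_1+i,\, y_1+j) \cdot I_2(x_2+i,\, y_2+j)}{\sqrt{\sum_{(i,j) \in W} I_1^2(x_1+i,\, y_1+j) \cdot \sum_{(i,j) \in W} I_2^2(x_2+i,\, y_2+j)}}. \tag{B.16}$$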

The SAD measure is mathematically equivalent to the L1 distance, also called rectilinear,
city-block, or Manhattan distance. Combined with standard area-based correlation, this is the
similarity metric we have been using thus far to evaluate preprocessing methods. Closely related
to SAD, the SSD measure is based on the L2 distance, or Euclidean distance. However, for the
purposes of correlation, it is not necessary to take the square root of the sum, since the square root
is a monotonically increasing function and the standard correlation procedure will simply choose
the lowest dissimilarity value as the best match.
One of the challenges of these two measures is that their scale is dependent upon the images
and the size of the windows, making it difficult to immediately tell how good a match is. The more
complex NCC measure attempts to normalize the comparison, yielding the value 1.0 for a perfect
match. NCC also differs in that it is a measure of similarity, rather than dissimilarity. NCC is also
commonly combined with zero-mean imagery (Section B.2.1), and is often written in the form
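$$\mathrm{ZNCC}: \quad \frac{\sum_{(i,j) \in W} \left[ I_1(x_1+i,\, y_1+j) - \bar{I}_1 \right] \cdot \left[ I_2(x_2+i,\, y_2+j) - \bar{I}_2 \right]}{\sqrt{\sum_{(i,j) \in W} \left[ I_1(x_1+i,\, y_1+j) - \bar{I}_1 \right]^2 \cdot \sum_{(i,j) \in W} \left[ I_2(x_2+i,\, y_2+j) - \bar{I}_2 \right]^2}}, \tag{B.17}$$

where $\bar{I}_n$ is the mean value of image $I_n$. In the literature, the term NCC is commonly used to refer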
to both Equations B.16 and B.17; however, the latter is often distinguished by the name ZNCC,
for zero-mean NCC [36], [42]. The NCC measure, as defined above, is normalized with respect
to variance whereas the ZNCC measure is normalized in terms of both the mean and the variance,
making it relatively insensitive to both gain and bias [36].
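
As an illustration of Equations B.14 through B.17, the following is a minimal sketch of the four measures applied to two equally sized windows, each given as a list of pixel rows. The function names are illustrative only and do not come from the original implementation. The ZNCC function takes the mean over each window, which is an assumption made here for the sake of a self-contained example.

```python
import math

def _pairs(w1, w2):
    """Flatten two equally sized windows (lists of rows) into pixel pairs."""
    return [(a, b) for r1, r2 in zip(w1, w2) for a, b in zip(r1, r2)]

def sad(w1, w2):
    """Equation B.14: sum of absolute differences (lower means more similar)."""
    return sum(abs(a - b) for a, b in _pairs(w1, w2))

def ssd(w1, w2):
    """Equation B.15: sum of squared differences (lower means more similar)."""
    return sum((a - b) ** 2 for a, b in _pairs(w1, w2))

def ncc(w1, w2):
    """Equation B.16: normalized cross correlation (1.0 is a perfect match)."""
    p = _pairs(w1, w2)
    num = sum(a * b for a, b in p)
    den = math.sqrt(sum(a * a for a, _ in p) * sum(b * b for _, b in p))
    return num / den  # undefined (division by zero) if a window is all zeros

def zncc(w1, w2):
    """Equation B.17, with each mean taken over the window (an assumption)."""
    p = _pairs(w1, w2)
    m1 = sum(a for a, _ in p) / len(p)
    m2 = sum(b for _, b in p) / len(p)
    num = sum((a - m1) * (b - m2) for a, b in p)
    den = math.sqrt(sum((a - m1) ** 2 for a, _ in p) *
                    sum((b - m2) ** 2 for _, b in p))
    return num / den  # undefined for a constant (zero-variance) window

left = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
right = [[2 * v + 5 for v in row] for row in left]  # same patch, gain 2, bias 5
```

For this pair, `zncc(left, right)` evaluates to 1.0 despite the gain and bias, while `ncc(left, right)` falls slightly below 1.0 and both `sad` and `ssd` report a large dissimilarity. This is consistent with the normalization properties described above.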
Many variations of these three classical measures exist, differing generally by the method
of normalization. Some common variations include ZSAD (Zero-mean SAD), ZSSD (Zero-mean
SSD), and NSSD (Normalized SSD). All of these variations are more computationally expensive,
but provide increased robustness to image distortions.
Of the three classical similarity measures, SAD is computationally the simplest. Since
stereo correlation is such a computationally expensive operation, SAD also seems to be the most
commonly applied of the classical measures, particularly in real-time applications. Additionally,
despite its simplicity, SAD performs very well compared to the other two measures. The performance of the SAD, SSD, and NCC correlation measures is shown in Figures B.13–B.15. Figure B.16 shows the average for the three measures in the same plot.

Figure B.13: Correlation accuracy versus SAD window size.
[Plot: correct disparities (%) versus window size (pixels); curves for Tsukuba, Venus, Teddy, Cones, and Average.]

Figure B.14: Correlation accuracy versus SSD window size.
[Plot: correct disparities (%) versus window size (pixels); curves for Tsukuba, Venus, Teddy, Cones, and Average.]

The results for SAD and SSD are quite similar; however, SAD seems to achieve slightly
better results than SSD for the four datasets, a fact observed by several other researchers. SSD
achieves its optimal results with a slightly smaller window size than SAD (11 versus 13), due to
its ability to more aggressively discriminate against erroneous pixels.
NCC generally performs better than both SAD and SSD and achieves its best results with
an even smaller window size of 9 on average for our datasets. An interesting characteristic of NCC
is its increased sensitivity to the dataset when compared to the other two measures. As shown in
Figure B.15, there is more variation in the shape of the curve and the optimal parameters for each
dataset. In contrast, SAD and SSD behave similarly for each dataset, with a variation in optimal
window size of just one or two window size steps.

Figure B.15: Correlation accuracy versus NCC window size.
[Plot: correct disparities (%) versus window size (pixels); curves for Tsukuba, Venus, Teddy, Cones, and Average.]

Figure B.16: Classical methods average correlation accuracy comparison.
[Plot: correct disparities (%) versus window size (pixels); curves for SAD, SSD, and NCC.]

Unfortunately, NCC is very computationally expensive, as indicated by the form of Equation B.16. Assuming the window size is N × N, each NCC similarity computation requires 3(N² − 1) additions, 3N² + 1 multiplications, one square root, and one division. In contrast, the SAD measure, shown in Equation B.14, requires only N² − 1 addition operations, N² subtraction operations,
and N² absolute value computations. Furthermore, the types of computation required by NCC (i.e.,
multiplication, square root, and division) are much more expensive than the operations of SAD.
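
The operation counts quoted above can be tabulated for any window size. The following sketch simply evaluates those expressions; the function names are hypothetical and the counts assume the naive evaluation of Equations B.14 and B.16, with no reuse of results between windows.

```python
def ncc_ops(n):
    """Operation counts for one naive N x N NCC evaluation (Equation B.16)."""
    n2 = n * n
    return {"add": 3 * (n2 - 1), "mul": 3 * n2 + 1, "sqrt": 1, "div": 1}

def sad_ops(n):
    """Operation counts for one naive N x N SAD evaluation (Equation B.14)."""
    n2 = n * n
    return {"add": n2 - 1, "sub": n2, "abs": n2}

# Window sizes that gave the best average accuracy for each measure.
ncc_at_9 = ncc_ops(9)    # 240 additions, 244 multiplications
sad_at_13 = sad_ops(13)  # 168 additions, 169 subtractions, 169 absolute values
```

Even at its smaller optimal window size, NCC needs more arithmetic operations than SAD, and a large share of them are multiplications.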
Another important consideration is how the three measures compare when subject to noise.
This is particularly important in light of the fact that SSD uses squared error, which tends to amplify
noise error and increase the number of mismatches in the presence of noise. Cross correlation also
tends to be highly susceptible to noise. Since SAD only sums absolute error, we might expect it to
be somewhat more resilient in the presence of image noise.
Figure B.17 shows the dataset averages for the three correlation methods in the presence
of 1% standard deviation noise, a faintly perceptible amount when viewed with the human eye.
In this figure we see that the performance of the three methods is much closer in the presence of
noise, with the benefits of NCC being severely diminished. All three methods benefit from a larger
window size, which tends to average out the effect of the noise. Additionally, the highest accuracy
is delivered by SAD. In real-time and indoor applications, noise tends to be higher due to the
required combinations of low exposure time and/or high pixel gains. Thus, SAD has the double
benefit of being computationally less expensive for real-time applications and having superior
robustness in the presence of image noise.
The curves shown in the figures of this section highlight an interesting property of area-based correlation methods, which are based on the matching of small image windows. The accuracy of these correlation methods increases with window size up to a point, then begins to decrease
with increased window size.
As we initially increase the window size, we increase the ability of the similarity measure
to accurately discriminate between windows and average out the effects of noise and image quantization. However, area-based correlation tends to perform more poorly at disparity discontinuities.
This is because correlation windows near these discontinuities will overlap both foreground and
background objects, with the background appearing different in each image due to parallax. A
larger correlation window will more often overlap regions of differing disparity, causing an increased number of correlation mismatches near object boundaries. As a result we see diminishing
returns with increased correlation window size. The peak of the accuracy curve, which represents the optimal
window size, is dependent on the input images.

Figure B.17: Classical methods correlation accuracy with 1% standard deviation noise.
[Plot: correct disparities (%) versus window size (pixels); curves for SAD, SSD, and NCC.]

B.3.2 Non-Uniform Windows

One of the great assumptions of the stereo vision community is the use of square, or at
least rectangular, windows. Such windows are used almost universally for real-time applications,
due to the increased difficulty of computing similarities for arbitrarily shaped windows on general-purpose computers. Uniformly-weighted windows are also the norm, where each pixel of the
template window is considered to be of equal importance.
An arbitrary window shape and/or weighting generally requires a separate kernel to be
read in order to determine the relative importance of each pixel to be included in the similarity
measure, greatly increasing the computational burden. Additionally, a non-square window with
arbitrary pixel weights prevents the use of certain window summing optimizations for computing
the similarity metric. Such optimizations will be described in Appendix C. Yet, even for applications which do not require high performance, the square, uniformly-weighted window is the norm.
Despite the increased computational complexity, some researchers have recognized the importance
of non-uniformly weighted windows (e.g., [99]).


The performance limitations of arbitrary window shape and weighting do not exist in all
architectures. In particular, when custom hardware is used and data is sufficiently parallelized, an
arbitrary window shape does not affect the throughput of the system. Additionally, arbitrary pixel
weights can be applied with no performance overhead. If the weights are convenient powers of
two, the resource requirements for implementation may also be small.
As we have seen, the accuracy of the correlation method is strongly dependent on the
window size employed, where the optimal window size is a compromise between covering as
much area as possible and reducing the amount of overlap with disparity discontinuities. The
further a pixel is from the center of the window, the less likely it is overlapping the same object in
the image and the less likely it has the same disparity. Conversely, the closer a pixel is to the center
of the window, the more likely it overlaps the same object and has the same disparity. In a similar
fashion, the closer a pixel is to the center of the window, the more likely it is to appear the same
in the corresponding window of the other image. Thus it would seem that the optimal window is
a circular one that weights pixels closer to the center of the window more than the pixels further
from the center of the window. A seemingly natural choice for the weighting of such a window is
the Gaussian function.
Figure B.18 shows the accuracy of SAD correlation when a Gaussian-weighted window
is used in place of the traditional, uniformly-weighted, square window, for varying values of σ .
The average accuracy shown in this figure is not a significant improvement over that of a square
window (Figure B.13). Even if the improvement were significant, the optimal value for σ is around
3.7, which requires a fairly large window to accurately approximate, thus requiring significant
computation.
We can reduce the computational load by limiting the size of the window, essentially truncating the Gaussian function. For these experiments, the value of σ was set relative to the size
of the window. The relative sizes of the window and σ were then varied to find the ratio of σ
to window radius that gives the highest correlation accuracy. The optimal ratio was found to be
around 0.75.
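
The truncated Gaussian weighting described above can be sketched as follows. This is only an illustration under stated assumptions: the window radius is taken to be (N − 1)/2 for an N × N window, the weights are left unnormalized, and the function names are hypothetical.

```python
import math

def gaussian_window(n, ratio=0.75):
    """Weights for an n x n window (n odd, n >= 3) with sigma = ratio * radius."""
    radius = (n - 1) / 2          # assumed definition of the window radius
    sigma = ratio * radius
    return [[math.exp(-((y - radius) ** 2 + (x - radius) ** 2) / (2 * sigma ** 2))
             for x in range(n)] for y in range(n)]

def weighted_sad(w1, w2, weights):
    """SAD in which each absolute difference is scaled by its window weight."""
    return sum(wt * abs(a - b)
               for r1, r2, rw in zip(w1, w2, weights)
               for a, b, wt in zip(r1, r2, rw))

weights = gaussian_window(13)  # sigma = 0.75 * 6 = 4.5
```

The center pixel receives a weight of 1.0 and the weights fall off toward the edges and corners, so pixels that are more likely to lie on a different object contribute less to the similarity measure.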

Figure B.18: SAD Correlation accuracy with Gaussian-weighted window.
[Plot: correct disparities (%) versus sigma; curves for Tsukuba, Venus, Teddy, Cones, and Average.]

Figure B.19 shows the correlation accuracy for a Gaussian-weighted window when σ is
set to be 0.75 of the window radius, this time comparing the average accuracy for the datasets to
the average for a uniformly-weighted, square window. Again we see that the Gaussian-weighted
window has a slightly better peak accuracy than the classical square window. However, achieving
this peak accuracy still requires a larger window, and therefore a higher computational burden.
The reason for the superior performance of the square window for small window sizes is
a direct effect of the need to balance window size for optimal accuracy. For smaller windows,
the unit weighting of pixels near the window edges and corners effectively makes the small square
window larger than a similarly sized Gaussian-weighted window. This expanded coverage near the
edges and corners allows the square window to more accurately identify image regions. However,
as the window size is increased, these edges and corners actually become a hindrance, as they are
more likely to overlap other objects at different disparities, thus partially corrupting the similarity
measure for the window.
The seemingly ideal Gaussian-weighted window does generally achieve higher correlation
accuracy than the classical square window, but at a higher computational cost either in terms of
hardware resources required or in raw execution time. However, the increase in accuracy is often
negligible, making the square window not only convenient from an implementation standpoint, but
also an excellent balance between computational demand and correlation accuracy.

Figure B.19: Square window and Gaussian-weighted window (σ is 0.75 the
window radius), SAD correlation accuracy comparison.
[Plot: correct disparities (%) versus window size (pixels); curves for Square Window and Gaussian Weighted Window.]

B.3.3 Sparse Correlation Windows

Another unnecessary assumption that has generally been made by the stereo vision community
is the inclusion of all pixels within the correlation window for the computation of the similarity
measure. In some implementations, we can reduce the amount of computation required to evaluate
the similarity measure by including only a subset of the pixels from the correlation window in the
computation. The effects of a sparse correlation window on correlation accuracy and on
hardware resource requirements have not previously been studied.
On general-purpose computers, the use of an arbitrary correlation window may actually
increase the execution time of the similarity measure computation, depending on the correlation
window chosen. This is due to the increased overhead imposed by the requirement to selectively
read only a specific subset of pixels. The performance is also influenced by the summing optimizations employed, which are introduced in Section A.7. Correlation window subsets must be
carefully chosen or they will not be compatible with the window summing optimization.


These effects would seem to be the most likely reason why such correlation windows have
not been studied in the literature. However, custom hardware implementations do not have the
same constraints as general-purpose computers, allowing us to easily achieve the same level of performance regardless of the correlation window chosen. When combined with a window summing
optimization in a custom hardware implementation, use of a sparse correlation window allows us
to balance the trade-off between memory and computation resource requirements.
To test the effect of a sparse correlation window, we will study its effect on SAD, the most
common correlation method for real-time implementations. We can redefine Equation B.14, the
SAD similarity measure, to sum over an arbitrary subset of the window, Ŵ , rather than the entire
window W .
For these experiments, various sparse window configurations were tested. One obvious
choice is to include only about 50% of the correlation window using a checkerboard pattern. This
and other sparse window configurations are shown in Figure B.20.
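
One way to express the sparse SAD measure, which sums over a subset Ŵ of the window, is with a boolean mask. The sketch below is illustrative only: the function names are hypothetical, and the checkerboard parity chosen here is an assumption that may differ from the pattern shown in Figure B.20.

```python
def checkerboard_mask(n):
    """Boolean n x n mask that keeps every other pixel (about 50% cover)."""
    return [[(y + x) % 2 == 0 for x in range(n)] for y in range(n)]

def sparse_sad(w1, w2, mask):
    """SAD summed only over the window positions where the mask is True."""
    return sum(abs(a - b)
               for r1, r2, rm in zip(w1, w2, mask)
               for a, b, keep in zip(r1, r2, rm) if keep)

mask = checkerboard_mask(13)
kept = sum(sum(row) for row in mask)  # 85 of the 169 pixels are included
```

With a mask that is True everywhere, `sparse_sad` reduces to the ordinary SAD of Equation B.14, so the same routine covers both the normal and the sparse windows.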

Figure B.20: Sparse 13 × 13 correlation windows of various densities. Only the darkened
pixels are included in the similarity measure.
[Panels: Normal Window (100%), 50% Cover, 34% Cover, 27% Cover.]

The SAD correlation accuracy when using the 50% sparse correlation window of Figure B.20 is shown in Figure B.21(b). Compared to classical SAD (Figure B.21(a)), there is relatively little difference in correlation accuracy for reasonably large window sizes.

Figure B.21: Normal and 50% sparse SAD window correlation accuracy comparison.
[Panels: (a) Normal Window, (b) 50% Sparse Window. Each plots correct disparities (%) versus window size (pixels); curves for Tsukuba, Venus, Teddy, Cones, and Average.]

Figure B.22 overlays the average correlation accuracy for SAD using a normal window
with the average accuracy of the 50% sparse window. As shown in the figure, the smaller the
window size, the more detrimental the effect of the sparse window on correlation accuracy. This is
due to the small number of pixels that the sparse window uses for the computation of the similarity
measure. However, as the window size increases, the correlation accuracy of the sparse window
approaches that of the full window. For the optimal window size of 13 × 13, the accuracy of the
sparse window is only 0.42% lower than the full window.

Table B.4: Average Correlation Accuracy for Sparse 13 × 13 SAD

Window       | Average Accuracy | Degradation
Full (100%)  | 85.72%           | -
50% Cover    | 85.36%           | 0.42%
34% Cover    | 84.82%           | 1.05%
27% Cover    | 84.44%           | 1.50%

Table B.4 shows the correlation accuracy for all the sparse windows of Figure B.20. As the
correlation window becomes more sparse, the loss in accuracy is increased.
There is a large number of other sparse windows that could be considered, besides those
of Figure B.20. The higher the density of the pixel distribution in the sparse window, the more
closely the correlation accuracy matches the normal window. The lower the density, the larger the
window needs to be in order to make up for the reduction in density. Thus, the selection of a sparse
window allows us to balance the computational load of the similarity measure with the accuracy
of the correlation.
Figure B.22: Classical and sparse window comparison. The average accuracy
for the datasets for both window types is shown.
[Plot: correct disparities (%) versus window size (pixels); curves for Normal Window and Sparse Window.]

In the simplest implementation, the number of addition, subtraction, and absolute value
operations for the SAD similarity metric is proportional to the number of pixels in the window
(N²). The 50% sparse window of Figure B.20 effectively reduces the number of pixels in the
window by half, requiring approximately 50% less computation to evaluate. In a custom hardware
implementation, this could result in a roughly 50% reduction in the hardware resources required
for the implementation of the similarity measure. Even more dramatic area savings could occur
if a more sparse correlation window is used, at the expense of a further decrease in correlation
accuracy.
In a more optimized implementation, many of the computations for one SAD window
can be reused in the computations for subsequent windows because of the overlap between the
correlation windows of adjacent pixels. This optimization, and some variations, is described in
more detail in Section A.7. Unfortunately, not all variations of the window summing optimization
are compatible with all sparse window configurations. As a result, the reduction of hardware
resources required for a custom hardware implementation will depend on the nature of the window
summing optimization being employed.

B.3.4 Reduced Image Data Width

Another simple optimization that can be made is to reduce the number of bits used to
represent the images. Such is the approach taken by Kanade et al. in the development of their
stereo vision machine [62]. In a general-purpose computer, bit width reduction below eight bits is
rarely helpful, since a single eight-bit byte is normally the smallest data type on which a computer
can operate. For custom hardware, however, arbitrary data widths are the norm. When a reduction
in data width is possible, it generally results in a proportional decrease in hardware resources as
well as a decrease in critical path timing for the circuit.
For our four stereo image datasets, the input images are high quality, 8-bit images. Figure B.23 shows the average accuracy of SAD correlation for the stereo datasets when the input
stereo image pairs are reduced to data widths ranging from 4 to 8 bits. In the figure, we see that
the reduction to 7-bit images has a very small effect on correlation accuracy. The reduction to
6-bit images also incurs a relatively small penalty, especially for large window sizes. Reductions
beyond this can be detrimental to the quality of the similarity measure. We can compensate for the
loss in precision somewhat through the use of a larger window size, although this can increase the
computational or resource costs, defeating the purpose of the data-width reduction.
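
As a simple illustration of data width reduction, the sketch below discards the least significant bits of each 8-bit pixel. Truncation by right shift is an assumption made here for illustration; the text does not state how the reduced-width images were produced, and the function name is hypothetical.

```python
def reduce_bits(image, bits):
    """Reduce 8-bit pixels to `bits` bits by dropping low-order bits."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    shift = 8 - bits
    return [[pixel >> shift for pixel in row] for row in image]

row_8bit = [[255, 128, 7]]
row_4bit = reduce_bits(row_8bit, 4)  # [[15, 8, 0]]
```

In hardware, this kind of truncation costs nothing, since it amounts to leaving the low-order wires unconnected.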
Such data width reduction may be more appropriate after initial preprocessing, as is done
by Kanade et al. [62], who reported that the reduction from 8 to 4 bits on the LoG output makes little
difference to the resulting disparity maps, although they do not provide quantitative evidence for
this assertion.
Other preprocessing methods have an inherent ability to reduce the data width. As discussed in Section B.2.4, the rank transform generally produces values requiring a lower data width
than that of an 8-bit image. For example, the 7 × 7 rank transform, shown to be the most effective
on our datasets in Section B.2.4, results in a 6-bit data size. In a custom hardware implementation,
such reductions in data width result in significant hardware resource savings.
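
The 6-bit figure for the 7 × 7 rank transform follows from the range of values it can produce. The sketch below assumes the common definition of the rank transform, in which each pixel is replaced by the number of pixels in its neighborhood that are darker than it; the function names are hypothetical.

```python
def rank_value(window):
    """Rank of the center pixel: how many window pixels are less than it."""
    n = len(window)
    center = window[n // 2][n // 2]
    return sum(1 for row in window for pixel in row if pixel < center)

def rank_bits(n):
    """Bits needed to hold any rank from an n x n window (maximum is n*n - 1)."""
    return (n * n - 1).bit_length()

bits_7x7 = rank_bits(7)  # ranks range from 0 to 48, which fits in 6 bits
```

Under this definition the center pixel never counts itself, so the largest possible rank for a 7 × 7 window is 48.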

Figure B.23: Average correlation accuracy for 13 × 13 SAD
with reduced pixel data width.
[Plot: correct disparities (%) versus window size (pixels); curves for 8-bit, 7-bit, 6-bit, 5-bit, and 4-bit images.]

B.3.5 Multiple Window Methods

In Section 2.4, the adaptive window (AW) method [43], the symmetric multiple window
(SMW) method [44], and the multiple supporting window (MSW) method [45] were introduced.
All of these are correlation methods that use multiple windows combined with standard similarity
measures in order to achieve improved correlation results. The central idea of the multiple-window
methods is to choose a superior correlation window, or set of windows, near disparity discontinuities. This improves the accuracy of correlation for the pixels that lie near object boundaries,
which are troublesome for conventional correlation methods. The AW method is too computationally intensive for a compact, real-time implementation. Moreover, it has been shown to be inferior
to other simpler multiple-window methods, such as SMW [44], and hence will not be considered
in this section. The performance of the SMW method will be discussed first.


Symmetric Multiple Windows
For these experiments, the standard, nine-window configuration of Figure 2.1 is used, as
proposed by Fusiello. However, we will use the SAD similarity metric, since it has been shown
to provide results that are generally as good or better than SSD, which was used in the original
implementation. For the multiple-window methods, we must also distinguish between the window
size and the subwindow size. For the SMW correlation method, given a subwindow size of N_S × N_S,
the effective overall window size, N × N, is given by
$$N = 2 N_S - 1. \tag{B.18}$$

It is important to keep this relationship in mind when comparing window sizes and the computational complexity of this multiple-window method with conventional, single-window methods.
Figure B.24 shows the performance of the SMW method for the benchmark datasets for a
variety of SMW subwindow sizes.

Figure B.24: Correlation accuracy versus SMW subwindow size.
[Plot: correct disparities (%) versus subwindow size (pixels); curves for Tsukuba, Venus, Teddy, Cones, and Average.]


The best performance achieved by SMW for the datasets is slightly higher than that of
SAD alone (Figure B.13). The improved accuracy is due to the SMW method’s inherent ability to
choose a better window rather than simply relying on a centered window. Unfortunately there are
significant disadvantages to the SMW method. In order to achieve the highest performance, the
size of the subwindow used must be relatively large compared to the window size of SAD (e.g.,
17 × 17 compared to 13 × 13). Additionally, when using SMW, the similarity measure must be
computed for a subwindow of this larger size nine times, instead of just once. This makes the naive
implementation of the SMW method at least nine times more expensive than SAD alone.
One possible optimization would be to note the overlap that occurs between windows.
We can then compute the absolute difference between each pair of pixels once and combine the
absolute differences to form the sums needed for each subwindow. In a custom hardware implementation, such an optimization can be implemented at the expense of only additional routing,
significantly reducing the number of absolute difference hardware components. This effectively
reduces the computational load from computing the SAD similarity measure for nine N_S × N_S
windows to computing SAD for one (2N_S − 1) × (2N_S − 1) window.
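
The sharing of absolute differences between subwindows can be sketched as follows. The layout is an assumption chosen to be consistent with Equation B.18: the nine subwindows are centered at offsets of −r, 0, and +r from the pixel of interest in each direction, where r = (N_S − 1)/2. The function names are hypothetical.

```python
def smw_subwindow_sads(abs_diff, ns):
    """Nine subwindow SADs formed from one shared grid of absolute differences.

    abs_diff is a (2*ns - 1) x (2*ns - 1) grid of |I1 - I2| values centered on
    the pixel of interest, and ns is the odd subwindow size. Each absolute
    difference is computed once by the caller and reused by every subwindow
    that overlaps it.
    """
    r = (ns - 1) // 2
    sums = {}
    for dy in (-r, 0, r):
        for dx in (-r, 0, r):
            top, left = r + dy, r + dx  # top-left corner inside the shared grid
            sums[(dy, dx)] = sum(abs_diff[y][x]
                                 for y in range(top, top + ns)
                                 for x in range(left, left + ns))
    return sums

def smw_similarity(abs_diff, ns):
    """Original SMW selection: keep the best (smallest) subwindow SAD."""
    return min(smw_subwindow_sads(abs_diff, ns).values())
```

A software version like this still sums each subwindow separately; the saving described above comes from computing the absolute differences only once, which in hardware removes most of the absolute difference components.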
Another deficiency of the SMW method is that all windows are considered equal and only
one will actually be chosen and used for its SAD similarity measure. As it turns out, the center
window is more important and should always be included somehow or weighted more heavily in
the correlation measure.
To demonstrate the importance of the center window, I propose a modified version of the
SMW method. In this method, the SAD similarity for the center window is always included and
the smallest SAD measure of the remaining eight windows is added to give the final similarity
measure. Figure B.25 shows the accuracy of this modified SMW method.
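
The difference between the original and modified selection rules can be sketched as follows, given the nine subwindow SAD values. The dictionary layout, keyed by subwindow offset with (0, 0) denoting the center window, is a hypothetical convention adopted only for this example.

```python
def original_smw(sads):
    """Original SMW: the single best (smallest) SAD of all nine windows."""
    return min(sads.values())

def modified_smw(sads):
    """Modified SMW: center SAD plus the best SAD of the other eight windows."""
    center = sads[(0, 0)]
    best_outer = min(v for key, v in sads.items() if key != (0, 0))
    return center + best_outer

example = {(dy, dx): 40 for dy in (-1, 0, 1) for dx in (-1, 0, 1)}
example[(0, 0)] = 50    # center window
example[(-1, 1)] = 10   # one outer window matches much better than the rest
```

For this example the original rule returns 10 and ignores the center window entirely, whereas the modified rule returns 60, so a poor match at the center still penalizes the candidate disparity.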
The average peak performance of the original SMW is slightly higher than the proposed
modified SMW. However, the modified version achieves its performance with a much smaller
subwindow size—13 instead of 17.

Figure B.25: Correlation accuracy versus modified SMW subwindow size. The
similarity of the center window is always included.
[Plot: correct disparities (%) versus subwindow size (pixels); curves for Tsukuba, Venus, Teddy, Cones, and Average.]

The modified version has other desirable qualities. A qualitative comparison of the original SMW and modified SMW methods reveals that the modified version tends to result in much
smoother, more natural disparity maps. The resulting disparity maps for the original and modified
methods for the Tsukuba dataset are shown in Figure B.26. Notice the speckled pattern that appears
throughout the disparity map produced by the original SMW implementation (Figure B.26(a)). The
same effect is observable in the images of the original paper on the SMW method [44].
This pattern is the result of non-gross errors in the disparity map caused by a lack of emphasis on pixels near the center of the overall window. If we consider all errors in the image, rather
than just gross errors, the original method actually has an 11.5% higher error rate than the modified
method. The average accuracy for the two methods, when non-gross errors are included, is compared in Figure B.27. Clearly the modified version has far fewer errors overall than the original
SMW method.

Figure B.26: SMW disparity map qualitative comparison.
[Panels: (a) Original SMW, (b) Modified SMW.]

Figure B.27: Original and modified SMW accuracy comparison. The percentages shown include all errors, rather than just gross errors.
[Plot: correct disparities (%) versus subwindow size (pixels); curves for Modified SMW and Original SMW.]

Although not proposed in the original SMW description, we can further enhance the performance of the SMW method with preprocessing, such as the LoG filter or rank transform.
Figures B.28 and B.29 show the accuracy of the modified SMW method when a LoG filter
(σ = 1.0) and the rank transform are used as preprocessors, respectively. Both preprocessing
operations improve the results of correlation, with more significant improvement provided by the
rank transform. Even more significant is the effect that the preprocessors have on optimal window
size. The average optimal window size for our image datasets is reduced from 13, for the modified
SMW method without preprocessing, to 9 when either the LoG filter or rank transform is used.
This is important because the increased computational cost of the preprocessing can be more than
offset by the reduced computational cost of correlation with the smaller window size. This can be
seen by comparing the computational complexity of a general N × N filter for an M × M image,
O(M²N²), with the computational complexity of correlation with an N × N window and maximum
disparity search d, O(M²N²d).
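
To make the comparison concrete, the two complexity expressions can be evaluated for sample values. This is a rough sketch only: it ignores constant factors and the multiple-window overhead, and the image size and disparity range used below are hypothetical values that are not taken from the experiments. The window sizes of 13 and 9 and the 7 × 7 filter size are those mentioned in the text.

```python
def filter_cost(m, n):
    """Order-of-magnitude cost of an n x n filter over an m x m image."""
    return m * m * n * n

def correlation_cost(m, n, d):
    """Order-of-magnitude cost of n x n correlation over d disparities."""
    return m * m * n * n * d

M = 512  # hypothetical image width and height
D = 32   # hypothetical maximum disparity search

added = filter_cost(M, 7)                                       # preprocessing
saved = correlation_cost(M, 13, D) - correlation_cost(M, 9, D)  # smaller window
```

Under these assumptions the correlation savings exceed the added filtering cost by more than an order of magnitude, because the savings are multiplied by the disparity search range while the filter is applied only once per image.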
Figure B.28: Correlation accuracy versus subwindow size for LoG filter (σ =
1.0) followed by modified SMW.
[Plot: correct disparities (%) versus subwindow size (pixels); curves for Tsukuba, Venus, Teddy, Cones, and Average.]

Figure B.29: Average correlation accuracy versus subwindow size for rank
transform followed by modified SMW. The peak of the surface is marked.
[Surface plot: correct disparities (%) versus rank transform size and SMW subwindow size; the marked peak is 90.17%, where both sizes are 9.]

Multiple Supporting Windows
Following the example of Fusiello et al., Hirschmüller [45] proposed improved multiple
window methods, which shall be referred to as the multiple supporting windows method, or MSW.
The window configurations proposed by Hirschmüller were shown previously in Figure 2.2. In
contrast to the SMW method, Hirschmüller’s method always includes the SAD similarity measure
of the center window and adds to it the similarity measures of half of the remaining windows,
where the half chosen are those with the best SAD score.
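
The selection rule just described can be sketched as follows, given the SAD of the center window and the SAD values of the surrounding windows. The function name and argument layout are hypothetical; only the rule itself (center plus the best half of the remaining windows) is taken from the description above.

```python
def msw_similarity(center_sad, outer_sads):
    """Center SAD plus the best (smallest) half of the surrounding SADs."""
    keep = len(outer_sads) // 2
    return center_sad + sum(sorted(outer_sads)[:keep])

five_window = msw_similarity(10, [5, 1, 7, 3])              # 10 + 1 + 3 = 14
nine_window = msw_similarity(10, [8, 7, 6, 5, 4, 3, 2, 1])  # 10 + 1 + 2 + 3 + 4 = 20
```

With four outer windows the two best are kept, and with eight outer windows the four best are kept, so windows that straddle a disparity discontinuity tend to be excluded from the sum.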
The accuracy of the five and nine-window MSW methods is shown in Figure B.30. The
25-window configuration is not shown, since Hirschmüller showed that it tended to perform less
well than the other two configurations. Already we see that the MSW method performs slightly
better than the SMW method discussed previously and does so with much smaller subwindows.
In general, there seems to be little actual difference between the maximum average accuracy of
the five-window and nine-window configurations when applied to our datasets. The nine-window
configuration achieves a peak accuracy of 87.23%, compared to 87.17% with the five-window
configuration. However, as might be expected, the five-window configuration requires a larger
subwindow size to achieve the same performance as the nine-window configuration.
With the five-window configuration and N_S × N_S subwindows, the overall window size,
N_5 × N_5, is given by
$$N_5 = 2 N_S - 1. \tag{B.19}$$
This equation is a result of the one-pixel overlap between the four outer subwindows. With the
nine-window configuration, which has no subwindow overlap, the overall window size is given by
$$N_9 = 3 N_S. \tag{B.20}$$

For the nine-window configuration, the 9 × 9 subwindow size gives an accuracy that is only
marginally better than the 7 × 7 subwindow, but correlates a much larger window overall (27 × 27
instead of 21 × 21), or 65% more pixels, making it potentially much more expensive. Applied to the
four datasets, the nine-window configuration with a subwindow size of 7 × 7 slightly outperforms
the five-window configuration with a subwindow size of 11 × 11. Both of these configurations
use a total area of 21 × 21 pixels for correlation, although with the five-window configuration
there is significant overlap between the center window and the outer windows. As a result, the
five-window configuration effectively requires the correlation of 5 × 11 × 11 pixels, or 37% more
overall. The computational cost of the five-window method can be reduced, essentially to that of
the nine-window method, if we take advantage of the redundant computations in each window.
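The window sizes and pixel-count percentages quoted above can be checked with a few lines of arithmetic. This is only a sketch; the helper name `msw_window_size` is hypothetical, and the "naive" five-window count assumes each of the five subwindows is correlated independently with no reuse of overlapping sums.

```python
def msw_window_size(subwindow, num_windows):
    """Overall window width for an MSW configuration with a given subwindow width."""
    if num_windows == 5:
        return 2 * subwindow - 1   # Equation B.19
    if num_windows == 9:
        return 3 * subwindow       # Equation B.20
    raise ValueError("only the five- and nine-window configurations are covered")


n9_small = msw_window_size(7, 9)    # 21
n9_large = msw_window_size(9, 9)    # 27
n5 = msw_window_size(11, 5)         # 21

# Relative growth in correlated pixels.
nine_growth = (n9_large ** 2) / (n9_small ** 2) - 1   # 729 / 441 - 1
five_naive = (5 * 11 * 11) / (n5 ** 2) - 1            # 605 / 441 - 1

print(n9_small, n9_large, n5)                              # 21 27 21
print(round(100 * nine_growth), round(100 * five_naive))   # 65 37
```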
In the original paper [45], Hirschmüler recognized the importance of preprocessing prior to
stereo correlation and suggested that the LoG filter be used with this method. Figure B.31 shows
the accuracy of the MSW method when it is preceded by LoG filtering with σ = 1.0.
As with SMW, preprocessing with the LoG filter before applying the MSW method improves the correlation accuracy and reduces the subwindow size required to achieve peak performance. With the LoG filter, the nine-window configuration performs slightly better than the
five-window configuration, requiring a 5 × 5 subwindow to achieve peak accuracy on our datasets,
compared to the optimal 7 × 7 for the five-window configuration. Thus, the overall window size for
the nine-window configuration (15 × 15) is larger than that of the five-window (13 × 13), making
the five-window configuration more computationally competitive with the nine-window when LoG
preprocessing is employed.
Although not proposed by Hirschmüler, we can further improve the performance of the
MSW method with the use of the rank transform instead of the LoG filter as a preprocessor. Figure B.32 shows the performance of the MSW method when preceded by the rank transform. This
improves the correlation accuracy from 89.29% for the LoG→MSW9 combination to 90.39% for
the Rank→MSW9 combination. Unfortunately, this is only marginally better than the Rank→SAD
combination, without any multiple-windowing, which achieves 90.21%. This is due to the inherent ability of the rank transform to deal with disparity discontinuities, which is aided little by the
multiple-window methods that are intended for the same purpose.

Figure B.30: Correlation accuracy versus MSW subwindow size. (a) Five-Window Configuration. (b) Nine-Window Configuration.
[Plots: Correct Disparities (%) versus Subwindow Size (Pixels) for Tsukuba, Venus, Teddy, Cones, and Average.]

Figure B.31: Correlation accuracy versus subwindow size for LoG filter
followed by MSW. (a) Five-Window Configuration. (b) Nine-Window Configuration.
[Plots: Correct Disparities (%) versus Subwindow Size (Pixels) for Tsukuba, Venus, Teddy, Cones, and Average.]

In the example of Figure B.32 we again see that the nine-window configuration achieves
its peak performance with a smaller subwindow size than the five-window configuration (5 × 5
compared to 9 × 9). However, the performance of the five-window configuration is much more
stable across various window sizes, suggesting that it might be a more reliable choice in general
when little is known about the characteristics of the stereo image pairs that the system will be
required to correlate.

General Multiple Window
It is clear from the multiple window methods described thus far that a variety of different
multiple-window configurations could be used for correlation, including many that have not been
proposed in the literature. Such methods differ by the number of subwindows, the spatial arrangement of the windows, the relative sizes of the windows, how windows are included in the final
similarity measure, and the relative weighting of the windows.
This naturally leads to the idea of a more general classification of multiple-window correlation methods and the need to characterize the advantages and disadvantages of different window
configurations. Additionally, there is no reason why these multiple-window methods cannot be
combined with the ideas of non-uniform (Section B.3.2) or sparse windows (Section B.3.3). Such
is the approach taken by Lu, Lafruit, and Catthoor [10], who combined the non-uniform, adaptive window weighting of Yoon and Kweon [99] with a multiple-window approach similar to that
of Fusiello. This particular combination results in a rather computationally expensive approach,
although they achieve relatively high performance on a GPU implementation.
The purpose of this discussion is not to propose superior, multiple-window methods, but
only to highlight a largely unexplored area of stereo vision research. Multiple window methods,
such as the MSW method of Hirschmüler, provide improved correlation accuracy without significantly adding to the computational complexity of the method. This is particularly relevant for
custom hardware implementations, which, unlike general-purpose computers, may not be significantly affected by the additional overhead imposed by selection of the most appropriate window
subset.

Figure B.32: Average correlation accuracy versus subwindow size for rank
transform followed by MSW. The peak of each surface is marked. (a) Five-Window Configuration, marked peak at X: 9, Y: 9, Z: 90.37. (b) Nine-Window Configuration, marked peak at X: 5, Y: 9, Z: 90.39.
[Surface plots: Correct Disparities (%) versus Rank Transform Size and MSW Subwindow Size.]

B.3.6 The Census Transform
The census transform was introduced in Section 2.4 and is defined by Equation 2.3. The
census transform is sometimes thought of as a preprocessing step, like the rank transform, since
it transforms the input images into another form prior to correlation. However, as discussed in
Section 2.4, since the census transform results in bit-vectors instead of scalar values, census-transformed pixels cannot be compared using standard correlation techniques (e.g., absolute difference). Instead, we must use a vector distance metric, such as the Hamming distance. This
interdependence of the preprocessing and correlation method leads many to consider stereo correlation with the census transform as a correlation method, rather than a preprocessing step.
By comparing the equations for the rank transform (Equation 2.1) and census transform
(Equation 2.3), we see that the census essentially contains the same information as the rank. However, the census maintains its information in vector form, whereas the rank transform consolidates
the vector via summation into a single scalar value. This difference is significant because keeping
the rank information in vector form allows the census transform to maintain information about
the order in which the pixel intensities occur, greatly increasing the discriminability of the census
transform.
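The relationship between the two transforms can be seen in a small sketch. Equations 2.1 and 2.3 are not reproduced in this section, so the code assumes the common convention in which each census bit records whether a neighbor is darker than the center pixel, and the rank is the count of such neighbors. The function names and the sample windows are illustrative only.

```python
def census_transform(window):
    """Bit-vector (list of 0/1) for a square window given as a list of rows.

    Assumed convention: a bit is 1 when the neighbor is darker than the center.
    """
    n = len(window)
    c = n // 2
    center = window[c][c]
    return [
        1 if window[r][col] < center else 0
        for r in range(n)
        for col in range(n)
        if (r, col) != (c, c)
    ]


def rank_transform(window):
    """Scalar rank: the census bits consolidated by summation."""
    return sum(census_transform(window))


def hamming(a, b):
    """Number of positions at which two bit-vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Two 3 x 3 neighborhoods with the same center but different spatial structure.
w1 = [[10, 10, 10],
      [90, 50, 90],
      [90, 90, 90]]
w2 = [[90, 90, 90],
      [90, 50, 90],
      [10, 10, 10]]

print(rank_transform(w1), rank_transform(w2))               # 3 3
print(hamming(census_transform(w1), census_transform(w2)))  # 6
```

Both windows receive the same rank, so the rank transform cannot tell them apart, while their census vectors differ in six of eight bits.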
The accuracy of census transform correlation (i.e., the census transform followed by the
sum of Hamming distances, or SHD) is shown in Figure B.33. Figure B.34 shows the cross section
of Figure B.33 for an SHD window size of 13 × 13, and also includes the accuracy for each image
dataset individually.
One important characteristic of the census transform is that, on average, its performance
is remarkably constant for different census transform window sizes, having much more stability
than correlation with the rank transform or other correlation methods. For the four datasets, peak
average performance is achieved with a census transform size of 7 × 7, with an average accuracy of
90.64%, but reduction to a transform size of 5 × 5 reduces the accuracy only to 90.56%. Increasing
the census transform size to 11 × 11 decreases the accuracy only to 90.44%. This stability makes
the census transform far less sensitive to its parameters, providing very good results for a wide
range of image datasets.

Figure B.33: Average correlation accuracy versus census transform size and
SHD window size. The peak of the surface is marked.
[Surface plot: Correct Disparities (%) versus Census Transform Size and SHD Window Size; marked peak at X: 7, Y: 13, Z: 90.64.]

Figure B.34: Correlation accuracy versus census transform size for 13 × 13 SHD.
[Plot: Correct Disparities (%) versus Census Transform Size for Tsukuba, Venus, Teddy, Cones, and Average.]

The more critical factor is the SHD correlation window size, although this is also relatively
stable beyond a window size of 11 × 11. For example, maximum correlation accuracy is achieved
on our datasets with a window size of 13 × 13, providing an accuracy of 90.64%. Increasing the
SHD window size to 21 × 21 lowers the accuracy only to 89.54%, or a reduction of about 1.2%.
This is more stable than traditional SAD correlation, which sees a reduction of about 1.6% from
its optimal window size of 13 × 13 to 21 × 21.
Of all the correlation methods presented thus far, the census transform gives the highest
average accuracy on the four datasets and also delivers the most stable performance over its range
of parameters. Although the transform was first introduced some time ago, it has seen little use
by the stereo vision community. This neglect is due to the poor performance of
SHD correlation when running on general-purpose computers. The most popular general-purpose
computers, based on the Intel architecture, historically have not had an instruction for computing
the Hamming distance. The naive implementation of the Hamming distance calculation has an
execution time proportional to the bit-vector length. This effectively increases the computational
complexity of census correlation from the traditional $O(M^2 N^2 d)$ to $O(M^2 N^2 d h)$, where $h$ is the
length of the census transform bit-vector. An improved version of the Hamming distance calculation algorithm has an execution time proportional to the Hamming distance itself. Other variations may
include the use of lookup tables to avoid most of the computation entirely.
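The three software strategies mentioned above can be sketched as follows. These are generic, well-known bit-counting techniques shown for illustration, assuming census vectors are stored as non-negative integers; the function names are hypothetical and the code is not the implementation evaluated in this work.

```python
def hamming_naive(a, b, bits):
    """Test every bit position: time proportional to the bit-vector length."""
    diff = a ^ b
    return sum((diff >> i) & 1 for i in range(bits))


def hamming_sparse(a, b):
    """Clear the lowest set bit each iteration: time proportional to the distance."""
    diff = a ^ b
    count = 0
    while diff:
        diff &= diff - 1
        count += 1
    return count


# 256-entry table holding the bit count of every byte value.
BYTE_POPCOUNT = [bin(v).count("1") for v in range(256)]


def hamming_lut(a, b):
    """Look up the bit count one byte at a time."""
    diff = a ^ b
    count = 0
    while diff:
        count += BYTE_POPCOUNT[diff & 0xFF]
        diff >>= 8
    return count


a, b = 0b1011_0010_1110_0001, 0b0011_0110_1010_0101
print(hamming_naive(a, b, 16), hamming_sparse(a, b), hamming_lut(a, b))  # 4 4 4
```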
These performance restrictions do not apply to the census transform and SHD correlation
when performed using custom hardware. This is because custom hardware allows us to create
the hardware structures necessary to compute the Hamming distance directly, rather than using
a larger set of more general operators to implement it. This makes the census transform and
SHD correlation much better suited to implementation using custom hardware. Furthermore, a
number of optimizations for the census transform and SHD correlation are proposed in Chapters 4–5, decreasing the hardware resources required.
There may yet be a future for the census transform and SHD correlation on general-purpose
computers. In September of 2006, Intel announced the introduction of the SSE4 instruction
set [100], [101]. This instruction set consists of 54 new instructions for the Intel 64 ISA. These
instructions are subdivided into two subsets. SSE4.1, consisting of 47 instructions, was first introduced in the Intel microarchitecture codenamed Penryn. SSE4.2, consisting of the remaining 7 instructions, first became available in the Nehalem microarchitecture, which was released
in November of 2008. Among the new instructions that became available with SSE4.2 is the
POPCNT (or population count) instruction, which counts the number of bits set in a number up to
64 bits in length. This instruction will dramatically improve the performance of SHD correlation,
allowing each Hamming distance to be computed with just two instructions (XOR→POPCNT).
Strictly speaking, this instruction is not SIMD capable, operating on a single register only. However, it is possible to combine multiple census bit-vectors into a single register. Because SHD requires only the sum of the individual Hamming distances, a single
64-bit register can be used to process four 16-bit census vectors simultaneously, giving a 4x performance improvement. With 8-bit census vectors, an 8x improvement can
be achieved. Small census vectors such as these are introduced in Chapter 4. Thus, the census
correlation method is ready for a new performance evaluation on general-purpose computers since
SSE4.2 is now available.
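The packing idea can be sketched in a few lines. Python has no direct access to the POPCNT instruction, so `popcount` below is only a stand-in for it; `pack16` and the sample vectors are hypothetical. The point illustrated is that one XOR and one population count over a packed 64-bit word yield the same total as four separate 16-bit Hamming distances.

```python
def pack16(vectors):
    """Pack four 16-bit census vectors into one 64-bit word."""
    assert len(vectors) == 4 and all(0 <= v < (1 << 16) for v in vectors)
    word = 0
    for i, v in enumerate(vectors):
        word |= v << (16 * i)
    return word


def popcount(x):
    """Stand-in for the hardware POPCNT instruction."""
    return bin(x).count("1")


left = [0xF0F0, 0x00FF, 0x1234, 0xFFFF]
right = [0xF0F1, 0x0FFF, 0x1234, 0x0000]

packed_sum = popcount(pack16(left) ^ pack16(right))  # one XOR, one popcount
separate_sum = sum(popcount(l ^ r) for l, r in zip(left, right))

print(packed_sum, separate_sum)  # 21 21
```

The packed result is a sum over the four vectors, which is what the SHD accumulation needs; it does not recover the four individual distances.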

B.3.7 Summary of Correlation Measures
In Section B.2, we introduced several common preprocessing methods. The two most
important preprocessing methods are the LoG filter and rank transform. Both of these provide
significant improvement to the correlation accuracy using the SAD similarity metric. Adding to
that section on preprocessing methods, this section has introduced several similarity measures,
as well as computationally efficient windowing methods, and combined them with appropriate
preprocessing methods to demonstrate the combined effect.
Of the classical similarity measures, NCC seems to be superior in general. However, its
computational complexity has led many to use simpler alternatives. The SAD similarity measure is
far simpler, and generally performs better than SSD. It was also shown that SAD performs as well as,
or better than, NCC when noise is present in the input images. The accuracy of SAD combined
with its computational efficiency makes it an ideal choice as a real-time similarity measure. It is for
this reason that SAD has been used in this dissertation as a baseline for performance comparison.
Two variations of standard area-based correlation methods were proposed, including non-uniform correlation windows and sparse correlation windows. The former was intended to improve
the correlation accuracy, but showed only negligible improvements relative to standard square
correlation windows, further validating the use of a square, uniformly-weighted window for correlation. The latter is an optimization intended to increase the computational efficiency of the
correlation, without significantly penalizing correlation accuracy. It was found that reducing the
density of the correlation window by as much as 50% had only a minor effect on the correlation
accuracy. Although this optimization is not very promising for implementation on general-purpose
computers, due to the added overhead of computing a sparse similarity measure, it could reduce
the hardware resources required for some custom logic implementations.
The computational efficiency of standard correlation methods can be further improved on
custom hardware through the use of reduced image data widths. With a small reduction in data
width (e.g., 8-bit to 7-bit), the effect on correlation accuracy is essentially negligible for sufficiently
large window sizes.
We can further improve upon the correlation accuracy of SAD by applying multiple window methods, such as AW, SMW, and MSW. These window methods seek to improve the correlation accuracy near disparity discontinuities, which are a common source of errors in area-based
stereo vision. The SMW method as well as a proposed modified version of SMW were evaluated,
with the original SMW having slightly fewer gross errors, but the modified SMW having far
fewer errors overall. I also proposed the combination of the SMW methods with the LoG preprocessing filter and the rank transform, improving correlation accuracy by about 2.7% and 4.8%,
respectively. These proposed combinations also reduced the optimal subwindow size from 13 to 9,
dramatically decreasing the requirements of the SMW method.
The MSW method, however, delivers higher accuracy and is much more computationally
efficient, making the other multiple-window methods largely irrelevant. Two variations of the
MSW method were shown, which provide similar accuracy, but with the nine-window configuration usually performing marginally better. Computational complexity is also generally similar for
both window configurations, depending on the window sizes chosen. The combination of the rank
transform with the MSW method was also proposed, further improving the accuracy of the MSW
method. Interestingly, although the MSW method provides significant improvement to standard
SAD and the LoG→SAD combination, it provides relatively little improvement to Rank→SAD.
This is due to the rank transform’s inherent immunity to disparity discontinuities as compared to
standard SAD.

Table B.5: Correlation Method Average Accuracy Summary

Method        Optimal Parameters                        Accuracy (%)
SAD           13 × 13 Window                            85.72
SMW           33 × 33 Window (17 × 17 Subwindow)        86.70
ModSMW        25 × 25 Window (13 × 13 Subwindow)        86.02
MSW5          21 × 21 Window (11 × 11 Subwindow)        87.17
MSW9          21 × 21 Window (7 × 7 Subwindow)          87.23
LoG→SAD       11 × 11 Window, 7 × 7 LoG, σ ≈ 1.0        87.63
LoG→ModSMW    13 × 13 Window (7 × 7 Sub.), σ = 1.0      88.38
LoG→MSW5      13 × 13 Window (7 × 7 Sub.), σ = 1.0      89.11
LoG→MSW9      15 × 15 Window (5 × 5 Sub.), σ = 1.0      89.81
Rank→SAD      17 × 17 Window, 7 × 7 Rank                90.21
Rank→MSW5     17 × 17 Window (9 × 9 Subwindow)          90.37
Rank→MSW9     15 × 15 Window (5 × 5 Subwindow)          90.39
Census→SHD    13 × 13 Window, 7 × 7 Census              90.64

The census transform was also evaluated in detail, and showed the best correlation results
overall. Unfortunately, the census transform can have a relatively high computational cost, particularly when employed on computing platforms that do not have direct support for computing
the Hamming distance. Therefore, it would be particularly advantageous to reduce the hardware
requirements for the implementation of the census method in order to make it more efficient and
cost effective. Such optimizations are discussed in Chapters 4–5.
The average correlation accuracy of the various correlation methods studied is shown in
Table B.5. Only the parameters giving the best results on the four datasets are shown. The methods
are listed in ascending order of correlation accuracy. Variations involving optimizations for the
purpose of reducing computational cost at the expense of accuracy are not shown, but can be
inferred from the data elsewhere in the chapter.

B.4 Post-Processing Methods
Several post-processing methods have been proposed to improve the accuracy of the disparity
map produced by stereo correlation. Generally speaking, the purpose of most stereo vision
post-processing is to compute measures that estimate the validity of each stereo match, so that
disparity estimates with a low probability of being correct can be identified and eliminated. These
validity measures may attempt to identify invalid disparity estimates in a variety of ways, including
the following [42]:
• Left-right consistency checking. This involves reversing the roles of the stereo cameras
to generate a new disparity map and identifying disparity estimates that are inconsistent
between the two maps. This technique was introduced in Section 2.4.
• Identification of textureless areas. Textureless areas often occur in images as regions of
fairly constant color, such as shaded regions or smooth, uniformly-colored objects. The lack
of identifiable texture makes pixels in these regions particularly difficult to match. An error
filter may be designed to identify these areas of low texture and discard matches that fall
within these regions.
• Insufficient value of the match score. Similarity measures can be thresholded in a variety of
ways in order to throw out matches with a low similarity.
• Identification of locally anomalous disparities. This involves the identification of disparity
estimates that vary significantly from their immediate neighbors.
• Identification of ambiguous matches. When correlating a given pixel, the two most likely
candidates sometimes have similarity scores that are numerically close. In such a case, it is
not clear which match is actually the correct one. These ambiguous matches can be rejected,
reducing the chance of an incorrect match.
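Two of the validity measures listed above, the match-score threshold and the ambiguity test, can be sketched for a single pixel. The sketch assumes a dissimilarity score such as SAD, where lower is better; the function name `select_disparity` and the thresholds `max_cost` and `min_margin` are hypothetical values chosen for illustration, not parameters from the literature cited here.

```python
def select_disparity(costs, max_cost=None, min_margin=0):
    """Return the best disparity index, or None if the match is rejected.

    costs[d] is a dissimilarity score (lower is better) for candidate disparity d.
    max_cost and min_margin are illustrative thresholds.
    """
    order = sorted(range(len(costs)), key=lambda d: costs[d])
    best = order[0]
    if max_cost is not None and costs[best] > max_cost:
        return None  # insufficient match score
    if len(order) > 1 and costs[order[1]] - costs[best] < min_margin:
        return None  # two best candidates are numerically close
    return best


print(select_disparity([40, 12, 35, 50], max_cost=30, min_margin=5))  # 1
print(select_disparity([40, 12, 14, 50], max_cost=30, min_margin=5))  # None
print(select_disparity([40, 33, 45, 50], max_cost=30, min_margin=5))  # None
```

The second call is rejected as ambiguous and the third for an insufficient score.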
These error reduction concepts can often be applied as independent post-processing steps
that occur after correlation is complete. Also common is the integration of such error reduction
processing with the correlation step. This often allows erroneous matches to be identified and
eliminated earlier while potentially taking advantage of the image data already needed for the
correlation step, reducing bandwidth and resource requirements for the implementation.
A variety of error reduction methods have been proposed in the literature. Of all the current
error reduction methods, the left-right check has been repeatedly identified as a superior error
filter [41], [42], [45]. Despite its seemingly high cost, it has also been proposed as a robust error
filter suitable for real-time stereo vision implementations [38], [93]. Fortunately, the computational
cost can be reduced by reusing data computed in the correlation step.
We will first evaluate to what extent the left-right consistency check (henceforth, LRCC)
improves correspondence by comparing implementations of SAD with and without the check.
Figure B.35 shows the correlation accuracy (i.e., percentage correct of the assigned disparities, or
PCA, as defined in Section B.1.3) for the images of our dataset with and without LRCC in separate
plots, side by side. We immediately see that there is a dramatic improvement in the effective
accuracy of the correlation with LRCC, since it removes a significant portion of the incorrect
matches. Additionally, like many stereo correlation improvements, the LRCC reduces
the need for a large correlation window, allowing peak performance to be achieved with a smaller
window. For example, we see that the optimal window size is reduced from 13 × 13 to 9 × 9.
Given that the computational complexity of stereo correlation is $O(M^2 N^2 d)$ with an $N \times N$ window
size, such a reduction can more than compensate for the additional complexity incurred by the
consistency check. This is because, ideally, the LRCC requires no additional computation but only
comparison of the existing results, having complexity $O(M^2 d)$.

Figure B.35: SAD with and without left-right consistency check, correlation
accuracy (PCA) comparison. (a) SAD. (b) SAD with LRCC.
[Plots: Correct Disparities (%) versus Window Size (Pixels) for Tsukuba, Venus, Teddy, Cones, and Average.]


SAD, by itself, is one of the worst performing correlation methods, so we would expect to
see a lot of rejected pixels in the final disparity map. Figure B.36 allows for a qualitative comparison of the SAD disparity results for the Teddy dataset with and without LRCC. In Figure B.36(b),
disparity matches invalidated by the LRCC have been marked in red.
One of the LRCC’s strengths is its ability to eliminate matches for regions which are occluded in one of the input images. In Figure B.36(b) we see red borders along object boundaries,
indicating that the regions occluded in one image were rejected. There is also a large band of
rejected pixels along the left edge of the image. This is due to the points along the left edge of the
left input image that were not visible in the right camera view. We also see a high rate of rejection
along the roof of the birdhouse, which SAD correlation has difficulty matching. Finally, we see a
small number of rejections at points in the image where the disparity value changes incrementally.
These rejections are due primarily to the low discriminability of adjacent pixels. It is possible, and
perhaps desirable for some applications, to not filter these latter mismatches simply by modifying
the LRCC so that it rejects disparity estimates only if they disagree by more than one pixel between the two views. However, this optimization comes at the expense of other classes of incorrect
matches then being accepted, slightly lowering the overall accuracy of LRCC.
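The check, including the relaxed one-pixel variant, can be sketched for a single scanline. This is a minimal illustration and not the implementation evaluated here: it assumes that disparity d pairs left pixel x with right pixel x − d, uses a 1-D SAD window for brevity, and marks rejected pixels with `None`. The names `sad_costs`, `best_disparity`, `lrcc`, and `tolerance` are hypothetical. Both disparity estimates are read from the same cost table, so the check adds comparisons but no new window sums.

```python
def sad_costs(left, right, max_disp, radius=1):
    """cost[x][d]: SAD between 1-D windows at left[x] and right[x - d] (None if out of range)."""
    n = len(left)
    costs = [[None] * (max_disp + 1) for _ in range(n)]
    for x in range(radius, n - radius):
        for d in range(max_disp + 1):
            if x - d - radius < 0:
                break
            costs[x][d] = sum(abs(left[x + k] - right[x - d + k])
                              for k in range(-radius, radius + 1))
    return costs


def best_disparity(candidates):
    """Disparity with the lowest cost among (d, cost) pairs, or None if none are valid."""
    valid = [(c, d) for d, c in candidates if c is not None]
    return min(valid)[1] if valid else None


def lrcc(costs, max_disp, tolerance=0):
    """Keep a left-referenced disparity only if the right-referenced one agrees."""
    n = len(costs)
    # Left-referenced: search along a row of the cost table.
    d_left = [best_disparity(enumerate(costs[x])) for x in range(n)]
    # Right-referenced: right pixel xr pairs with left pixel xr + d (a diagonal).
    d_right = [best_disparity((d, costs[xr + d][d])
                              for d in range(max_disp + 1) if xr + d < n)
               for xr in range(n)]
    result = []
    for x, d in enumerate(d_left):
        back = d_right[x - d] if d is not None else None
        ok = back is not None and abs(back - d) <= tolerance
        result.append(d if ok else None)
    return result


# The left scanline is the right scanline shifted by two pixels.
right = [3, 8, 1, 9, 4, 7, 2, 6, 5, 0]
left = [0, 0] + right[:-2]
print(lrcc(sad_costs(left, right, max_disp=3), max_disp=3))
# [None, None, None, 2, 2, 2, 2, 2, 2, None]
```

In this toy example the rejected pixels on the left are those with no counterpart in the right scanline, and the first and last entries are `None` because the window does not fit there. Calling `lrcc(..., tolerance=1)` accepts estimates that disagree by at most one pixel.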

Figure B.36: Teddy dataset SAD disparity map with and without LRCC. Invalidated matches are highlighted in red. (a) SAD Disparity Map. (b) SAD with LRCC Disparity Map.

The dramatically improved accuracy of correlation with LRCC comes at the expense of a
lower disparity estimate density (DED) and an increase in computational cost. Additionally, some
of the rejected disparity estimates were in fact correct, lowering the raw number of correct disparity
matches, and further decreasing the DED.
Figure B.37 shows the average accuracy of SAD, the average accuracy of SAD with LRCC,
and the disparity estimate density (DED) on the same plot. As the correlation window size is increased, the accuracy with LRCC improves then begins to slowly decrease, whereas the DED tends
to continually increase. Therefore, we must inevitably choose a balance between the correlation
accuracy and the density of the disparity map.
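The trade-off between the two measures can be made concrete with a small sketch. The exact definitions from Section B.1.3 are not reproduced in this part of the text, so the code assumes that PCA is the percentage of assigned disparities within `max_error` pixels of the ground truth and that DED is the percentage of pixels that retain an estimate. The function name, the threshold, and the sample data are hypothetical.

```python
def pca_and_ded(estimates, truth, max_error=1):
    """Return (PCA, DED) in percent; None in estimates marks a rejected match.

    max_error is an assumed correctness threshold in pixels.
    """
    assigned = [(e, t) for e, t in zip(estimates, truth) if e is not None]
    if not assigned:
        return 0.0, 0.0
    correct = sum(abs(e - t) <= max_error for e, t in assigned)
    pca = 100.0 * correct / len(assigned)
    ded = 100.0 * len(assigned) / len(estimates)
    return pca, ded


truth = [4, 4, 4, 4, 9, 9, 9, 9, 9, 9]
raw = [4, 4, 7, 4, 9, 2, 9, 9, 9, 9]                    # no filtering
with_filter = [4, 4, None, 4, 9, None, 9, None, 9, 9]   # two errors and one correct match rejected

print(pca_and_ded(raw, truth))          # (80.0, 100.0)
print(pca_and_ded(with_filter, truth))  # (100.0, 70.0)
```

The filtered result has a higher PCA but a lower DED, and part of the lost density comes from a rejected estimate that was in fact correct.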

Figure B.37: Original SAD and SAD with left-right consistency check, average
correlation accuracy and disparity estimate density.
[Plot: Percentage (%) versus Window Size for PCA, PCA w/LRCC, and DED w/LRCC.]

We can further improve upon these results by combining the LRCC with the various other
methods studied in this work. The results for several promising correlation methods are shown in
Figure B.38. Due to the relative similarity of the results in these figures, the best results for each
method are shown in Table B.6.

Figure B.38: Comparison of LRCC effectiveness for various correlation methods and their window sizes. (a) LoG(σ = 1.0)→SAD. (b) LoG(σ = 1.0)→MSW9. (c) Rank(7 × 7)→SAD. (d) Rank(7 × 7)→MSW9. (e) Census(9 × 9).
[Plots: Percentage (%) versus Window Size or Subwindow Size for PCA, PCA w/LRCC, and DED w/LRCC.]

Table B.6: Comparison of LRCC Effectiveness for Various Methods

Method       Optimal Parameters              PCA (%)    DED (%)
LoG→SAD      7 × 7 Window, σ = 1.0           93.90      85.08
LoG→MSW9     15 × 15 Window, σ = 1.0         95.35      89.02
Rank→SAD     11 × 11 Window, 7 × 7 Rank      94.91      89.12
Rank→MSW9    15 × 15 Window, 5 × 5 Rank      94.78      90.13
Census       7 × 7 Window, 9 × 9 Census      95.20      88.79

Surprisingly, when the LRCC is added, LoG→MSW9 gives the highest peak correlation
accuracy, even though it does not perform as well as the rank or census correlation methods without LRCC. However, its superiority is marginal at best. More importantly, the performance of
LoG→MSW9 varies significantly with the parameters and image datasets used, making the combination less robust in general. Both census and Rank→SAD provide very good results and show
very good stability. From a computational efficiency standpoint, the rank and census methods are
also superior since they require much smaller correlation window sizes.
One noteworthy characteristic of good error filters is that they tend to equalize the overall
correlation accuracy of different methods by eliminating most of the errors. As a result, we see
that nearly all of the correlation methods of Figure B.38 have a similar peak PCA when LRCC is
used. However, methods that correlate more poorly will necessarily have more errors removed by
the error filter to achieve this accuracy, resulting in a lower DED. Thus we see that the methods
with the highest correlation accuracy prior to the LRCC result in higher DED curves on average.
Even when the best correlation methods suitable for real-time implementation are combined with LRCC, about 5% of the disparity matches are still incorrect in the resulting disparity
map. Since disparity errors can be detrimental to the system using the disparity estimates, it is
important to consider the nature of the errors that do remain. Fortunately, it turns out that most of
the remaining errors in the four datasets are not critical.
For example, Figure B.39(b) shows the gross errors that remain in the Teddy image, which
is the most challenging of the four datasets, after 9 × 9 census, 7 × 7 SHD, and LRCC have been
performed. The vast majority of the errors that remain occur at the disparity discontinuities. By
comparing with Figure B.39(a), we can see that for these border errors the correlation has mistakenly
assigned a pixel to the wrong side of the boundary, giving it the disparity of the adjacent foreground or background object. Generally, such errors are not
significant since they only expand or shrink the foreground object by a small amount. The only
significant errors that remain in the Teddy disparity map are those that occur for the chart in the
background behind the teddy bear. These errors are due to the highly repetitive texture of the
chart, or the textureless regions of the chart, which cause disparity ambiguities. The addition of
another error filter which identifies and removes these ambiguous matches may serve to improve
the accuracy of the disparity map even further.

Figure B.39: Teddy image dataset disparity images after running LRCC. Correlation
performed using 9 × 9 census and 7 × 7 SHD. (a) Disparity image. (b) Disparity
image with gross errors highlighted in red.

APPENDIX C. STEREO VISION COMPONENTS

Chapter 6 introduced the hardware implementation of the heart of the stereo system, which
is the correlation engine. This chapter provides additional details on the hardware design of specific
components of the stereo system, some of which are optional. Section C.1 focuses on image rectification, which is commonly used to precisely align the images before passing them to the stereo
system. Section C.2 describes how neighborhood operations (required for most preprocessing filters) can be efficiently implemented in custom hardware. Section C.3 discusses the preprocessing
architecture and describes some of the hardware resource implications. Sections C.4 and C.5 introduce how the left-right consistency check (LRCC) can be implemented in custom hardware and
how it can be efficiently pipelined. Finally, Section C.6 describes how the stereo vision core can
fit within an entire computer system, such as the Helios Robotic Vision Platform.

C.1 Image Rectification
One important step present in many stereo vision implementations is image rectification,
which was introduced in Section A.5. Generally considered a preprocessing step, the rectification
of the input images has been, for the most part, assumed throughout this work. It has not been described in detail since it is described in depth in several texts (e.g., [90], [92]) and several examples
of hardware implementation can be found in the literature (e.g., [76], [102]).
The rectification processing hardware generally requires a relatively large image buffer, the
size of which depends on the degree of misalignment between cameras and the characteristics of
each camera. An external SRAM, such as the one found on Helios, is well suited to this purpose. It
may also be possible to use on-chip RAM, depending on the rectification buffer size requirements
for the stereo camera configuration.
The quality of the rectification and the resources required vary with the implementation.
For example, some implementations perform rectification by simply transforming each pixel
coordinate to a rectified coordinate and then reading the nearest pixel based on the transformation [62].
Others use a more sophisticated bilinear or bicubic interpolation for improved results [76], [80].
Yet others avoid the rectification step altogether and carefully align the stereo cameras as best they
can [73]. However, in general, accurate rectification is required for high-quality stereo correlation.
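The difference between nearest-pixel and bilinear sampling can be illustrated with a short software sketch. It assumes that the rectification mapping has already produced a (possibly fractional) source coordinate (x, y) for the output pixel; the function names and the clamping at the image border are illustrative assumptions and are not taken from the cited implementations.

```python
import math


def sample_nearest(img, x, y):
    """Read the pixel nearest to the fractional source coordinate (x, y)."""
    h, w = len(img), len(img[0])
    xi = min(max(int(math.floor(x + 0.5)), 0), w - 1)
    yi = min(max(int(math.floor(y + 0.5)), 0), h - 1)
    return img[yi][xi]


def sample_bilinear(img, x, y):
    """Blend the four pixels that surround the source coordinate (x, y)."""
    h, w = len(img), len(img[0])
    # Clamp the coordinate to the image (an assumed border policy).
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom
```

Nearest-pixel sampling needs one memory read per output pixel, whereas the bilinear version needs four reads and a few multiplications, which is one reason the resource requirements vary with the implementation.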

C.2

Neighborhood Processing

Many image processing operations are implemented using convolution and box filtering
techniques. That is, the output pixel of an image processing operation is usually a function of
some N × N neighborhood around the same pixel location in the source image, as shown in Figure C.1. This is the type of processing that is used when stereo preprocessing is applied to the input
images, including the LoG filter, rank transform, and census transform. The same technique could
be used for the computation of the similarity measures during stereo correlation, although the similarity measures would require the neighborhoods of two pixels, one from each image. However,
summing optimizations are usually used instead to reduce the amount of computation required
(Section A.7).

[Figure: an input image is mapped to a processed image by the function f(p).]

Figure C.1: Spatial filter. In this example, the output pixel at a coordinate is a function
of a 3 × 3 neighborhood of pixels around the same coordinate in the input image. The
vector p represents the set of nine pixel values from the 3 × 3 window and f represents
the function applied to p.

One of the most fundamental problems then becomes acquiring the N × N neighborhood
of pixels needed for the computation, whether it be a preprocessing filter, a similarity measure, or
some other computation. Conceptually, we can think of these neighborhood processing computations as operating on a sliding window that moves across the image one pixel at a time, from left
to right then top to bottom. This is shown in Figure C.2 for a 3 × 3 window of an image. A typical
3 × 3 filter will require all nine pixels in order to produce the corresponding output pixel. The
window is then slid to the right by one pixel—overlapping by N − 1 columns with the previous
window—to process the next nine pixels.

e11 e12 e13
e21 e22 e23
e31 e32 e33

Figure C.2: 3 × 3 sliding pixel window.

Many strategies could be employed to acquire the needed window of data, and the best
solution will depend on the application requirements. For our purposes, we desire an efficient,
real-time implementation that takes into account the fact that image data is usually delivered from
the camera sequentially, one pixel at a time. This means that, on average, for each pixel input from
the camera, we must also output a processed pixel. Doing so avoids the extra memory
that many stereo implementations use to buffer image data.
An approach well-suited to FPGA implementation and standard digital image sensors is
shown in Figure C.3 [103]. This figure shows the architecture for acquiring a 3 × 3 window of
an image. Although this example is for a 3 × 3 window size, any N × N window can be achieved
similarly by adding additional delay buffers and registers.

[Figure: block diagram with a pixel input, two delay buffers, six registers, and the nine window outputs e11 through e33.]

Figure C.3: 3 × 3 window buffer for parallelizing image data.

The delay buffer elements are simply FIFOs that enforce a delay on the input data equal to
the width of the image. That is, for images that are M pixels wide, a pixel input into the delay buffer
will not be output until an additional M pixels have been input. Each delay buffer parallelizes the
data of a given image row with the previous row. The registers on the right side of the figure
parallelize the data horizontally, allowing three pixels of each row to be output simultaneously.
Thus, for each pixel input into the window buffer, a 3 × 3 window is output in parallel. Since the
window output rate is equal to the pixel input rate, this design is optimal in that it achieves the
highest throughput possible while simultaneously using the minimum amount of buffer memory.
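The behavior of this window buffer can be modeled in software. The following is a minimal sketch, not the hardware description itself: the class and function names are hypothetical, each delay buffer is modeled as a fixed-length queue of M entries, and the newest pixel is assumed to appear at e33, as the tap labels of Figure C.3 suggest.

```python
from collections import deque


class WindowBuffer3x3:
    """Software model of a 3 x 3 window buffer built from two row delay lines."""

    def __init__(self, width):
        # Each delay line holds exactly one image row (width pixels).
        self.delay1 = deque([0] * width, maxlen=width)
        self.delay2 = deque([0] * width, maxlen=width)
        # Three horizontal taps per row. The last tap of each row is the
        # undelayed value, so only two taps per row would be registers.
        self.rows = [[0, 0, 0] for _ in range(3)]

    def push(self, pixel):
        """Shift in one pixel and return the current 3 x 3 window."""
        bottom = pixel              # current row
        middle = self.delay1[0]     # same column, one row earlier
        top = self.delay2[0]        # same column, two rows earlier
        self.delay1.append(bottom)  # maxlen discards the oldest entry
        self.delay2.append(middle)
        for taps, value in zip(self.rows, (top, middle, bottom)):
            taps.pop(0)
            taps.append(value)
        return [list(taps) for taps in self.rows]


def stream_windows(image):
    """Feed an image in raster order and collect one window per input pixel."""
    buf = WindowBuffer3x3(len(image[0]))
    return [[buf.push(p) for p in row] for row in image]
```

One window is returned for every pixel pushed, mirroring the one-window-per-input-pixel throughput described above. The only state in the model is the two row-length queues and the horizontal taps.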
In an FPGA implementation, the delay buffers can be implemented using on-chip RAM
blocks with a small amount of external logic. Given that the RAM capacities are fixed, but the
data widths are configurable, a possible optimization is to use a single on-chip RAM to implement
multiple row buffers. For example, a single Xilinx, 18-kbit, RAMB16 cell configured with an 18-bit data width can be used to implement both of the delay buffers of Figure C.3 for standard 8-bit,
640 × 480 video. In fact, the same RAM configuration can be used to support two delay buffers
with 9-bit pixels and resolutions up to 1024 pixels wide. Configured with the 36-bit output, the
same RAM can be used to implement four delay buffers, each up to 9 bits wide and 512 pixels
deep.
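The capacity figures above follow from simple arithmetic, which can be checked with a short sketch. It assumes that the full 18 kbit (18 × 1024 bits, including the parity bits) is usable as data; the function name is hypothetical.

```python
RAM_BITS = 18 * 1024  # 18-kbit block RAM, parity bits counted as data (assumption)


def delay_buffers_per_ram(port_width, pixel_bits, ram_bits=RAM_BITS):
    """Return (number of delay buffers, depth in pixels) for one RAM block."""
    depth = ram_bits // port_width      # addressable words at this data width
    buffers = port_width // pixel_bits  # pixels packed side by side in one word
    return buffers, depth


print(delay_buffers_per_ram(18, 8))  # 8-bit video with an 18-bit port
print(delay_buffers_per_ram(18, 9))  # 9-bit pixels with an 18-bit port
print(delay_buffers_per_ram(36, 9))  # 9-bit pixels with a 36-bit port
```

With an 18-bit port the RAM is 1024 words deep, which covers a 640-pixel row with room to spare. With a 36-bit port the depth halves to 512 words while the number of buffers doubles.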
This brief description overlooks several important practical considerations. For example,
consider what happens when the sliding window reaches the edge of the image. This is illustrated
in Figure C.4. In this situation the 3 × 3 neighborhood output by the circuit of Figure C.3 includes
pixels from both edges of the image, which generally results in invalid filter output. Similar situations
occur for the first and last row of the image. Simple solutions include discarding invalid
pixels, replacing the invalid pixels with black pixels, or simply allowing the corrupted pixels into
the output, which may not be a problem depending on the application. A “valid” signal can also
be associated with the filter output to distinguish valid pixels from corrupted or invalid data.
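As an illustration, a "valid" signal of this kind can be derived from the row and column counters of the incoming pixel. The sketch below assumes that the newest pixel occupies the bottom-right corner of the window, so the window is complete only once N − 1 rows and N − 1 columns have been received; the function names are hypothetical.

```python
def window_valid(row, col, n=3):
    """True when the n x n window ending at (row, col) lies inside the image."""
    return row >= n - 1 and col >= n - 1


def count_valid(width, height, n=3):
    """Number of pixel positions that produce a valid window."""
    return sum(window_valid(r, c, n) for r in range(height) for c in range(width))
```

For a 640 × 480 image and a 3 × 3 window, 638 × 478 of the output positions are valid under this assumption.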

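[Figure: a 3 × 3 window that straddles the image edge, with the first two window columns (e11, e12, e21, e22, e31, e32) separated from the third column (e13, e23, e33).]

Figure C.4: 3 × 3 window, image edge overlap.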

This type of problem is to be expected in any pipelined, machine vision implementation.
Similar situations occur in the stereo correlation system as well. It is up to the digital hardware
designer to anticipate and deal with these low-level details appropriately when implementing a
system.

C.3

Preprocessing Operations

All of the preprocessing operations described in Section B.2, as well as the census transform,
can be implemented using the neighborhood processing technique of Section C.2. All that
is necessary is the addition of the logic that takes the N × N pixel window and computes the desired result. This is illustrated in Figure C.5. The output of each stereo camera will need to be
preprocessed in the same way, usually necessitating two such modules operating in parallel.
Preprocessors that are based on fundamental image processing operations, such as Gaussian smoothing or LoG filters, can be implemented using standard kernels like those shown in
Figure B.9 for the LoG filter. The non-linear filters, such as the median filter, rank transform, and
census transform, can be implemented using simple comparison and addition operators.
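To make the comparison-and-addition structure concrete, the following sketch computes the census vector and the rank value for a single window. It assumes that a census bit is 1 when the neighbor is less than the center pixel and that bits are ordered in raster order; both conventions, and the function names, are assumptions made for illustration.

```python
def census_bits(window):
    """Compare every neighbor with the center pixel (one comparison each)."""
    n = len(window)
    mid = n // 2
    center = window[mid][mid]
    return [
        1 if window[r][c] < center else 0
        for r in range(n)
        for c in range(n)
        if (r, c) != (mid, mid)
    ]


def rank_value(window):
    """The rank is the number of neighbors that are less than the center."""
    return sum(census_bits(window))
```

For an N × N window, the census vector has N² − 1 bits, so a 7 × 7 window gives 48 comparisons. The rank value reuses the same comparisons and adds up the resulting bits.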

[Figure: block diagram. The pixel input feeds an N × N window buffer, whose outputs e00, e01, ..., e(N-1)(N-1) feed a neighborhood computation block that produces the pixel output.]

Figure C.5: Preprocessing operation structure.

The hardware resources of this preprocessing stage can be significantly reduced through
the use of sparse and generalized transform neighborhoods, as proposed in Sections 4.1, 4.2, 5.1,
and 5.2 for the census and rank. For example, the use of a 16-edge transform for a 7 × 7 window
reduces the number of comparisons necessary in the computation for the rank or census transforms
from 48 to 16. For the rank transform, the number of subsequent summations required to determine
the rank value is reduced from 47 to 15. We must also keep in mind that these sparse transforms
also reduce the number of bits required to represent the transformed pixel, significantly reducing
the hardware requirements in the subsequent correlation modules.
This savings is essentially free, since the 16-edge transform tends to perform as well as the
full 48-edge transform in general. Further savings can be achieved with a sparser neighborhood,
accompanied by a degradation in correlation accuracy. The unused outputs of the N × N window
buffer can simply be ignored, generally being removed automatically by circuit optimization CAD
tools, or a simplified version of the window buffer with only the necessary outputs can be designed.
At this point, we can make some qualitative remarks regarding the relative hardware resource requirements for the different preprocessing methods. Based on the analysis in Appendix B,
the LoG, rank, and census are the most promising transformations to apply during the preprocessing step. Coincidentally, the 7 × 7 filter size gave the best performance for all three of these, making
for relatively easy comparison.
The LoG kernel approximation of Figure B.10(d) gave the best results on average for the
benchmark datasets. If we assume the use of this kernel, which has 33 non-zero coefficients, the
computation of the LoG filter for each pixel requires 32 addition operations. Multiplication is not
required since all coefficients are powers of two. Thus, the multiplications can be performed by
bit shifts by constant amounts, which amount to wiring and have essentially no cost in a hardware implementation.
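The shift-and-add evaluation can be sketched in software as follows. The kernel used here is a small hypothetical example with power-of-two coefficients; it is not the kernel of Figure B.10(d), which is not reproduced in this appendix, and the names are illustrative.

```python
# Hypothetical 3 x 3 kernel with power-of-two coefficients, written as
# (sign, shift) pairs meaning sign * 2**shift; None marks a zero coefficient.
# This is NOT the kernel of Figure B.10(d); it only illustrates the idea.
EXAMPLE_KERNEL = [
    [None,    (-1, 0), None],
    [(-1, 0), (+1, 2), (-1, 0)],
    [None,    (-1, 0), None],
]


def shift_add_filter(window, kernel):
    """Apply the kernel using only shifts and additions (no multipliers)."""
    acc = 0
    for pixel_row, coeff_row in zip(window, kernel):
        for pixel, coeff in zip(pixel_row, coeff_row):
            if coeff is None:
                continue
            sign, shift = coeff
            acc += sign * (pixel << shift)
    return acc


def additions_needed(kernel):
    """Combining k non-zero terms takes k - 1 additions."""
    nonzero = sum(c is not None for row in kernel for c in row)
    return nonzero - 1
```

The example kernel has five non-zero coefficients and therefore needs four additions. By the same count, a kernel with 33 non-zero coefficients needs 32.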

The standard rank and census transforms require 48 comparisons, each of which has essentially the same hardware cost as an addition operation. Thus, without optimization, both the rank
and census transforms are more expensive than the LoG filter computation.
The rank transform also requires the resulting 48 comparison bits to be summed, generating
a 6-bit result. In other words, the rank value is the population count of the census vector. On an
FPGA, computing the population count of an n-bit vector generally requires more resources than
an n-bit adder, with the exact cost depending on the implementation and the size of the vector.
If we instead use a 16-edge transform, we reduce the number of comparisons for the census
and rank transforms from 48 to 16. For the rank transform, an additional 16-bit population count
circuit is required. This makes both the rank and census transforms less expensive than the approximated LoG filter. Although the rank transform requires more resources than the census, the
subsequent correlation hardware for the census will be larger due to the larger size of the census
vector (16-bit) compared to the size of the rank transform value (5-bit). This is significant because
the correlation hardware is generally much larger than the preprocessing hardware due to the level
of hardware parallelization applied to the correlation.
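The counts quoted in this section can be tabulated with a few lines of code. This is only a bookkeeping sketch under the assumptions stated in the text (one comparison per edge, and a rank value that can range from 0 up to the number of edges); the function name is hypothetical.

```python
def transform_costs(n_edges):
    """Operation counts and output widths for a transform with n_edges comparisons."""
    return {
        "comparisons": n_edges,
        "rank_additions": n_edges - 1,        # summing n one-bit values
        "census_bits": n_edges,               # one bit per comparison
        "rank_bits": n_edges.bit_length(),    # bits needed to hold 0..n_edges
    }


for edges in (48, 16):
    print(edges, transform_costs(edges))
```

For 48 edges this gives 48 comparisons, 47 additions, a 48-bit census vector, and a 6-bit rank value. For 16 edges it gives 16 comparisons, 15 additions, a 16-bit census vector, and a 5-bit rank value.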
Of course, further hardware savings can be achieved by using a sparser census or rank
transform. Additionally, as described in Section 5.1.2, we can use the generalized rank and census
transforms to reduce the required preprocessor window size, reducing the memory required to
implement the window buffer.

C.4

Implementing the Left-Right Consistency Check

The left-right consistency check (LRCC) is one of the most common post-processing steps
used in the literature. Introduced in Section B.4, the LRCC is very effective at eliminating incorrect matches with local stereo methods. An implementation of the LRCC using programmable
hardware has also been described in [71].
In order to implement the LRCC, we require the best match for each pixel in the left image
when compared to the d previous pixels in the right image, as well as the best match for each pixel
in the right image when compared to the d subsequent pixels in the left image. In other words, we
require the resulting disparity map with the right image as the reference image and the disparity
map with the left image as the reference image. As it turns out, the necessary similarity measures
are computed for both disparity maps, regardless of which image we choose as the reference for
our implementation.
This can be seen by observing the similarity measure results that are output by each similarity module during each clock cycle of the correlation system’s execution. The timing for the
architecture of Figure 6.1 is shown in Figure C.6, where S(x_l, x_r) denotes the similarity measure output by a similarity module for the comparison of the pixel at horizontal offset x_l in the
left image with the pixel at horizontal offset x_r in the right image. In the figure, the comparison of
a pixel in the left image with five disparities of the right image is highlighted in green (horizontal)
and the comparison of a pixel in the right image with five disparities in the left image is highlighted
in yellow (diagonal).

          Disp. 0    Disp. 1    Disp. 2    Disp. 3    Disp. 4    ...
Iter. 0   S(0, 0)    S(0, −1)   S(0, −2)   S(0, −3)   S(0, −4)   ...
Iter. 1   S(1, 1)    S(1, 0)    S(1, −1)   S(1, −2)   S(1, −3)   ...
Iter. 2   S(2, 2)    S(2, 1)    S(2, 0)    S(2, −1)   S(2, −2)   ...
Iter. 3   S(3, 3)    S(3, 2)    S(3, 1)    S(3, 0)    S(3, −1)   ...
Iter. 4   S(4, 4)    S(4, 3)    S(4, 2)    S(4, 1)    S(4, 0)    ...
Iter. 5   S(5, 5)    S(5, 4)    S(5, 3)    S(5, 2)    S(5, 1)    ...
...       ...        ...        ...        ...        ...        ...

Figure C.6: Correlation system timing.

With the timing shown, the output of the general correlation architecture in Figure 6.1 is
the disparity map with the left image as the reference image (i.e., the best match of the similarity
measures highlighted in green). To obtain the disparity map with the right image as the reference
image (i.e., the best match of the similarity computations highlighted in yellow), we need only
build an additional structure of d − 1 select-best modules. The correlation architecture with this
addition is shown in Figure C.7. Note that for clarity, this figure uses a linear search structure for
the left reference image, but a tree structure would lead to better performance since it minimizes
the critical path of the search.
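The pattern in Figure C.6 can be summarized by a simple rule: at iteration t, the similarity module for disparity k outputs S(t, t − k). The following sketch encodes that rule, which is read off the entries of the figure, and lists the (iteration, module) slots that contribute to each reference image. The function names are hypothetical.

```python
def similarity_output(iteration, module):
    """(x_l, x_r) compared by a similarity module at a given iteration."""
    return (iteration, iteration - module)


def left_reference_slots(x_l, d):
    """All d candidates for left pixel x_l appear in the same iteration."""
    return [(x_l, k) for k in range(d)]


def right_reference_slots(x_r, d):
    """Candidates for right pixel x_r appear one iteration apart (a diagonal)."""
    return [(x_r + k, k) for k in range(d)]
```

Because the candidates for a right-image pixel arrive in consecutive iterations, a chain of select-best modules separated by single-cycle delays can accumulate the best match as the diagonal is traversed.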

[Figure: block diagram. Labeled blocks: left and right pixel inputs, each with a preprocessing stage; the pixel distribution network; Z^-1 delay elements; similarity modules 0 through d-1; select-best modules L1 through L(d-1), producing the match for the left reference image; and select-best modules R1 through R(d-1), producing the match for the right reference image.]

Figure C.7: Correlation architecture with LRCC.

C.5

Pipelining the LRCC

The architecture of Figure 6.1 suffers from a long critical path in the select-best network that limits the
maximum clock rate. Additionally, the delays on a large pixel distribution network also have the
potential to limit performance. Therefore, for high-performance applications, it is desirable to pipeline
the architecture, as shown in Figure 6.2.
If we pipeline the distribution network in this way, the timing of the correlation modules
changes from that of Figure C.6, resulting in the timing shown in Figure C.8. Again, the comparison of a pixel in the left image with five pixels in the right image is highlighted in green (diagonal),
and the comparison of a pixel in the right image with five pixels in the left image is highlighted in
yellow (steep diagonal).
In order to obtain the disparity map with the right image as the reference image, for the
purposes of the left-right consistency check, we must adjust the delays to compensate for the
delays in the pixel distribution network, as shown in Figure C.9.
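The entries of Figure C.8 follow the rule that, at iteration t, the similarity module for disparity k outputs S(t − k, t − 2k). The sketch below encodes this rule, which is read off the figure rather than derived from the hardware description, and shows why the candidates for a right-image pixel are now two iterations apart. The function names are hypothetical.

```python
def pipelined_output(iteration, module):
    """(x_l, x_r) compared by a similarity module in the pipelined design."""
    return (iteration - module, iteration - 2 * module)


def left_reference_slots(x_l, d):
    """Candidates for left pixel x_l arrive one iteration apart."""
    return [(x_l + k, k) for k in range(d)]


def right_reference_slots(x_r, d):
    """Candidates for right pixel x_r arrive two iterations apart."""
    return [(x_r + 2 * k, k) for k in range(d)]
```

The two-iteration spacing along the right-reference diagonal is consistent with the Z^-2 delay elements that appear in Figure C.9.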

C.6

System Organization

The architectures described in the preceding sections are well suited for implementation on
an FPGA, such as the one found on Helios. Figure C.10 shows a possible system organization for
an FPGA-based computing platform supporting stereo vision.

          Disp. 0    Disp. 1      Disp. 2      Disp. 3      Disp. 4      ...
Iter. 0   S(0, 0)    S(−1, −2)    S(−2, −4)    S(−3, −6)    S(−4, −8)    ...
Iter. 1   S(1, 1)    S(0, −1)     S(−1, −3)    S(−2, −5)    S(−3, −7)    ...
Iter. 2   S(2, 2)    S(1, 0)      S(0, −2)     S(−1, −4)    S(−2, −6)    ...
Iter. 3   S(3, 3)    S(2, 1)      S(1, −1)     S(0, −3)     S(−1, −5)    ...
Iter. 4   S(4, 4)    S(3, 2)      S(2, 0)      S(1, −2)     S(0, −4)     ...
Iter. 5   S(5, 5)    S(4, 3)      S(3, 1)      S(2, −1)     S(1, −3)     ...
Iter. 6   S(6, 6)    S(5, 4)      S(4, 2)      S(3, 0)      S(2, −2)     ...
Iter. 7   S(7, 7)    S(6, 5)      S(5, 3)      S(4, 1)      S(3, −1)     ...
Iter. 8   S(8, 8)    S(7, 6)      S(6, 4)      S(5, 2)      S(4, 0)      ...
Iter. 9   S(9, 9)    S(8, 7)      S(7, 5)      S(6, 3)      S(5, 1)      ...
...       ...        ...          ...          ...          ...          ...

Figure C.8: Pipelined correlation system timing.

[Figure: block diagram. Labeled blocks: left and right pixel inputs, each with a preprocessing stage; the pixel distribution network; Z^-1 and Z^-2 delay elements; similarity modules 0 through d-1; select-best modules L1 through L(d-1), producing the match for the left reference image; and select-best modules R1 through R(d-1), producing the match for the right reference image.]

Figure C.9: Fully pipelined correlation architecture with LRCC.

In the figure, the raw input images are first sent to the frame synchronization module.
This module serves two purposes. First, it synchronizes the camera data with the FPGA system
clock. Second, it synchronizes the two video frames from the cameras so that the stereo images are
perfectly aligned when they leave the module. The synchronized image frames are then rectified,
with the SRAM being used as the image buffer for both video frames. The stereo correlation
module then performs the remaining preprocessing, correlation, and post-processing. The resulting
disparity maps from the stereo correlation module are then output to a bus interface for high-speed
transmission over the system bus to SDRAM memory.

Once in SDRAM memory, high-level vision algorithms running on the PowerPC can process and interpret the resulting disparity map, based on the application. Such a system organization
is appropriate for a variety of applications, ranging from autonomous navigation for ground robots
to real-time aids for the visually impaired.
One worthy addition would be the capability of transmitting the raw camera images instead
of or in addition to the disparity maps. The ability to view the raw images would be essential
for calibration, which determines the parameters of the rectification, and would likely prove very
useful for debugging purposes.

[Figure: block diagram. Labeled blocks: Camera (two), Frame Synch., Rectification, SRAM, Stereo Correlation, IP Bus Interface, DMA, Memory Controller, SDRAM, 64-bit Processor Local Bus (PLB), PLB to OPB Bridge, PowerPC Processor, APU, FPU, and FPGA.]

Figure C.10: Helios system organization for stereo vision.

APPENDIX D.

THE HELIOS ROBOTIC VISION PLATFORM

The Helios Robotic Vision Platform is an expandable, single-board computer (SBC) intended for real-time vision processing in small systems, such as small autonomous vehicles. The
core of this computing platform is an FPGA with one or more on-chip, embedded, general-purpose
RISC processors.
Although many computing platforms exist that could have served as the basis for Helios, as discussed in Section 2.2, this work explores the use of FPGAs with tightly coupled general-purpose processors. This decision is justified primarily by the qualitative arguments
provided in Section 2.3 and the successful work of numerous researchers cited in this dissertation.
The design for Helios was the result of research and experience using on-board FPGAs for
real-time vision processing on small mobile robots in the Robotic Vision Laboratory at Brigham
Young University [104]–[106]. Although intended specifically for real-time vision applications,
the Helios board itself is made up of general-purpose components, such as the integrated general-purpose processors, the FPGA, and memories. As a result, Helios is actually well suited to a variety
of applications requiring a small, low-power, high-performance computing platform, regardless of
the need for vision processing. Nevertheless, vision processing was a key influence in the design
of the Helios platform.
This chapter will describe the Helios platform in some detail and will discuss many of the
constraints that influenced the final design. The design of a board such as Helios is a very
time-consuming and complex process involving a large number of interrelated trade-offs. As a result,
this chapter will not discuss all aspects of the design in detail, but instead will only highlight some
of the key engineering decisions and design issues related to the design of Helios.

259

D.1

Related Work

The literature is rich in examples of research projects focused on integrating machine vision
into small vehicles. Much of this research has taken the approach of simply mounting one or
more cameras on the vehicle and somehow transmitting the video to a base station for processing.
This approach is especially popular with small aerial vehicles because it circumvents the many
difficulties related to putting real-time, image processing computers on these vehicles, which have
limited payload and power supply capacity. For example, the work of Ettinger et al. [107] investigated the use of vision systems for horizon detection in micro aerial vehicles (MAVs). Their
small, fixed-winged aircraft had a camera and a wireless transmitter that would transmit video
data to the ground where it was processed by a personal computer (PC). The work of Ruffier and
Franceschini [108] investigated an optical flow technique for use in automatic takeoff and landing
as well as terrain following and wind correction. In their work, a small helicopter was tethered
to the ground where real-time processing was performed using a PC equipped with a DSP board.
More recent work with aerial vehicles has been done by the Magicc Laboratory at Brigham Young
University. Most of their work involves small, fixed-wing UAVs with a CCD camera, mounted
on a movable gimbal, and a wireless video transmitter. Video to be processed for machine vision
applications is transmitted to the ground for processing on a PC [109].
Some vehicles, particularly ground vehicles, have also been developed where the vision
processing system was placed on-board the vehicle. The most common approach is to use a vehicle
that is large enough to support the weight of a notebook or other small computer. For example, the
work of Takezawa and Dissanayake [110] on the simultaneous localization and mapping problem
used a Pioneer robot and a trinocular camera from Point Grey Research, along with the Triclops
SDK running on a PC mounted on the robot. Larger vehicles, such as those used in the DARPA
Grand Challenge, generally use the same approach.
Some efforts have also been made to put real-time, vision processing systems on vehicles
too small for the mounting of a PC. One of the earliest attempts to miniaturize a machine vision
system was by Konolige [69]. He developed a stereo vision system, called the SVM (Small Vision
Module) using an Analog Devices ADSP 2181 digital signal processor and two 320 × 240 CMOS
image sensors. The system was able to achieve a typical processing rate of 8 fps on 160 × 120
images. This work was later followed with the SVM II which had a much more powerful DSP.
Mahlknecht et al. developed the Tinyphoon mobile robot [111], which used an on-board Analog
Devices Blackfin DSP to implement an image recognition system for robot soccer. Research has
also been focused on the development of custom ASICs for biologically inspired vision systems on
very small vehicles. The work of Wood et al. [112] used the Centeye MG1 optical flow sensor and
two PIC microcontrollers to create a tiny autonomous aircraft with primitive vision capabilities.
In the literature, we also find examples of more closely related work involving FPGAs
for robotic vision. For example, the work of Yamada et al. [113] consisted of a small, hovering, four-rotor aircraft with four CIF resolution cameras that were used to detect motion for aircraft stabilization. The computing platform was implemented primarily using a Xilinx Virtex-II
XC2V1500 FPGA, which executed an optical flow algorithm using multiple cameras as input. In
all, the primary FPGA board consumed only 4.32 W after configuration. Another vision processing system implemented using FPGAs on a small vehicle is the trinocular vision system of Jia et
al. [80], on which a single Xilinx Virtex-II XC2V2000 FPGA was used to implement the vision
system. They then mounted this system on a Lynx Hexapod robot to investigate autonomous robot
navigation [114].
More recently, Kim et al. developed a real-time, stereo vision system consisting of a Xilinx
Virtex-II XC2V6000 FPGA and an Intel XScale PXA270 embedded processor [115]. Their system
is architecturally quite similar to Helios, consisting of an FPGA tightly coupled to a low-power,
general-purpose processor. However, their system was designed specifically for stereo vision and
they made no attempt to miniaturize the hardware, which consisted of large development boards.
Kolar et al. also proposed the use of an FPGA coupled with a general-purpose embedded processor [116]. Their system used a Xilinx Virtex-II Pro XC2VP30 FPGA, with an on-chip, embedded
PowerPC processor, for use in real-time 3D reconstruction. Another similar architecture is found in
the PASTA (Power Aware Sensing, Tracking, and Analysis) project [117]. The PASTA Microsensor system consisted of a set of stackable circuit boards, allowing the user to combine different
modules for different applications. The primary module is a processor module with an Intel XScale PXA255 embedded processor. Other available modules included a power and I/O module, a
compact flash module, an A/D converter board, and an FPGA coprocessor module with a Xilinx
Virtex-II XC2V3000 FPGA, 2 MB of synchronous ZBT SRAM, and 32 MB of SDRAM. The
PASTA Microsensor was designed to investigate power-efficient systems for unattended ground
sensor applications. Yet when the FPGA module is included, the architecture is very similar to that
of Helios.
The essential elements of Helios can be found scattered in examples throughout the literature, but to my knowledge no research group has attempted to develop a general-purpose vision
processing platform based on FPGA technology for use on small autonomous vehicles. In fact,
few attempts have been made to put vision on very small vehicles at all due to the lack of a computing platform with the performance, small size, and low power suitable for mounting on such
vehicles. The closest example of a possible solution is the PASTA Microsensor, although even this
system was intended for an entirely different purpose and, as a result, has some unnecessary and
undesirable features. Most other systems have been designed for very specific vision applications
and lack the flexibility to be applied to many other applications. This lack of a small, low-power,
general-purpose, embedded vision platform set the stage for the research and development of Helios.

D.2

Design Constraints

Many FPGA boards are readily available on the market. However, at the time Helios was
being planned, no commercial board existed with the characteristics needed to meet the
constraints of real-time vision processing on very small systems, such as small autonomous vehicles. Most readily available FPGA boards were large, power-hungry development boards intended
to assist in the evaluation of FPGAs and related technologies. Smaller boards were very expensive
and were generally designed for specific applications, making them largely unsuitable as academic
research platforms for machine vision.
A variety of characteristics can make a given FPGA board unsuitable for the applications
targeted by Helios. This section will provide an overview of the key design constraints that influenced the design of Helios and made it suitable for real-time vision processing in small systems.

D.2.1

Performance

Real-time image processing and machine vision algorithms are well known for their high
computational demand, making performance one of the most important factors influencing the
design of Helios. Helios would therefore need a relatively large, high-performance FPGA,
providing sufficient programmable resources to implement a variety of demanding machine vision
algorithms. However, physical size, power consumption, and cost would prohibit the use of the
very largest FPGAs.
Similarly, the general-purpose processor would likely need performance much greater than
that of a typical embedded microcontroller. However, the size and power requirements would prohibit the use of non-embedded processors, such as typical workstation, desktop, and even notebook
computer processors.
In addition to the FPGA and processor, the memories employed would need to provide sufficient bandwidth for real-time buffering of one or more video streams, as required by many machine vision algorithms. Some vision algorithms would even require a single image to be buffered
and read multiple times at different stages in the algorithm.
As with nearly all engineering decisions, the selection of the FPGA, processor, and memories must be a compromise between performance, size, power, cost, and availability. The actual
selection of these components will be discussed in detail in Section D.4.

D.2.2

Flexibility and Expandability

Research in the Robotic Vision Laboratory involves many different kinds of vision algorithms.
Designing a high-performance computing board specifically for each algorithm or application would be a very expensive and time-consuming undertaking. Instead, Helios would need to
be a very flexible and expandable platform that would allow us to explore and apply a variety of
machine vision algorithms.
The FPGA and on-board, general-purpose processors are key to this flexibility. Both are
reconfigurable or programmable, allowing their functionality to be changed as needed for the application. Additionally, an FPGA can be configured to connect to virtually any digital interface,
allowing Helios to communicate directly with most digital devices. External circuitry can also
be added, allowing the FPGA or processor to communicate with virtually any digital or analog
component.
Yet this flexibility is not sufficient by itself. Each application has unique needs for different
components and interfaces. Some applications may require servo control, wireless communication,
motor control, SPI, I²C, and other interfaces while other applications may require none of these.
Some applications may need only one camera, while others may need more. Additionally, different
cameras may require different physical connectors and interfaces.
The solution to this problem is to design Helios with only the essential components common to most applications involving machine vision for small, autonomous vehicles. The remaining
functionality and interfaces would be provided by one or more daughter boards that could be connected to Helios using one or more large, fine-pitch, general-purpose I/O connectors. Each application or vehicle platform could then have its own specially designed daughter board having only
the interfaces and devices necessary for that application. Such a daughter board would typically
be much simpler and far less expensive than the Helios board.

D.2.3

Size and Weight

Since one of the targeted applications is small vehicles, Helios would need to be very
small. For our purposes, Helios needed to be small enough and light enough to run autonomously
on vehicles as small as a 1/12th scale R/C (radio control) car chassis or the Zagi wing airframe,
with a 48 in wingspan, used by the Magicc Laboratory at Brigham Young University.
One of the most common form factors for embedded computers is PC/104 and its derivatives [118]. This is a highly modular standard that allows for boards of the same size to be stacked
together and connected by an ISA or PCI bus. Measuring 3.6 in × 3.8 in (91.44 mm × 96.52 mm),
with a board stacking height of 0.6 in (15.24 mm), this form factor is suitable for many embedded
applications, but it is still a bit large for some small vehicles and larger than was necessary for the
essential elements of Helios. Additionally, the need for small size and low power consumption
meant that we would be unable to use standard ISA and PCI frame grabber boards and compatible
cameras, focusing instead on smaller camera modules with proprietary interfaces. This diminished
the need for a standard bus and highlighted the need for general-purpose interfaces.
Through discussion with the Magicc Laboratory and based on our own experience, we
decided that the Helios board should weigh no more than 50 g and have an area no more than
9 cm × 9 cm (3.5 in × 3.5 in), or at least 10% smaller than PC/104, with a stacking height of
1 cm (0.4 in) or less. This would allow us to target all of the robotic platforms used in the Robotic
Vision Laboratory without excessive bulkiness. Using a stacking configuration (see Section D.2.2)
would allow us to minimize the overall width and length of a Helios system by adding application
specific components and interfaces to Helios through daughter boards that only increase the size
of Helios vertically.
To help reach these size and weight goals, we would need to choose components in the
smallest packages available. This meant using ball grid array (BGA) packages and fine-pitch
devices whenever possible. This would generally increase the cost of the components and the cost
of assembly, yet this cost would be somewhat offset by the fact that we were creating a single,
highly flexible board that could be used for a variety of applications. Instead of designing several
different boards and having a few of each built for each application, we would design a single board
and have several built for all our applications, thus reducing the per board cost through increased
production quantities.
Another factor that has a significant effect on the size and weight of a Helios system is
power consumption. The power consumption of Helios largely dictates the size and weight of the
battery needed to supply the Helios board and any daughter boards. It also strongly influences
the size and weight of the voltage regulators needed for Helios. As a result, minimizing power
consumption also tends to minimize the size and weight of the system.

D.2.4 Power Consumption

The power consumption would need to be as low as possible for the Helios board, in order
to extend the run-time for autonomous operation and to minimize the size and weight of the
batteries and voltage regulators required to power Helios. This means choosing low-power or
low-voltage versions of components wherever possible. It also means designing efficient voltage
regulators to step down the battery power supply voltage to that needed for the on-board components while wasting as little power as possible.
FPGAs tend to be power-hungry devices compared to some other embedded devices. In order to minimize the power consumption of the FPGA, the most power efficient, high-performance
FPGA would need to be selected. As technology has advanced, FPGA vendors have made significant progress in reducing the static and dynamic power consumption of FPGA devices. Therefore,
Helios would need to employ one of the most recent and technologically advanced FPGAs available.
The amount of power consumed by an FPGA depends largely on the design programmed
into the FPGA. Therefore, it is not practical to specify the amount of power consumption to be
expected for a specific FPGA model. Additionally, the power budget for Helios will vary with the
application for which it is being used. As a result, it is not practical to specify a power budget for
Helios. Nevertheless, for some of the platforms being used in our research, such as a 1/10th scale
R/C truck chassis and the Zagi UAV airframe, it appeared that Helios should be able to perform
computationally intensive image processing functions while consuming less than 5 W of power.
This power target would allow us to maintain reasonable run times on these platforms.

D.2.5 Cost

Cost is an important factor in almost any application, and the purchase cost of Helios for
use in our research is no exception. Helios would necessarily consist of many technologically
advanced components. Unfortunately, the use of the latest technology is somewhat at odds with
the need to keep costs low.
There are several methods for reducing the costs of a board such as Helios. For example,
the printed circuit board (PCB) is one of the most expensive components, particularly given the
small quantities we were expecting to build. As a result, the smaller the PCB, the lower the overall
costs. Additionally, minimizing the number of metal layers in the PCB has a dramatic effect on the
cost of the PCBs (see Figure D.10). Similarly, minimizing the number of board components (i.e.,
the number of chips, capacitors, resistors, and other components to be soldered) reduces the cost
of assembly.
Another method of reducing costs is careful part selection. Some components are commonly used in a variety of embedded applications. Widely used components are produced in high
volumes and therefore generally have a lower purchase price. To take advantage of this, Helios
would need to be built using widely available and commonly used embedded components, rather
than unusual or special-purpose components.

D.2.6 Component Availability

Component availability is an important practical concern for low quantity builds, such as
Helios. Manufacturers sell components in reels, tubes, or trays, each containing anywhere from
dozens to hundreds of units. Distributors are not willing to break open these packages to sell a
small number of components unless the component is commonly used and will likely attract a
large number of customers wanting to purchase small quantities. We intended to build a relatively
small number of Helios boards and therefore needed small quantities of most board components.
This further restricts us to the most commonly used components since purchasing a whole tray or
reel is far too expensive.

D.2.7 Compatibility

Compatibility with existing software, hardware, and other resources was also an important
concern. Redesigning all of this infrastructure from scratch would be a significant and error prone
undertaking. Instead, we would need Helios to be compatible with existing development tools,
such as those provided by the FPGA vendor, and other commonly available tools. Additionally,
we would need Helios to be able to use existing and readily available hardware IP modules for the
FPGA, such as memory controllers, system buses, communications interfaces, etc. Components or
technologies that would require significant development or costly investment in proprietary tools
would be avoided.
Additionally, Helios would need to have strong voltage compatibility. Since Helios is intended to be used for a variety of embedded vision applications, it is impossible to know exactly
what power sources will be available to power Helios on the system in question. For this reason,
it is not practical to constrain the power system of Helios to a small voltage range. Instead, Helios
would need to be able to run on a wide range of power sources, such as wall AC/DC converters or
batteries, as well as a wide range of voltages.

D.3 Helios System Concept

Based on the constraints described in Section D.2, we were able to develop a rough idea
of the form and features Helios would have. Our initial concept of Helios’ form and features is
described in this section.

D.3.1 The Stacking Concept

One of the most important requirements that would define the form of Helios and how
it would be used is the requirement of flexibility and expandability (Section D.2.2). In order to
satisfy this requirement, Helios would be a stackable system, similar to the PC/104 stack, except
that Helios would not be constrained to a standard bus such as PCI or ISA. Figure D.1 shows
diagrammatically how Helios could allow expansion through the stacking of a daughter board.



 

 

Figure D.1: Helios stacking concept.

The use of stackable daughter boards provides several key advantages. First, the ability
to stack on daughter boards with additional components specific to different applications allows
Helios to be flexible and able to support a wide variety of applications.
Second, it helps to keep Helios smaller, lighter, and more power efficient. A single Helios
board with all of the components we anticipated we would ever need would necessarily be very
large. Many of the components would go unused, depending on the application, adding unnecessary weight, size, and power consumption. Using a daughter board tailored to the application
allows us to eliminate unnecessary components. Additionally, by allowing the daughter board to
stack onto Helios vertically, the added components increase the size of Helios a small amount vertically, rather than expanding Helios a large amount horizontally. This minimizes the maximum
dimensions of Helios, allowing it to fit into smaller spaces.
Third, such a stacking arrangement reduces the cost of the overall Helios system. The
application-specific components that make up the daughter boards tend to be simple and inexpensive components, such as headers and other connectors. The simple nature of such daughter
boards makes them much easier to design and assemble, thus reducing costs. Additionally, these
simple boards can be designed with fewer layers, which has a dramatic effect on PCB cost (see
Figure D.10). By moving the application-specific components to a less expensive daughter board,
the more complex and expensive Helios board is reduced in size, making it more cost-effective to
design and manufacture. Additionally, since every Helios system can use the same Helios board,
Helios production volumes can be increased. As will be shown, the cost of each PCB and the cost
of assembly for each board is dramatically reduced if the quantity of boards ordered is increased.
The stackability of Helios need not be restricted to a single daughter board. Instead, it is
possible to design daughter boards that allow multiple boards to be stacked above or below Helios.
However, in most applications, only one daughter board would be necessary. Additionally, the
daughter board would most likely be placed on top of Helios, since the daughter board would be
more likely to have large or vertical connectors and other components that would make stacking
underneath Helios less attractive.

D.3.2 Essential Components

The selection of components to be used in Helios was a complex process. As with most
engineering decisions, the choice of components to use for a particular purpose is a trade-off
between various factors, such as performance, power consumption, size, cost, availability, ease of
use, familiarity, and so forth. The selection of the most important components for Helios will be
justified and described in detail in Section D.4. During the design concept stage, we developed
a rough idea of what components Helios would need to have. Much of this was based on our
previous work with FPGA development boards. The essential components that were needed for
Helios are listed below.
• FPGA
• General-purpose processor
• Random access memory (RAM)
• Non-volatile memory (NVM)
• Communication interface(s)
• Debug and configuration interface
• Expansion header
• Power supplies
• Simple buttons, switches, and LEDs
Clearly, Helios would need to have an FPGA, since this was the core of the computing platform we decided to use for Helios. Additionally, Helios would require a general-purpose processor
to fulfill the computing roles not well suited to processing with custom hardware. Both the FPGA
and the general-purpose processor would require random access memory (RAM).
Choosing the type, number, and size of the RAM chips would prove to be one of the
most difficult and consequential decisions. When size and power consumption are not significant
concerns, memory modules such as the DIMM (Dual In-line Memory Module) or the smaller
SO-DIMM (Small Outline DIMM) can be used. These modules typically consist of a bank of
anywhere from 4 to 16 memory chips. However, they are far too large to be used on Helios. This
meant that the memory chips would need to be integrated into Helios rather than using standard
modules. In order to keep Helios small, it would also be very important to minimize the number
of memory chips. The RAM options for Helios could be divided into two categories: DRAM
(Dynamic Random Access Memory) and SRAM (Static Random Access Memory). Helios could
use one or more DRAM chips, one or more SRAM chips, or a combination of the two.
In addition to RAM, Helios would also require non-volatile memory, which is memory
that retains its data after the power is turned off. This type of memory would be used to store the
software for the general-purpose processor and the configuration of the FPGA, which is typically
a volatile device.
Helios would also need to have basic communication interfaces. At a minimum, Helios
needed to have a serial port interface. This simple interface would be a fundamental data communications link that could be relied upon during early debugging and development stages. We also
felt that Helios would benefit from a standard, high-speed PC interface, such as USB or FireWire.
Such a high-speed interface would allow us to transfer video and other data in real-time from
Helios to a PC for evaluation or debugging purposes. Helios would also need communication interfaces for debugging both the processor and the FPGA, as well as configuring the FPGA. JTAG
(Joint Test Action Group) is the standard interface for such connections on FPGAs.
An expansion header would be the means by which Helios could be expanded with daughter
boards and targeted to a specific application. Efficient voltage regulators would be necessary to
convert the input power source (possibly ranging from 5 V to 28 V) to the various voltages required
by the different devices on Helios and the daughter board.
Finally, a collection of buttons, switches, and LEDs would serve as basic input and output
devices in situations where a PC may not be available to provide more sophisticated user control.
We experimented with a large number of different layout schemes in order to determine
how all of these components would fit together on the Helios board. The final component layout
could not be determined precisely until component selection was completed and the complete
schematic had been drawn up. Figure D.2 shows one of the more mature concept drawings of the
Helios layout. The board and component packages are drawn to scale, based on likely component
selections and standard packages. This drawing shows a version of Helios that is 90 mm × 70 mm.
In the end, we were able to reduce the size to 90 mm × 65 mm, or about 3.5 in × 2.5 in.
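As a rough check on how the final footprint compares with the PC/104 form factor discussed in Section D.2.3, the following Python sketch computes both board areas from the dimensions quoted in the text. The function names are illustrative only, and the comparison considers board area alone (stacking height and weight are ignored).

```python
# Board footprints in millimeters, taken from the text.
PC104_MM = (91.44, 96.52)        # PC/104 form factor (3.6 in x 3.8 in)
HELIOS_FINAL_MM = (90.0, 65.0)   # final Helios board


def area_mm2(dims):
    """Return the rectangular footprint area in square millimeters."""
    width, length = dims
    return width * length


def percent_smaller(board, reference):
    """Percentage by which `board` has less area than `reference`."""
    return 100.0 * (1.0 - area_mm2(board) / area_mm2(reference))


if __name__ == "__main__":
    print(f"PC/104 area: {area_mm2(PC104_MM):.0f} mm^2")
    print(f"Helios area: {area_mm2(HELIOS_FINAL_MM):.0f} mm^2")
    print(f"Helios is {percent_smaller(HELIOS_FINAL_MM, PC104_MM):.1f}% smaller")
```

By area alone, the final 90 mm × 65 mm board comes out roughly one third smaller than a PC/104 board.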

Figure D.2: Helios layout concept.

D.4 System Component Selection

Selection of the essential system components for Helios was performed based on the constraints
discussed in Section D.2. This section will discuss in more detail the essential system
components that make up the Helios board. Attention will be given to the purpose of each component in the overall system and why a specific product type was chosen to fulfill that role.
Much of the component selection was guided by our previous work using FPGA development boards in mobile robots with on-board vision [104]–[106]. For most of this previous research,
we used a Xilinx Virtex-II XC2V1000 FPGA with a MicroBlaze soft processor core running on the
FPGA. This FPGA used a 130 nm process technology and a core voltage of 1.5 V. A MicroBlaze
was used on the FPGA simply because we had no development boards with an external processor. Our systems implemented using the Virtex-II FPGA were typically clocked at 75 MHz, with
the MicroBlaze clocked at the same speed. Based on this previous work, it was clear that Helios
needed to deliver higher performance, allow higher logic capacities, and use less power than our
existing development boards.

D.4.1 General-Purpose Processor

FPGAs, by themselves, are not well suited to the implementation of all aspects of a machine
vision system. Computations such as those related to system control and other processing, which
often do not involve highly regular or signal processing operations, are better suited to implementation on a general-purpose processor (see Section 2.3). As a result, a general-purpose processor
(GPP), also commonly called a central processing unit (CPU), was needed to serve as a system
controller.
The GPP could also be used to implement some of the high-level machine vision operations, which tend to be less regular than typical signal processing. As a general rule, it is much
easier to implement processing in software for a GPP than it is to design a custom hardware circuit
to perform the same function. The more complex and irregular the processing, the greater the advantage demonstrated by a software implementation. Therefore, it is desirable to have a processor
with sufficient performance to allow the user to implement some of the high-level machine vision
operations, while reserving the FPGA for more regular or computationally intensive processing.
This approach gives the users of Helios tremendous flexibility. For example, a designer
can implement the entire machine vision algorithm in software and only use custom hardware to
accelerate the system as needed to reach the desired level of real-time performance. Alternatively,
the designer may already know which portions of the algorithm will lend themselves to hardware
implementation, allowing the designer to do the hardware/software partitioning at the outset.
With increased CPU performance comes increased power consumption. The optimal balance between performance and power consumption depends on the application for which Helios
will be used. For some machine vision algorithms, a GPP may not be needed at all. For others, a
high-performance GPP may be a necessity in order to implement complex, real-time behavior. As
a result, no single processor choice will be optimal for Helios. Yet it seems clear that a processor
on the faster end of the embedded processor performance spectrum (e.g., more than 100 DMIPS)
would be very helpful for many applications.

Internal vs. External Processor
The options for processor selection can be divided into two categories: processors that are
external to the FPGA (off-chip) and processors that are internal to the FPGA (on-chip). Each of
these options has various advantages and disadvantages.
Using an external discrete processor has the advantage of a much wider variety of options.
Given that we would like a processor at the higher end of the embedded spectrum, we can, however,
limit our choices to 32-bit processors. Some of the most common 32-bit architectures for embedded applications include the ARM, PowerPC, and MIPS architectures. External discrete processors
also have the advantage that they are generally available with a wide variety of integrated peripherals, such as serial, SPI, I2C, USB, Ethernet, CAN, LCD, PCI, and other capabilities. This reduces
the number of external components required for many applications and may reduce the amount of
resources needed to be dedicated to these functions on the FPGA. Yet, these extra peripherals, if
unused, can also be a disadvantage in that they waste power, increase the cost of the processor, and
may increase the size of the processor.
Another key disadvantage to an external processor is the added board complexity. The
presence of the external processor alone significantly increases the size of the printed circuit board
(PCB) required and greatly complicates the task of routing board signals. The increased board
complexity may necessitate a PCB with more layers, which can dramatically increase board costs.
In addition, the processor will require one or more power supplies; typically one supply for the
processor’s core and additional power supplies for each I/O signaling voltage. Minimizing the
number of power supplies is key to keeping the board small and inexpensive, and sharing a power
supply with another device, such as the FPGA, is only possible if the two devices use the same
voltage.
A final disadvantage of using an external processor is the necessary coupling between the
processor and the FPGA. In order to allow for efficient communication and data transfer between
the CPU and FPGA, at least a 32-bit data bus would be required. The FPGA would therefore need
more than 32 I/O pins dedicated to this external interface. Supporting such a bus may require an
FPGA in a larger package, which is more expensive and further complicates the PCB design. An
off-chip, 32-bit bus operating at high speed also consumes a significant amount of power.
Using a processor internal to the FPGA solves many of these problems. Most importantly,
integrating the CPU and FPGA into a single package greatly reduces the size and complexity of
the Helios board. Specifically, it simplifies power distribution and minimizes the number of power
supplies needed to power the CPU and the FPGA. Additionally, the peripherals for an internal
processor are completely optional because they are configurable. Only those peripherals necessary
to the application need to be included, and even those that are necessary can be custom tailored and
optimized for the application. Finally, there’s no need for the FPGA to have an external interface
to a CPU, minimizing the number of I/O pins needed, and therefore minimizing the required size
of the FPGA package and the Helios board.
Using an internal processor also provides considerable flexibility. The CPU can be tightly
or loosely coupled to the custom hardware of the FPGA, as desired. System peripherals need not
be dedicated to or even shared with the CPU, but instead can be interfaced directly with custom
hardware cores that may or may not have a direct connection to the CPU. Rather than having separate memories dedicated to the CPU and custom hardware, memories can be more easily shared
or different memory types with different timing characteristics can be used for each, depending on
the application.
Using an internal processor also has disadvantages. For example, if the application requires
that the processor have a large number of peripherals, then a large amount of FPGA resources
will likely be consumed to implement them. Additionally, because they’re implemented using
programmable logic, these peripherals will be much less efficient than the hard implementation of
the peripherals in an external CPU, in that they will consume more power and effectively use more
silicon area (see Section 2.3.2). Additionally, these peripherals will likely require the use of I/O
pins on the FPGA, which may necessitate a larger FPGA package.

Internal Processors: Hard vs. Soft Cores
When it comes to the selection of an internal processor, there are again two fundamental
options: a hard processor core or a soft processor core. A hard processor core is one in which
the actual transistor-level design of the CPU is fabricated on the silicon die of the FPGA, right
along with the programmable logic. As a result, the hard processor core has the same benefits
of low power consumption and high performance as any external processor. Additionally, since
the highest performance FPGAs are generally manufactured using the latest technology node (i.e.,
the smallest transistor sizes available) the integrated, hard processor core reaps the benefits of the
latest fabrication technology, whereas many external embedded processors are fabricated using an
older technology node.
A soft processor core is one in which the processor’s logic is implemented using the programmable resources of the FPGA. As a result, the processor uses more power, runs more slowly,
and consumes a significant portion of the FPGA’s programmable logic resources. This performance trade-off is the same as that when deciding between using an FPGA or a custom ASIC for a
circuit implementation (see Section 2.3.2). However, soft processor cores do have the advantage of
being highly configurable, meaning that many processor features can be eliminated or fine-tuned
in order to reduce FPGA resource requirements, whereas hard processor cores are fixed.
Fortunately, selection of an internal processor core is simplified by the fact that there are a
relatively small number available for FPGAs. Soft processor cores include the Xilinx MicroBlaze,
the Altera Nios II, and the ARM Cortex-M1. At the time of this writing, the only readily available
hard processor core in a high-performance FPGA is the PowerPC, available in some of the Xilinx
Virtex family of FPGAs.
Comparison of hard processor cores to soft processor cores is complicated by the fact that
soft processor cores are highly configurable. As a result, their performance, power consumption,
and resource requirements vary with the implementation features and parameters chosen. Such
features may include whether the processor supports multiply or divide instructions, how certain
instructions are implemented, and which bus interfaces to include.
To illustrate, let us compare the PowerPC 405 hard processor core in the Xilinx Virtex4 FX20 to the Xilinx MicroBlaze. The PowerPC 405 is a 32-bit RISC processor, with 64-bit data
and instruction bus interfaces, separate 16 kB data and instruction caches, and a performance of
700 DMIPS at the maximum operating frequency of 450 MHz. The MicroBlaze is also a 32-bit
RISC processor, with 32-bit bus interfaces, optional instruction and data caches, and a performance
as high as 184 DMIPS at the maximum operating frequency of 160 MHz on the Virtex-4. Clearly,
the PowerPC is capable of delivering much higher performance than the MicroBlaze even when
the MicroBlaze is configured to include all its performance optimizations.
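Part of the PowerPC's advantage comes from its higher clock rate and part from higher throughput per clock cycle. The following Python sketch separates the two using only the DMIPS and frequency figures quoted above. The dictionary layout and function name are illustrative assumptions.

```python
# Peak figures quoted in the text for the Virtex-4.
CORES = {
    "PowerPC 405 (hard)": {"dmips": 700.0, "mhz": 450.0},
    "MicroBlaze (soft)": {"dmips": 184.0, "mhz": 160.0},
}


def dmips_per_mhz(core):
    """Throughput normalized by clock frequency."""
    return core["dmips"] / core["mhz"]


ppc = CORES["PowerPC 405 (hard)"]
mb = CORES["MicroBlaze (soft)"]

peak_ratio = ppc["dmips"] / mb["dmips"]                # overall peak advantage
clock_ratio = ppc["mhz"] / mb["mhz"]                   # share due to clock rate
per_clock_ratio = dmips_per_mhz(ppc) / dmips_per_mhz(mb)  # share due to work per cycle

if __name__ == "__main__":
    for name, core in CORES.items():
        print(f"{name}: {dmips_per_mhz(core):.2f} DMIPS/MHz")
    print(f"Peak DMIPS ratio: {peak_ratio:.2f}x "
          f"(clock {clock_ratio:.2f}x, per-clock {per_clock_ratio:.2f}x)")
```

Under these figures, the PowerPC 405 delivers about 1.56 DMIPS/MHz against 1.15 DMIPS/MHz for the MicroBlaze, so most of the roughly 3.8× peak advantage is due to the higher operating frequency.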
The MicroBlaze requires a minimum of 525 slices, or about 131 CLBs (Configurable Logic
Blocks), on the Virtex-4 line of FPGAs. However, when configured with typical high-performance
features, such as 32-bit bus interfaces and 16 kB caches, one Xilinx reference design reports that the
MicroBlaze consumed 1,272 slices, or about 318 CLBs, and 16 BRAMs (Block Random Access
Memories) [119].
The resource requirements of the hard PowerPC core are a bit difficult to compare to CLB
and BRAM counts, since it does not use configurable logic. However, we can determine the relative
area consumed by the PowerPC block on the FPGA die. On the Virtex-4 FX, the PowerPC 405
block takes up an area of 24 rows by 7 columns plus 2 BRAM columns, where each BRAM is the
height of 4 rows. This is the area normally consumed by 24 · 7 = 168 CLBs plus 24/4 · 2 = 12
BRAMs. In other words, the high-performance MicroBlaze system mentioned above uses about
89% more CLBs and 33% more BRAM while delivering 74% lower performance (based on the
DMIPS rating) compared to the PowerPC 405.
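The percentages above follow directly from the figures already given. The short Python sketch below reproduces the arithmetic. The variable names are illustrative, and the value of four slices per CLB is inferred from the slice and CLB counts quoted in the text.

```python
# Figures quoted in the text.
SLICES_PER_CLB = 4                 # inferred: 1,272 slices is about 318 CLBs
MICROBLAZE_SLICES = 1272           # high-performance MicroBlaze reference design
MICROBLAZE_BRAMS = 16
MICROBLAZE_DMIPS = 184.0
POWERPC_DMIPS = 700.0

# Area of the PowerPC 405 block: 24 rows by 7 columns of CLBs,
# plus 2 BRAM columns where each BRAM is 4 rows tall.
PPC_ROWS, PPC_CLB_COLS, PPC_BRAM_COLS, ROWS_PER_BRAM = 24, 7, 2, 4

microblaze_clbs = MICROBLAZE_SLICES / SLICES_PER_CLB
powerpc_clbs = PPC_ROWS * PPC_CLB_COLS
powerpc_brams = (PPC_ROWS // ROWS_PER_BRAM) * PPC_BRAM_COLS

pct_more_clbs = 100.0 * (microblaze_clbs / powerpc_clbs - 1.0)
pct_more_brams = 100.0 * (MICROBLAZE_BRAMS / powerpc_brams - 1.0)
pct_lower_dmips = 100.0 * (1.0 - MICROBLAZE_DMIPS / POWERPC_DMIPS)

if __name__ == "__main__":
    print(f"MicroBlaze: {microblaze_clbs:.0f} CLBs, {MICROBLAZE_BRAMS} BRAMs")
    print(f"PowerPC:    {powerpc_clbs} CLBs, {powerpc_brams} BRAMs (equivalent area)")
    print(f"{pct_more_clbs:.0f}% more CLBs, {pct_more_brams:.0f}% more BRAM, "
          f"{pct_lower_dmips:.0f}% lower DMIPS")
```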
Cost is another important factor when choosing a CPU. The effective cost of the MicroBlaze and PowerPC processors will depend on the cost of the FPGA that implements them. The
cost of an FPGA is a function of the silicon area as well as market forces, both of which change
over time. In the Virtex-4 family of FPGAs, the PowerPC is only available on the FX series of
FPGAs whereas the MicroBlaze can be implemented just as easily on the less expensive LX series. Table D.1 shows the part costs as well as the effective CLB costs for nearly every FX and LX
series FPGA in similar logic capacities and packages for the -10 and -11 speed grades. Some parts
are not shown due to the lack of current pricing information, including -12 speed grade parts. The
source for all prices is www.em.avnet.com. All prices are USD for single unit quantities. The costs
of the SX series parts are not shown since they are more expensive than both the FX and LX series
FPGAs. These results show that, on average, each CLB on the LX series costs about 17% less than
on the FX series. Unfortunately, a 17% reduction in CLB cost is far too low to compensate for the
increase in resources required to implement a high performance MicroBlaze processor. Even the
worst case price difference between the FX60 and LX60 represents only about a 26% price reduction. In effect, the monetary cost of the resources required to implement a MicroBlaze system is
much higher than the cost of the PowerPC.
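The per-CLB costs in Table D.1 can be recomputed from the part prices and CLB array dimensions. The Python sketch below does this for the -10 speed grade. The final step, which prices the 318-CLB MicroBlaze on the LX average and the 168-CLB PowerPC area on the FX average, is an illustrative extrapolation that is not part of the original analysis, and it ignores BRAM cost.

```python
# Rows of Table D.1: (part, package, CLB rows, CLB columns, -10 speed grade price in USD).
VIRTEX4_PARTS = [
    ("FX12", "FF668C", 64, 24, 155.00),
    ("FX20", "FF672C", 64, 36, 247.50),
    ("FX40", "FF1152C", 96, 52, 545.00),
    ("FX60", "FF672C", 128, 52, 856.25),
    ("FX60", "FF1152C", 128, 52, 942.50),
    ("LX15", "FF668C", 64, 24, 136.00),
    ("LX25", "FF668C", 96, 28, 263.75),
    ("LX40", "FF1148C", 128, 36, 457.50),
    ("LX60", "FF668C", 128, 52, 635.00),
    ("LX60", "FF1148C", 128, 52, 698.75),
]


def cents_per_clb(rows, cols, price_usd):
    """Part price divided by the number of CLBs, in cents."""
    return 100.0 * price_usd / (rows * cols)


def family_average(prefix):
    """Mean cents-per-CLB over all parts whose name starts with `prefix`."""
    values = [cents_per_clb(rows, cols, price)
              for part, _, rows, cols, price in VIRTEX4_PARTS
              if part.startswith(prefix)]
    return sum(values) / len(values)


fx_avg = family_average("FX")
lx_avg = family_average("LX")
lx_saving_pct = 100.0 * (1.0 - lx_avg / fx_avg)

# Illustrative extrapolation (not in the source): price each CPU by its CLB footprint.
MICROBLAZE_CLBS = 318   # soft core, implemented on an LX part
POWERPC_CLBS = 168      # die area of the hard core on an FX part
microblaze_cost_usd = MICROBLAZE_CLBS * lx_avg / 100.0
powerpc_cost_usd = POWERPC_CLBS * fx_avg / 100.0

if __name__ == "__main__":
    print(f"FX average: {fx_avg:.2f} cents/CLB, LX average: {lx_avg:.2f} cents/CLB")
    print(f"LX CLBs cost about {lx_saving_pct:.0f}% less")
    print(f"MicroBlaze footprint: ${microblaze_cost_usd:.2f}, "
          f"PowerPC footprint: ${powerpc_cost_usd:.2f}")
```

Even with the cheaper LX fabric, the larger MicroBlaze footprint costs more under this simple estimate than the area occupied by the hard PowerPC block.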
Although the analysis has not been performed, it is reasonable to assume that the hard
PowerPC processor also consumes far less power than the soft MicroBlaze processor, even when
operating at higher frequencies.
The hard PowerPC 405 core delivers much higher performance than a high-performance
instantiation of the soft MicroBlaze core, while consuming far less FPGA area. Thus, it would
seem that a soft processor core is a far less efficient and more expensive choice when compared to a
hard processor core for any application that would require a relatively high-performance CPU.

Table D.1: Xilinx Virtex-4 LX and FX Cost Per CLB

Part  Package  CLB Rows  CLB Cols  -10 Price  -11 Price  -10 ¢/CLB  -11 ¢/CLB
FX12  FF668C         64        24    $155.00    $193.75      10.09      12.61
FX20  FF672C         64        36    $247.50    $310.00      10.74      13.45
FX40  FF1152C        96        52    $545.00    $682.50      10.92      13.67
FX60  FF672C        128        52    $856.25  $1,070.00      12.86      16.08
FX60  FF1152C       128        52    $942.50  $1,178.75      14.16      17.71
LX15  FF668C         64        24    $136.00    $170.00       8.85      11.07
LX25  FF668C         96        28    $263.75    $330.00       9.81      12.28
LX40  FF1148C       128        36    $457.50    $574.25       9.93      12.46
LX60  FF668C        128        52    $635.00    $793.75       9.54      11.93
LX60  FF1148C       128        52    $698.75    $873.75      10.50      13.13
FX Average:                                                  11.76      14.71
LX Average:                                                   9.73      12.17

Choosing the Processor
Considering all of the design constraints, the most important factor regarding processor
selection was minimizing the size and complexity of the Helios PCB, with power consumption
being the second most important criterion. Although performance was also important, it was not
deemed as important since we intended to use custom hardware on the FPGA to implement the
most computationally intensive processing.
Using an internal processor on the FPGA would clearly lead to the smallest and simplest
board design. A soft processor core would likely consume more power while giving lower performance than an external processor. However, we felt that a hard PowerPC core, with the minimum
number of required peripherals, would likely have power consumption similar to that of a typical external processor with similar performance. This conclusion was reached after considering
typical power consumption numbers reported in the data sheets of popular 32-bit processors and
comparing them to power estimates for the PowerPC hard processor core on the Virtex-4 FX line
of FPGAs. A detailed, quantitative comparison is unrealistic since the results will depend on the
number of unused peripherals and features in the external processor, and is therefore application
dependent. Additionally, isolating and calculating the power consumption of peripherals in an
FPGA is also quite difficult, particularly when the level of integration is such that it’s difficult to
tell where the CPU ends and the custom hardware begins.
In the final analysis, compared to typical external 32-bit CPUs, it seemed that the PowerPC
hard processor core would result in a much smaller and simpler board, while consuming a similar
amount of power and delivering similar performance. Although a much faster, external processor
could have been used, there was no clear need for additional CPU performance and it was not
clear what the ideal balance was between CPU performance and power consumption. Since the
PowerPC already represented a significant performance improvement over the 75 MHz MicroBlaze
that had been successfully used in our earlier research, it was decided that the on-FPGA, PowerPC
processor was the ideal general-purpose processor for Helios.

D.4.2 FPGA

The selection of the FPGA was perhaps the most important consideration in the design of
Helios. An FPGA was needed that would help keep power consumption low without sacrificing
performance, while providing sufficient capacity for the implementation of anticipated machine
vision algorithms. Fortunately, FPGA vendors have constantly been improving FPGA performance
and capacity, while simultaneously reducing power consumption.

Advancements in FPGA Technology
Particular emphasis has been placed on reducing power consumption in modern FPGAs,
since high power consumption has long been a common complaint, particularly when compared to
a custom ASIC design. Power consumption in a digital device can be divided into two categories,
static (or quiescent) power consumption and dynamic (or active) power consumption. Static power
is the power consumed when no logic switching is taking place and is dominated by the current
leaking through transistors. Dynamic power is the power consumed by the switching of logic (i.e.,
the power consumed by the charging and discharging of transistor gates). The total power is the
sum of static and dynamic power. Historically, static power consumption was considered to be
insignificant since it was so low in CMOS devices. However, over time, as process geometries
have gotten smaller and transistor leakage has increased, static power consumption has become
increasingly significant. In fact, once process geometries shrink below 90 nm, the static power
tends to become larger than the dynamic power, unless steps are taken to reduce transistor leakage.
This is sometimes referred to as the 90 nm inflection point.
To alleviate this problem, current FPGAs employ three oxide thicknesses in the fabrication
of transistors, as opposed to the standard two oxide thicknesses used in traditional processes [120].

A thin oxide layer leads to a transistor that is faster but has a higher gate leakage current. A thick
oxide layer leads to a transistor that is slower but has less gate leakage. To take advantage of this,
FPGAs will use a thin oxide layer on performance critical circuits, such as LUTs and flip flops,
and a medium oxide layer on non-performance critical circuits, such as configuration logic and
configuration memory cells. The thickest oxide layer is used in I/O circuits to allow for higher I/O
voltages.
To further reduce power, some FPGAs use a low-K dielectric material between metal layers
to reduce inter-routing-layer capacitance. This reduces the capacitive load, thus reducing power
and increasing performance. Some recent FPGAs, such as Altera’s Stratix III, have even used a programmable back-bias voltage to control the performance/power trade-off of the FPGA logic blocks,
allowing high performance and high power consumption on critical paths but low performance and
low power consumption on non-critical paths [121]. Combined with reduced geometries, these
features have allowed the creation of FPGAs that give a much higher performance/power ratio
than previous FPGAs.
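
The effects described in this subsection can be illustrated with a small numeric sketch. It assumes the standard first-order CMOS switching-power model (activity × C × V² × f), which is not stated in the text above, and every numeric value below (activity factor, capacitance, leakage current, voltages) is hypothetical and chosen only for illustration.

```python
def dynamic_power_w(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS switching power: activity * C * V^2 * f, in watts."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def static_power_w(leakage_a, voltage_v):
    """Static (quiescent) power drawn by leakage current, in watts."""
    return leakage_a * voltage_v


def total_power_w(activity, capacitance_f, voltage_v, frequency_hz, leakage_a):
    """Total power is the sum of the static and dynamic components."""
    return (static_power_w(leakage_a, voltage_v)
            + dynamic_power_w(activity, capacitance_f, voltage_v, frequency_hz))


# Hypothetical operating point (illustrative values only).
ACTIVITY = 0.1          # fraction of the capacitance switched per cycle
CAPACITANCE_F = 10e-9   # total switched capacitance in farads
FREQUENCY_HZ = 100e6    # clock frequency
LEAKAGE_A = 0.1         # leakage current in amperes

for voltage in (1.5, 1.2):
    p_dyn = dynamic_power_w(ACTIVITY, CAPACITANCE_F, voltage, FREQUENCY_HZ)
    p_tot = total_power_w(ACTIVITY, CAPACITANCE_F, voltage, FREQUENCY_HZ, LEAKAGE_A)
    print(f"V = {voltage} V: dynamic = {p_dyn:.3f} W, total = {p_tot:.3f} W")

# Lowering the supply voltage scales dynamic power quadratically.
ratio_v = (dynamic_power_w(ACTIVITY, CAPACITANCE_F, 1.2, FREQUENCY_HZ)
           / dynamic_power_w(ACTIVITY, CAPACITANCE_F, 1.5, FREQUENCY_HZ))
# Halving the switched capacitance (e.g., with a low-K dielectric) halves it.
ratio_c = (dynamic_power_w(ACTIVITY, CAPACITANCE_F / 2, 1.2, FREQUENCY_HZ)
           / dynamic_power_w(ACTIVITY, CAPACITANCE_F, 1.2, FREQUENCY_HZ))
print(f"voltage scaling: {ratio_v:.2f}, capacitance scaling: {ratio_c:.2f}")
```

Under this simplified model, a lower supply voltage and a lower capacitive load both reduce dynamic power, while the static term depends only on leakage and voltage.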

Selecting an FPGA Family
At the time Helios was being designed, Xilinx had recently introduced the Virtex-4 family
of FPGAs. Around the same time, Altera introduced the Stratix II family of FPGAs, their competing 90 nm, high-performance FPGAs. Both the Virtex-4 and Stratix II incorporated some of the
low-power features discussed previously and also benefited from the improved transistor density,
performance, and lower voltage of the 90 nm technology node. As a result, these FPGAs represented a significant improvement over the previous generation of 130 nm FPGAs with regard to
performance, power consumption, and logic capacity. Helios would benefit greatly from these recent
improvements in FPGA technology.
Due to a long-standing relationship with Xilinx and extensive familiarity with Xilinx tools,
the Virtex-4 line of FPGAs seemed to be the logical choice for Helios. The Virtex-4 family is
divided into three platforms: LX for general logic, SX for signal processing, and FX for embedded
processing. Each platform has a varying amount of BRAM, DSP slices, PowerPC 405 blocks, and
other resources. The Xilinx Virtex-4 FX was the only line of FPGAs being produced at 90 nm
with embedded hard processor cores. The FX platform also has a large number of BRAMs, DSP
blocks, and other resources useful for machine vision processing. Given the perceived benefits of
an embedded CPU (Section D.4.1), the Xilinx Virtex-4 FX line of FPGAs was selected for Helios.
At the time, the Virtex-4 FX was so new that the chips were not even available yet, but engineering
samples were expected to be available by the time the design for Helios was complete.

Selecting an FPGA Package and Capacity
The Virtex-4 FX is available in a wide variety of packages and capacities, as well as three
speed grades and two temperature grades. For Helios, it was important to choose a package that
was as small as possible, in order to reduce board size, but still had enough I/O pins for all memories, communication interfaces, and the expansion header. Additionally, the FPGA had to have
sufficient capacity for a variety of machine vision algorithms.
Table D.2 describes the quantity of features available in different Virtex-4 FX logic capacities [122]. The logic capacity is indicated by the number following FX. For example, the FX12 has
approximately 12,000 logic cells, where the number of logic cells roughly defines the amount of
programmable logic on the FPGA. Table D.3 describes the packages in which each FPGA capacity
is available along with the number of user I/O pins each package provides [123].

Table D.2: Xilinx Virtex-4 FX Model Features

Model                      FX12     FX20     FX40     FX60     FX100    FX140
Logic Cells                12,312   19,224   41,904   56,880   94,896   142,128
Slices                     5,472    8,544    18,624   25,280   42,176   63,168
BRAM Blocks                36       68       144      232      376      552
BRAM (kbit)                648      1,224    2,592    4,176    6,768    9,936
DSP Slices (Multiplier)    32       32       48       128      160      192
PowerPC Processors         1        1        2        2        2        2

One important factor that influenced the FPGA selection for Helios is the fact that Virtex-4 FX FPGAs in the same package are pin compatible. This meant that Helios was not restricted to
using a single FPGA. Instead, Helios could be designed to use any Virtex-4 FX FPGA in a given
package. Since different applications may require different amounts of FPGA logic, and even
different numbers of PowerPC processors, this capability was an ideal match for the flexibility
desired for Helios.

Table D.3: Xilinx Virtex-4 FX Packages

          User I/O Pins
Package   FX12   FX20   FX40   FX60   FX100   FX140   Size (mm)
SF363     240    -      -      -      -       -       17×17
FF668     320    -      -      -      -       -       27×27
FF672     -      320    352    352    -       -       27×27
FF1152    -      -      448    576    576     -       35×35
FF1517    -      -      -      -      768     768     40×40

In our previous research we had used the Xilinx Virtex-II 1000 FPGA, which had 5,120
slices, 40 multipliers, and 40 RAM blocks. This makes the logic capacity of the Virtex-II FPGA
roughly the same as that of the Virtex-4 FX12, the smallest FX chip available. Since our previous designs already pushed the limits of the Virtex-II 1000 FPGA, Helios would need to support
FPGAs at least as large as the FX20.
If Helios were to support the FX20, it would need to support the FF672 package, a fine-pitch, flip-chip, BGA package with a 27 mm footprint. As shown in Table D.3, the same package
also supports the FX40 and FX60 FPGAs, both of which are significantly larger than anything we
had used before.
The next available package is the FF1152, which has a 35 mm footprint, making its area
68% larger than the FF672 package. This larger package has more user I/O pins and allows the
larger FX100 FPGA to be used. However, a larger package has several disadvantages.
• A larger package would require that Helios be larger and heavier.
• The FF1152 has about 71% more balls that need to be soldered to the PCB. The more balls on the
BGA, the more layers are required for escape routing—the routing of signals on the PCB
from each FPGA ball to the nearest edge of the chip. The escape routing for the FF672
package can be routed on an 8-layer PCB, but the escape routing for the FF1152 would
require 10 or more layers. As shown in Figure D.10, moving from 8 to 10 layers represents
an expensive 51% increase in PCB cost.

• The larger package adds about 10% to the FPGA part cost, as shown in Table D.4.
• The FPGA and PCB are the two heaviest components of the board. A larger package and
additional PCB layers would add significantly to the weight of Helios.
• Although the FX100 has a 67% larger logic capacity than the FX60, it cost 130% more (see
Table D.4). It seemed unlikely that we would ever produce a Helios board with such a large
FPGA, particularly in light of the fact that the FX100 has a logic capacity that is ten times
that of the Virtex-II 1000 FPGA we had been using successfully.
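
The percentages quoted in the list above can be checked with a few lines of arithmetic. The sketch below uses the package sizes from Table D.3, the logic-cell counts from Table D.2, and the prices as read from Table D.4; it assumes that the number in each package name (672, 1152) equals its ball count. The helper name `percent_increase` is introduced here only for illustration.

```python
def percent_increase(old, new):
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Footprint area in mm^2 (27 mm vs. 35 mm square packages, Table D.3).
area_increase = percent_increase(27 * 27, 35 * 35)

# Ball counts, assuming the package name encodes the number of balls.
ball_increase = percent_increase(672, 1152)

# Price premium of FF1152 over FF672 for the same FPGA (Table D.4).
fx40_package_premium = percent_increase(495, 545)
fx60_package_premium = percent_increase(856, 943)

# FX100 versus FX60, both in FF1152: logic cells (Table D.2) and price (Table D.4).
logic_increase = percent_increase(56_880, 94_896)
price_increase = percent_increase(943, 2171)

results = {
    "Package area (FF672 -> FF1152)": area_increase,
    "Ball count (FF672 -> FF1152)": ball_increase,
    "FX40 package premium": fx40_package_premium,
    "FX60 package premium": fx60_package_premium,
    "Logic capacity (FX60 -> FX100)": logic_increase,
    "Price (FX60 -> FX100)": price_increase,
}
for label, value in results.items():
    print(f"{label}: +{value:.1f}%")
```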

Table D.4: Virtex-4 FX Pricing for -10 Speed Grade, Commercial
Temperature Grade (Source: em.avnet.com)

Package   FX12   FX20   FX40   FX60   FX100    FX140
SF363     $129   -      -      -      -        -
FF668     $155   -      -      -      -        -
FF672     -      $248   $495   $856   -        -
FF1152    -      -      $545   $943   $2,171   -
FF1517    -      -      -      -      $2,280   $4,880

Ultimately, the decision came down to whether or not the FF672 package had a sufficient
number of I/O pins to interface with all the devices we anticipated. The exact number of I/O pins
required could not be known until all parts for Helios had been chosen. However, with 320 I/O
pins available on the FX20 in the FF672 package, and given a rough idea of the components that
would make up Helios (Section D.3), it appeared likely that the FF672 package would allow us
to connect all anticipated devices while leaving a significant number of I/O pins available for the
expansion header. After final selection of the remaining components was completed, we easily
verified that, indeed, we could connect all components while leaving over 64 I/O pins for use by
the expansion header.
Based on the above facts, Helios was designed to use the Virtex-4 FX20, FX40, and FX60
FPGAs in the FF672 package. Additionally, Helios was designed to support -10, -11, and -12
FPGA speed grades as well as commercial and industrial temperature grades.

Non-volatile Configuration Memory
Since the configuration memory of most SRAM-based FPGAs, such as the Virtex-4 family,
is volatile, we must provide a way for the FPGA to configure itself upon power-up. Although
some FPGAs have non-volatile flash memory on-chip, most high-performance FPGAs, such as the
Virtex-4, do not. Instead, an external non-volatile memory must be used. The ability of Helios
to configure itself is critical since Helios is intended for use in applications, such as autonomous
vehicles, where having a programmer handy every time Helios is powered on is not practical.
Xilinx provides several configuration solutions for their FPGAs. One of the most versatile
is the System ACE CompactFlash solution [124], which allows for several configuration sources,
including a standard CompactFlash card. Unfortunately, the System ACE controller chip package
is 22 mm × 22 mm, and the CompactFlash card slot requires a board area of about 40 mm × 50 mm
(the CompactFlash card itself measures 36.4 mm × 42.8 mm). Combined, these components would
take up a significant area on the Helios board, making it much larger. Unfortunately, Xilinx does
not provide a smaller version of System ACE and System ACE does not support smaller flash card
types.
The next preferred configuration solution is the Xilinx Platform Flash [125]. These are
simple flash chips provided by Xilinx and designed specifically for configuring Xilinx FPGAs. As
a result they are very easy to use, support high-speed configuration, and connect directly to Xilinx
FPGAs with little or no external logic. Also, the Platform Flash is inexpensive, consumes very
little power when idle, supports the storage of multiple configurations (also called revisions), and
is available in a very small package (8 mm × 9 mm). This makes the Platform Flash the ideal
solution for Helios.
Another option would have been to design our own configuration solution using off-the-shelf, non-volatile memory and other components. However, this would have been significantly
more work and would have been unlikely to lead to a solution superior to the Platform Flash.

JTAG Interface
We must also have a way to program the Platform Flash as well as a way to directly program
and debug the FPGA. The preferred method in industry for doing this is through a JTAG (Joint Test
Action Group) connection, also known as IEEE standard 1149.1 [126]. JTAG is a simple serial
interface used to communicate with the integrated circuits in a system for the purpose of test,
debug, and configuration. Devices to be connected to a JTAG port are connected in a daisy chain.
A sample JTAG chain, like that used on Helios, is shown in Figure D.3.

Figure D.3: Example JTAG chain.

The order of the devices is not important. For the JTAG connection header, we decided
to use a standard 2 × 7, 2 mm pitch, right-angle, shrouded, and polarized header compatible with
Xilinx programming cables, such as the Parallel IV and Platform Cable USB. This is a relatively
small connector that provides sufficient signal integrity for configuration at high speeds up to
24 MHz, fast enough to completely reconfigure the FX60 FPGA in less than one second.
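
The sub-second reconfiguration claim can be sanity-checked with a rough estimate. The sketch below assumes that JTAG shifts one configuration bit per TCK cycle and ignores all protocol overhead. The FX60 bitstream length of roughly 21 Mbit is an assumption that is not given in the text and should be checked against the Xilinx configuration documentation.

```python
def jtag_config_time_s(bitstream_bits, tck_hz):
    """Lower bound on configuration time: one bit shifted per TCK cycle."""
    return bitstream_bits / tck_hz


# Assumed, approximate FX60 bitstream length (not stated in the text).
FX60_BITSTREAM_BITS = 21_000_000
# Maximum JTAG clock rate quoted in the text.
TCK_HZ = 24_000_000

config_time_s = jtag_config_time_s(FX60_BITSTREAM_BITS, TCK_HZ)
print(f"Estimated FX60 configuration time: {config_time_s:.3f} s")
```

Under these assumptions the estimate stays below one second, which is consistent with the figure quoted above.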

D.4.3 Random Access Memory

Random access memory, or RAM, would be required for both the FPGA and the general-purpose processor. Many machine vision applications require a memory buffer in which to store
image frames, and some may even require the storage of multiple consecutive frames of video.
The processor needs RAM for the storage of its executable programs and data.
The BRAMs integrated into the FPGA fabric can sometimes fulfill this need for both the
custom logic and processor. However, most FPGAs have only a relatively small amount of memory available in the form of BRAM. For example, the Virtex-4 FX60 FPGA has 232 BRAMs
(Table D.2), each 18 kbit in size—that’s 16 kbit (2 kB) plus 2 kbit for parity. This means that the
FX60 effectively has just over 232 · 2 kB, or 464 kB, of usable memory. The FX20 has only 68
BRAMs, giving it only 136 kB of RAM. To put this into perspective, a single 640×480, 24-bit,
color image requires 640 × 480 × 3 = 921,600 bytes, or 900 kB. A typical embedded Linux operating system configuration requires over 4 MB, although that can be reduced to a few hundred
kB through careful kernel configuration. It is possible to extract more memory from the FPGA by
using the parity memory for data or by using other memory features available in the FPGA, such
as distributed RAM, which uses LUTs in the FPGA to implement RAM. However, the amount of
available memory is still severely limited and using other FPGA features for memory results in
very inefficient memories. Clearly, more external memory is needed.
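
The BRAM capacity figures above follow from simple arithmetic, sketched below. The BRAM counts come from Table D.2 and the 16 kbit of data per block from the text; the function name and the use of 1 kB = 1024 bytes are illustrative assumptions.

```python
DATA_KBIT_PER_BRAM = 16  # each 18 kbit block holds 16 kbit of data plus 2 kbit of parity


def bram_data_kbytes(num_brams, data_kbit=DATA_KBIT_PER_BRAM):
    """Usable BRAM data capacity in kB, excluding the parity bits."""
    return num_brams * data_kbit / 8


fx60_kb = bram_data_kbytes(232)   # FX60 has 232 BRAMs
fx20_kb = bram_data_kbytes(68)    # FX20 has 68 BRAMs

# One 640x480 color frame at 3 bytes (24 bits) per pixel.
frame_bytes = 640 * 480 * 3
frame_kb = frame_bytes / 1024

print(f"FX60 BRAM data capacity: {fx60_kb:.0f} kB")
print(f"FX20 BRAM data capacity: {fx20_kb:.0f} kB")
print(f"One 24-bit VGA frame:    {frame_bytes} bytes = {frame_kb:.0f} kB")
print(f"Frame fits in FX60 BRAM: {frame_kb <= fx60_kb}")
```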
Based on our experience with embedded systems, we felt that the processor would need
between 16 MB and 64 MB of RAM. This would allow sufficient memory to run nearly any embedded OS (Operating System) or RTOS (Real-Time OS) while leaving plenty of memory for the
storage of image data. The amount of memory needed by the FPGA for machine vision processing
would vary with the application and depends on whether the application uses color or grayscale
images. However, many applications benefit from the ability to store several standard definition
video frames (i.e., VGA or 640×480 resolution images). Grayscale images typically use one byte
per pixel, while color images typically require two or three bytes per pixel. Therefore, a single
VGA video frame requires 300 kB for grayscale and 600 or 900 kB for color. To be able to buffer
several frames, a minimum of 1 MB would be required, while 8 MB would allow us to buffer 27
grayscale images or 13 color images at two bytes per pixel. Therefore, a memory in this range for image processing is
reasonable.
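
The frame-buffer sizing above can be reproduced with a short calculation. This is a minimal sketch that assumes 1 MB = 1024 × 1024 bytes and counts only whole frames; the function name is hypothetical.

```python
def frames_that_fit(buffer_mb, width, height, bytes_per_pixel):
    """Number of whole frames that fit in a buffer of `buffer_mb` megabytes."""
    frame_bytes = width * height * bytes_per_pixel
    return (buffer_mb * 1024 * 1024) // frame_bytes


# VGA frames in an 8 MB buffer at 1, 2, and 3 bytes per pixel.
frames_8mb = {bpp: frames_that_fit(8, 640, 480, bpp) for bpp in (1, 2, 3)}
# Grayscale VGA frames in a 1 MB buffer.
frames_1mb_gray = frames_that_fit(1, 640, 480, 1)

for bpp, count in frames_8mb.items():
    print(f"8 MB buffer, {bpp} byte(s) per pixel: {count} frames")
print(f"1 MB buffer, grayscale: {frames_1mb_gray} frames")
```

The count of 13 color frames corresponds to the two-byte-per-pixel format; at three bytes per pixel the same buffer holds fewer frames.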

Memory Characteristics
Many different kinds of memory exist that could have been used for Helios. Generally,
available memory can be divided into two categories: DRAM (Dynamic Random Access Memory)
and SRAM (Static Random Access Memory). We could use one or more DRAM chips, one or
more SRAM chips, or a combination of the two. Each of these memory types is also available
with several different architectures. Common DRAM architectures include SDRAM (Synchronous
DRAM), DDR SDRAM (Double Data Rate SDRAM), which clocks data on both the positive and
negative edge of the clock, and more recently DDR2 and DDR3 SDRAM, which are faster and
lower voltage versions of DDR SDRAM. Common SRAM architectures include asynchronous,
synchronous, ZBT (Zero Bus Turnaround), DDR (Double Data Rate), and QDR (Quad Data Rate).

DRAM and SRAM each have different advantages and disadvantages. Due to constant
technological advances, both types of memory can deliver very high data throughput. However,
DRAM is generally more power efficient, less expensive, and available in much larger capacities
than SRAM. On the other hand, SRAM tends to have lower latency, little or no turnaround time
between reads and writes, and has a much simpler interface, making it much easier to interface
with custom hardware. These trade-offs are summarized in Table D.5.

Table D.5: Embedded SRAM and DRAM Memory Chip Characteristics

              DRAM         SRAM
Throughput    High ✓       High ✓
Latency       High ✗       Low ✓
Power         Low ✓        High ✗
Turnaround    Long ✗       Short ✓
Price         Low ✓        High ✗
Capacity      High ✓       Low ✗
Interface     Complex ✗    Simple ✓

Based on Table D.5, it is not clear which memory is superior. Indeed, which memory
is the right choice depends on the purpose for which the memory will be used. Consider the
memory used for the processor. Keep in mind that the targeted processor (the PowerPC 405) has
data and instruction caches. Of the factors listed in Table D.5, the most important is probably
capacity for a given price. This factor has largely driven the wide usage of DRAM in general-purpose computers. The purpose of the processor’s caches is to hide the increased latency imposed
by the underlying memory. Therefore, high performance can still be maintained if there is good
data locality and the underlying memory has sufficient throughput to feed the cache. Modern
SDRAM architectures provide high throughput by supporting burst transfer modes. That is, once
the initial memory access latency has expired, each additional sequential memory address takes
only one additional cycle to be accessed up to some maximum burst length. Additionally, each
time a portion of the cache is filled or flushed, the cache always reads or writes memory in large
blocks (e.g., cachelines), ensuring that the memory system can take advantage of the burst feature.
Because of these factors, modern DRAM is a superior choice as a processor memory.

On the other hand, many image processing operations and machine vision algorithms have
more random memory access patterns that cannot always benefit from burst memory transactions.
With DRAM, accessing a memory location at one memory address followed by a memory location
at a slightly distant address imposes a significant delay between accesses. This latency is shown in
the SDRAM timing diagram of Figure D.4. In this figure, the initial access starts with the RAS, or
row address strobe. This is followed by the RCD, or RAS to CAS delay, which is typically one to
three cycles. This is then followed by the CAS, or column address strobe, which is then followed
by the CAS latency of one to three cycles. The actual delays for tRCD and tCAS depend on the speed
grade of the SDRAM and its operating clock frequency—the higher the clock rate of the SDRAM
the more cycles must be dedicated to these delays. For a typical embedded SDRAM operating near
its highest clock frequency, the total access latency is usually six clock cycles. This long latency
can represent a significant performance penalty for many custom hardware circuits on the FPGA,
and implementing a cache in the custom hardware to hide a DRAM’s latency would be nontrivial.
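
The cost of this latency for short transfers can be seen in a simplified model. The sketch below assumes that the first word of a burst becomes available after the six-cycle access latency quoted above and that each further sequential word takes one more cycle; refresh, precharge, and bank effects are ignored, and the function names are illustrative.

```python
def burst_cycles(words, access_latency=6):
    """Cycles to transfer `words` sequential words in a single burst.

    Simplified model: the first word completes after `access_latency`
    cycles and every additional word takes one more cycle.
    """
    return access_latency + (words - 1)


def bus_efficiency(words, access_latency=6):
    """Fraction of cycles in which a data word is actually transferred."""
    return words / burst_cycles(words, access_latency)


for words in (1, 2, 4, 8):
    print(f"burst of {words}: {burst_cycles(words)} cycles, "
          f"efficiency {bus_efficiency(words):.1%}")
```

In this model a single isolated access uses only one cycle in six for data, while longer bursts amortize the latency over more words.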
Fortunately, SRAM typically does not suffer from these high latencies. The timing diagram
for a ZBT SRAM is shown in Figure D.5. Note that there is no turnaround time between reads
and writes, and the addresses for each access need not be sequential. The pipeline delay for a ZBT
SRAM depends on the SRAM model, its configuration, and its operating clock frequency, but is
either one cycle or two. Additionally, the interface to such SRAM is very simple, meaning that
a complex memory controller is not necessary to use the memory. This is unlike DRAM, which
requires a fairly complex memory controller to ensure proper operation. Based on these factors,
SRAM is generally a superior choice as the memory for use by custom hardware for machine
vision processing.
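
A rough comparison of the two memory types for non-sequential access is sketched below. It makes the pessimistic assumption that every SDRAM access to a new address pays the full six-cycle latency with no overlap between accesses, and it assumes that the ZBT SRAM accepts one new access per cycle after its pipeline delay. Both functions are illustrative models and do not describe any specific device.

```python
def sdram_random_access_cycles(n_accesses, access_latency=6):
    """Cycles for n isolated SDRAM accesses, each paying the full latency."""
    return n_accesses * access_latency


def zbt_random_access_cycles(n_accesses, pipeline_delay=2):
    """Cycles for n ZBT SRAM accesses issued back to back.

    One access is issued per cycle; the pipeline delay (one or two
    cycles) is paid only once, while the pipeline fills.
    """
    return n_accesses + pipeline_delay


N = 1000
sdram_cycles = sdram_random_access_cycles(N)
zbt_cycles = zbt_random_access_cycles(N)
print(f"SDRAM:    {sdram_cycles} cycles for {N} random accesses")
print(f"ZBT SRAM: {zbt_cycles} cycles for {N} random accesses")
print(f"Ratio:    {sdram_cycles / zbt_cycles:.1f}x")
```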

 

















 






Figure D.4: SDRAM read or write timing diagram [127]. The COMMAND bus in the
figure encapsulates all the SDRAM signals relevant to requesting a transaction, such as
CAS, RAS, WE, ADDR, and so forth.




 






 

 

 







 

 

  



 

 





Figure D.5: Pipelined ZBT SRAM timing diagram [128]. The COMMAND bus in the
figure encapsulates all the SRAM signals relevant to requesting a transaction, such as
WE, OE, ADDR, and so forth.

Let us again consider the three options for Helios’ memory system: DRAM memory,
SRAM memory, or a combination of the two. If we use only DRAM chips, many applications
will take a performance hit imposed by the increased memory latency on the custom image processing hardware. If we use purely SRAM, then the programs that run on the general-purpose
processor have a very limited memory capacity and Helios must suffer from the increased power
consumption of using only SRAM. The third option is to have both SRAM and DRAM. This solution gives the best of both alternatives, providing DRAM for the general-purpose processor and
SRAM for use by custom image processing hardware on the FPGA. The disadvantage of this solution is that the bandwidth of each memory type is effectively cut in half. This is because the target
size and PCB technology of Helios only affords enough board area for two memory chips. If we
were to use only one type of memory then Helios could support two chips of the same memory
type, effectively doubling the bandwidth of that memory. Ultimately, using two different types of
memory was the best choice for Helios, since it provided the greatest amount of flexibility.

Choosing a DRAM Chip
At the time that Helios was being designed, there were essentially three types of high-performance DRAM available on the market. These included SDR (Single Data Rate) SDRAM,
DDR SDRAM, and DDR2 SDRAM, with DDR2 being the most advanced and highest performing
DRAM available. DDR3 SDRAM was not yet available.
Unfortunately, the hardware IP modules necessary to use DDR2 on Xilinx FPGAs were
not yet available. As a result we would be required to design our own memory interfaces in order
to use DDR2 memories on Helios, a non-trivial design task. Additionally, the DRAM would be
used primarily as the memory for the general-purpose processor, which did not need the higher
performance of the latest, most technologically advanced memory.
The next choice on the performance scale was DDR SDRAM. DDR memory also had
several disadvantages. First, standard DDR memory is only available in x4, x8, and x16 bus widths.
In other words, standard DDR memory assumes that multiple memory chips will be combined in
order to form wider buses. Unfortunately, this was greatly at odds with our need to keep Helios
small and power consumption low, so it was important to limit Helios to a single DRAM chip.
Second, DDR SDRAM is known for its high power consumption. This is due to its higher
toggling rates (DDR data lines toggle twice per clock cycle), higher internal data rates, higher
2.5 V operating voltage, and SSTL termination. Most DRAM vendors do provide what are called
mobile versions of DRAM, which are designed for lower power operation through the use of power
saving features and lower operating voltages. Unfortunately, mobile DDR SDRAM was not yet
available.
Third, the board design for DDR SDRAM is much more complicated than SDR SDRAM.
This is largely due to the higher frequency of operation imposed by the double data rate. In order
to ensure proper operation, the characteristic impedance of each PCB trace for the memory must
be carefully controlled and the length of all traces must be the same. Trace impedance is relatively
easy to control just by controlling the trace widths. Trace length matching is typically
supported through the PCB design tool. However, the version of OrCAD being used for the design
of Helios did not support trace length matching, which meant that all trace lengths would have
to be measured and adjusted manually. Additionally, the SSTL termination expected for DDR
signals requires an additional, specially designed power supply operating at VTT = 0.5 × VDDQ,
where VDDQ is the I/O power supply voltage of the memory.
The third DRAM option was to use SDR SDRAM. Fortunately, standard SDRAM was
readily available in low-voltage (1.8 V) mobile versions which operated at the same data rates as
standard SDRAM. Also, mobile SDRAM, due to its typical use in compact devices, is available
in x32 data widths. This gives a single x32 chip the same bandwidth as two x16 SDRAM chips,
allowing board area to be reduced.
These constraints left us with two DRAM options: a x16 DDR SDRAM with an operating
clock frequency up to 200 MHz (with two data samples per clock cycle) or a x32 SDR SDRAM
with an operating frequency up to 133 MHz (with one data sample per clock). Table D.6 compares
the characteristics of the two memory options available for our use.

Table D.6: Mobile SDR SDRAM and DDR SDRAM Comparison [127], [129]

                     Mobile SDRAM    DDR SDRAM
Bus Width            x32             x16
Maximum Frequency    133 MHz         200 MHz
Operating Voltage    1.8 V           2.5 V
Peak Throughput      4.26 Gbps ✗     6.40 Gbps ✓
Maximum Capacity     512 Mb ✗        1 Gb ✓
Package Size         8×13 mm ✓       8×14 mm ✗
Burst Read Power     216 mW ✓        650 mW ✗
Idle Power           45 mW ✓         175 mW ✗
Termination Power    Low ✓           High ✗

Based on the comparison in Table D.6, the main advantage to using DDR SDRAM is the
higher peak data rates, assuming both chips are operating at their highest operating frequencies. If
they are operated at the same frequency, then the actual throughput is the same, since the mobile
SDRAM is twice as wide but the DDR SDRAM inputs/outputs data twice per clock. The potentially higher performance of DDR SDRAM comes at a very high cost in the form of more than
three times higher power consumption. The termination power consumption is particularly important. Due to the lower clock speeds, SDRAM typically uses series termination, whereas DDR is
designed to use SSTL2 termination (2.5 V Stub Series Termination Logic) as defined by JEDEC
standard JESD8-9B [130]. These two termination methods are shown in Figure D.6.
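
The throughput and power figures in Table D.6 can be related with a short calculation. The sketch below computes peak throughput as bus width × clock rate × transfers per clock; the function name is illustrative, and the power ratios simply divide the values listed in the table.

```python
def peak_throughput_gbps(bus_width_bits, clock_mhz, transfers_per_clock):
    """Peak data rate in Gbps for a given bus width and clock rate."""
    return bus_width_bits * clock_mhz * transfers_per_clock / 1000.0


sdr_gbps = peak_throughput_gbps(32, 133, 1)   # x32 mobile SDR SDRAM
ddr_gbps = peak_throughput_gbps(16, 200, 2)   # x16 DDR SDRAM

# At the same clock frequency the two options have identical throughput.
sdr_same_clock = peak_throughput_gbps(32, 133, 1)
ddr_same_clock = peak_throughput_gbps(16, 133, 2)

# Peak throughput given up by choosing the mobile SDRAM.
reduction_percent = 100.0 * (1.0 - sdr_gbps / ddr_gbps)

# Power ratios (DDR relative to mobile SDRAM) from Table D.6.
burst_power_ratio = 650 / 216
idle_power_ratio = 175 / 45

print(f"Mobile SDRAM peak: {sdr_gbps:.2f} Gbps")
print(f"DDR SDRAM peak:    {ddr_gbps:.2f} Gbps")
print(f"Throughput reduction with mobile SDRAM: {reduction_percent:.1f}%")
print(f"Burst read power ratio: {burst_power_ratio:.2f}x")
print(f"Idle power ratio:       {idle_power_ratio:.2f}x")
```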
Series termination is very simple and has the advantage that power is only consumed
during signal transitions, with essentially no power being consumed during steady state. On the
other hand, SSTL consumes power during both signal transitions and steady state. For example,
consider Figure D.6(b) when the driver is driving a high (2.5 V) or low (0 V) voltage signal. In
both cases, the steady state power consumption is
\[
P = \frac{V^2}{R} = \frac{(1.25\,\text{V})^2}{50\,\Omega + 25\,\Omega} = 20.8\,\text{mW}. \tag{D.1}
\]




 
 

Figure D.6: Memory termination techniques for typical unidirectional 50 Ω traces. (a) Source Series Termination for SDR. (b) SSTL2 Class I Termination for 2.5 V DDR.

Taking into account that a 256 Mb, x16 DDR SDRAM has about 40 high-speed signals requiring
such termination, this termination technique leads to a significant amount of power consumption.
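
To give a sense of scale, the per-signal result of Equation D.1 can be multiplied by the approximate signal count. The sketch below treats the total as an upper bound: it assumes that all 40 signals are held at a steady high or low level at the same time, which is a simplification, and the function name is hypothetical.

```python
def sstl2_steady_state_power_w(voltage_across_v=1.25, resistance_ohms=50.0 + 25.0):
    """Steady-state power in one SSTL2 terminated signal, as in Equation D.1."""
    return voltage_across_v ** 2 / resistance_ohms


NUM_TERMINATED_SIGNALS = 40  # approximate count for a 256 Mb, x16 DDR SDRAM

per_signal_w = sstl2_steady_state_power_w()
total_w = NUM_TERMINATED_SIGNALS * per_signal_w

print(f"Per signal:  {per_signal_w * 1000:.1f} mW")
print(f"{NUM_TERMINATED_SIGNALS} signals:  {total_w * 1000:.0f} mW")
```

Under this assumption the termination network alone could dissipate several hundred milliwatts, which is larger than the burst read power of the mobile SDRAM listed in Table D.6.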
DDR SDRAM also allows for higher capacities than mobile SDRAM, but the maximum
1 Gb (128 MB) memory capacity was beyond the 16 MB to 64 MB range we had established for
Helios (Section D.4.3).
In order to minimize the size and power consumption of Helios, we decided to use the x32
mobile SDRAM memory, leading to about a 33% reduction in peak throughput but allowing for
significant power savings and a much simpler board design.

Choosing an SRAM Chip
Choosing an SRAM chip involved many of the same trade-offs as the selection of DRAM.
Common SRAM types include asynchronous, synchronous, ZBT (Zero Bus Turnaround), DDR
(Double Data Rate), and QDR (Quad Data Rate). Additionally, updated versions of DDR and
QDR are available in the form of DDR-II and QDR-II, which operate at higher speeds and use a
lower core voltage (1.8 V instead of 2.5 V). Note that ZBT is a registered trademark of IDT, Inc.,
which originally developed the technology. Other vendors use different names to describe their
ZBT-compatible memory. For example, ZBT is also called NoBL (No Bus Latency) by Cypress,
NBT (No Bus Turnaround) by GSI, No-Wait by ISSI, ZBL (Zero Bus Latency) by Hitachi, NtRAM
(No Turnaround RAM) by Samsung, and ZEROSB (ZERO Synchronous Burst) by NEC.
ZBT is a synchronous memory that can switch immediately from reading to writing, or vice
versa, without inserting any delay cycles. ZBT devices are available in two forms, called pipelined
and flow-through. Pipelined devices impose a two-cycle delay between when a read/write operation
is requested and when the data are actually read/written, thus adding some latency. Flow-through
devices have a single cycle delay. For example, a read/write can be requested on one clock edge
and the data can be read/written on the following clock edge. Flow-through behavior is available
at the expense of lower operating frequencies. Both pipelined and flow-through devices allow
for very high throughput since they both allow for 100% memory bus utilization to be achieved,
which is not possible with standard synchronous SRAM (see Figure D.5 for an example of 100%
bus utilization over several read/write cycles). Additionally, the simplicity of their interfaces makes
them very easy to use with little interface logic.
DDR SRAM can input or output two data words per clock cycle by using both the rising
and the falling edge of the clock, just like DDR SDRAM discussed previously. They also typically
support two or four-word bursts, minimizing the number of read/write requests that need to be
made to use the device. DDR SRAM is also available in two fundamental forms, called CIO
(Common I/O) and SIO (Separate I/O). CIO DDR SRAM imposes a bus turnaround delay between
reads and writes and is therefore optimized for applications where the memory bus operates in one
direction for long periods of time. To eliminate this turnaround penalty, SIO devices have separate
input and output ports. However, only one read or write request can be in progress at any one time.
QDR SRAM is also dual-ported, with separate and dedicated read and write ports. Unlike
with SIO DDR, QDR allows for read and write operations to be in progress at the same time. The
dual ports also eliminate any turnaround time. This allows for both the read and write ports to
achieve 100% bus utilization, providing the highest possible performance.
The increased performance of DDR and QDR SRAM memories comes at the high cost
of greatly increased power consumption. Additionally, SIO DDR and QDR SRAM devices require a much larger number of I/O pins to support them, due to the separate read and write ports.
This is an important factor in the high power consumption and requires the use of a larger FPGA
package to support the increased number of I/O pins. DDR and QDR memories also employ the
HSTL (High-Speed Transceiver Logic) communication standard, which uses a parallel termination
technique, similar to SSTL, and consumes more power than series termination techniques.
In order to keep power consumption, cost, and complexity low, we selected a high-speed,
low-voltage, x36, ZBT SRAM for Helios with a maximum operating frequency of 250 MHz and
capacities ranging from 9 Mb to 72 Mb [128].
This SRAM has a sustainable throughput of 8.0 Gbps, much higher than both the SDR
and DDR SDRAMs we considered. This is equivalent to the transfer of 3,255 grayscale VGA
images per second to or from memory. The low 2.5 V core voltage also gives it much lower power
consumption than traditional 3.3 V SRAM chips. These characteristics make it an ideal choice as
an image processing memory for custom hardware.
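The throughput figures above can be reproduced with a short calculation. The sketch below assumes that 32 of the 36 data pins carry payload (with the remaining 4 used as parity bits, as is common for x36 parts), that one word is transferred per clock at 250 MHz, and that a grayscale VGA frame is 640 × 480 pixels at 8 bits per pixel. None of these assumptions is stated explicitly in the text.

```python
# Back-of-the-envelope check of the ZBT SRAM throughput quoted above.
# Assumptions (not stated in the text): 32 payload bits per x36 word,
# one word per clock, and 640 x 480 x 8-bit grayscale VGA frames.

CLOCK_HZ = 250e6
PAYLOAD_BITS_PER_WORD = 32
VGA_FRAME_BITS = 640 * 480 * 8


def sram_throughput_bps(clock_hz=CLOCK_HZ, bits_per_word=PAYLOAD_BITS_PER_WORD):
    """Sustained throughput in bits per second at one word per clock."""
    return clock_hz * bits_per_word


def frames_per_second(throughput_bps, frame_bits=VGA_FRAME_BITS):
    """Number of whole-frame transfers per second at the given throughput."""
    return throughput_bps / frame_bits


throughput = sram_throughput_bps()
print(f"Throughput: {throughput / 1e9:.1f} Gbps")
print(f"VGA frames per second: {int(frames_per_second(throughput)):,}")
```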

RAM Power Supply
Although both the 1.8 V mobile SDRAM and the 2.5 V SRAM represent excellent compromises between performance and power consumption, the use of both requires two different power
supplies. Efficient power supplies require a lot of board area, and efficiently distributing the multiple voltages on an area-constrained PCB while maintaining signal integrity is a significant design
challenge. Additionally, switching power supplies tend to be much more efficient under a moderate current load. Thus, by using devices that share a common power supply, we can increase the
overall efficiency of the power distribution system while reducing board complexity. The design
of the power supplies will be discussed in detail in Section D.4.9.
In order to consolidate the number of voltages required on the board, it was decided to use
the 2.5 V version of the mobile SDRAM chip. This chip consumes more power than the 1.8 V
chip due to the increase in voltage. However, it is still significantly more power efficient than the
2.5 V DDR SDRAM chip and allows us to consolidate the RAM power supply into a single, more
efficient voltage regulator.

D.4.4

Non-volatile Storage
A non-volatile memory device is required in order to be able to boot load the CPU’s operating
system and programs. Several solutions exist, including NOR flash memory chips, NAND
flash memory chips, and flash memory cards.
NOR flash memory chips are available in x8 and x16 data widths and have a read interface
that is compatible with standard, low-latency, asynchronous ROM (Read-Only Memory). This
makes them compatible with standard memory interfaces and very easy to use. The write interface, although different from standard RAM, does not require any special hardware to support.
Instead, writes are performed by first writing specific commands to the flash, in order to change
the mode of the device, followed by the data to be written. Additionally, before writes can take
place, the block of flash memory to be written must have been previously erased by writing all
ones. Most NOR flash memories are CFI (Common Flash Interface) compatible, an open interface standard approved by JEDEC [131]. This standard defines the commands used for unlocking,
erasing, and writing NOR memories and allows flash memories from different vendors to be used
interchangeably.
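The erase-before-write requirement described above can be illustrated with a toy model. The sketch below is not the CFI command set or any vendor's command sequence; the class and method names are hypothetical. It only models the behavior that an erase sets every bit to one and that programming can only change bits from one to zero.

```python
# Toy model of NOR flash write semantics (illustrative only; it does not
# implement the CFI command set or any vendor's command sequence).

class NorFlashBlock:
    """One erase block: erase sets every bit to 1, programming only clears bits."""

    def __init__(self, size_bytes=16):
        self.data = bytearray([0xFF] * size_bytes)  # erased state is all ones

    def erase(self):
        """Return the whole block to the erased (all ones) state."""
        for i in range(len(self.data)):
            self.data[i] = 0xFF

    def program(self, offset, value):
        """Program one byte; returns False if an erase was needed first."""
        # Programming can only change bits from 1 to 0, so the stored
        # result is the AND of the old contents and the new value.
        self.data[offset] &= value
        return self.data[offset] == value


block = NorFlashBlock()
print(block.program(0, 0x5A))   # erased byte accepts any value
print(block.program(0, 0xA5))   # fails: 0x5A & 0xA5 leaves 0x00
block.erase()
print(block.program(0, 0xA5))   # succeeds once the block is erased again
```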
NAND flash memory chips, as well as flash memory cards which generally use NAND
flash, typically have a more complex interface that is not compatible with standard memory interfaces. Despite the more complex interface, NAND flash memory has several advantages over
NOR flash. First, NAND flash memory chips are available in much larger capacities. NOR flash
is typically not available in densities higher than 512 Mb whereas current NAND flash chips can
have densities as high as 64 Gb, or 128 times the capacity of NOR. These densities are constantly
being improved through the advancements in technology. NAND flash is also designed for high
throughput, streaming data transfers. These features make them ideal matches for many embedded products, such as mobile music players, video players, still cameras, video cameras, and other
hand-held devices. These characteristics come at the cost of much higher latencies. The access
time for a high-density NAND flash memory is usually tens of microseconds, compared to around
a hundred nanoseconds for NOR flash memory. This makes NAND flash memory unsuitable as a
fast, random access memory, such as a read-only program or data memory.
Perhaps the most important factor for the design of Helios was compatibility. NAND flash
requires a complex controller to obtain the highest performance, whereas NOR flash can use a
standard memory controller. Additionally, the Xilinx Embedded Development Kit provides tools
and software for programming and boot loading from NOR flash memories. No such tool support
was available for NAND flash memories. Since the non-volatile memory on Helios was intended
primarily as a program memory for boot loading, NOR was the natural choice for Helios.
We selected a standard NOR flash in a 64-ball BGA package. This package allows us to
use a wide variety of x16, pin compatible NOR flash memories, including the Intel StrataFlash
P30/P33, Intel J3 v. D, Micron Q-flash, and other compatible devices from various vendors. Available capacities range from 4 MB to 64 MB with access times typically just over 100 ns, depending
on the chip density.
Some applications may benefit from a high density NAND flash memory, for storing video
or image samples. However, many applications do not require such support. As a result, a high
density NAND flash device is best added as an application-specific daughter board component.

D.4.5

Communications Interfaces
In addition to the JTAG port, Helios needed to have some standard communications interfaces
to allow it to communicate with a standard PC for debugging and user-defined purposes.
Such ports would allow us to send and receive commands or download data, such as processed
images. At a minimum, we felt that Helios needed a standard serial port. The simplicity of the
standard UART (Universal Asynchronous Receiver/Transmitter) meant that we could get Helios
up and running quickly with a minimum of infrastructure and development. The UART also serves
as a communications backup during debugging, when more complex communications interfaces
may be having their own problems.
In addition to the serial port, Helios needed to have a high-speed interface to allow the
download of images, video, and other voluminous data at or near real-time speeds. This kind of
capability would prove invaluable for the development and debugging of real-time video algorithms. The most common interfaces available for this purpose were Ethernet, USB (Universal
Serial Bus), and FireWire.
Ethernet had the advantage that it would allow Helios to take full advantage of the Linux
operating system. Additionally, the Virtex-4 FX FPGA has built-in Ethernet MACs (Media Access
Controllers), reducing the amount of external hardware required to support Ethernet. Although
this seemed very convenient, Ethernet is not required for Linux, if we decided to use it, and the IP
modules needed to support the built-in Ethernet MACs were not yet available.
Ethernet has several other disadvantages as well. First, in order to support TCP/IP and
UDP/IP, the standard network protocols used over Ethernet, a complex communications stack is
required. This protocol stack is handled through CPU-intensive software and OS support, which
drains a lot of processing performance from our CPU. Second, Ethernet is not as embedded-friendly as other interfaces, such as USB, in that it requires a large connector and is not as power
efficient. Third, the IP necessary to connect Ethernet to the FPGA was not free at the time Helios
was being designed. Finally, gigabit Ethernet and accompanying IP were not yet widely available, so Helios would be restricted to 100 Mbps Ethernet. This would make the Ethernet option
much slower than USB 2.0 at 480 Mbps or FireWire at 400 Mbps. The 100 Mbps limitation of
Ethernet is significant, since 16-bit color VGA video at 30 fps requires a bandwidth of 147 Mbps,
making real-time data transfer over Ethernet impossible for full resolution color video.
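The bandwidth comparison above is simple arithmetic, sketched below. The calculation assumes uncompressed 640 × 480 video and compares it against the nominal link rates quoted in the text. Protocol overhead is ignored, so the usable rate of each link would be lower in practice.

```python
# Raw bandwidth of uncompressed video versus the nominal link rates quoted above.
# Protocol overhead is ignored, so real usable rates would be lower.

def video_bandwidth_bps(width, height, bits_per_pixel, fps):
    """Bits per second needed to stream uncompressed video."""
    return width * height * bits_per_pixel * fps


LINK_RATES_MBPS = {"Ethernet (100 Mbps)": 100, "FireWire": 400, "USB 2.0": 480}

required_mbps = video_bandwidth_bps(640, 480, 16, 30) / 1e6
print(f"16-bit VGA at 30 fps: {required_mbps:.1f} Mbps")
for name, rate in LINK_RATES_MBPS.items():
    verdict = "sufficient" if rate >= required_mbps else "insufficient"
    print(f"{name}: {verdict}")
```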
The remaining choice between USB and FireWire was influenced most by availability. USB
had achieved market dominance and spare USB ports were available on virtually every personal
computer. Additionally, embedded chips for supporting USB were also widely available and easy
to use. FireWire, on the other hand, would generally require a FireWire card or adapter to be
installed in the PC in order to connect Helios. As a result, we selected USB for Helios.

Universal Serial Bus
For the USB 2.0 interface, also known as high-speed USB, we wanted to choose a device
that was low-power, small, and made our job as simple as possible. We were already familiar with
the family of EZ-USB chips from Cypress, since they could be found on our USB cameras as well
as Xilinx development boards. This family of USB devices makes it very easy to implement a
low-power, high-speed, USB 2.0 interface, without sacrificing performance. The EZ-USB device
integrates a USB PHY (physical interface) and additional hardware to handle all of the USB protocol and convert the USB data to a simple, high-speed interface, such as a bidirectional FIFO. The
EZ-USB device’s configuration can be automatically loaded upon power-up from a small EEPROM (Electrically Erasable Programmable Read-Only Memory).

Each USB chip is usually designed to be either a host or a peripheral. The host is essentially
the sole master on the bus, responsible for managing the devices connected to the bus. Peripherals act as slaves to the host. Since we wanted Helios to communicate with a PC, which always
uses a USB host device, the USB device on Helios must necessarily be a peripheral. As a USB
peripheral, the Helios USB interface is incapable of connecting to other peripheral devices, such
as keyboards, cameras, or USB flash drives. As a result, the interfaces for such devices must be
added as application-specific additions on a daughter board.
One of the advantages of USB is its relatively low power consumption. For example, the
CY7C68014A [132] USB controller selected has a peak power consumption of 281 mW. However,
because the USB connector is powered, the USB device can actually be powered from the bus
rather than consuming power from Helios. This was the approach taken on Helios. As a result,
the USB device consumes no power when it is not in use. This capability required special design
care to ensure that the USB device powered up and down correctly as the USB cable was inserted
and removed, without being affected by stray currents coming from the Helios power supply and
without adversely affecting the FPGA to which the USB device is connected.
Another advantage of USB is its small size. The EZ-USB chip is available in a 56-lead,
8 mm × 8 mm, QFN package, which is ideal for Helios. There is also a miniaturized version of the
USB 2.0 receptacle connector, called Mini-B. This same receptacle is commonly used on mobile
phones and PDAs due to its small size.
In order to function, the USB device also needs a 24 MHz clock source. This is the clock
from which the USB chip generates the 480 MHz clock needed for USB 2.0 as well as the clock
used for the chip’s x16 data interface and internal logic. Helios also connects this clock to the
FPGA so that it can be used as an alternative clock source to the main clock driving the FPGA.

Serial Port
The serial port is supported on Helios through an RS-232 compatible line driver and receiver, such as the Texas Instruments SN65C3221, which converts from 3.3 V or 5 V UART
signaling to RS-232 compatible logic levels. These devices are very inexpensive, very low-power,
and small. Instead of using a standard DB-9 connector on Helios, which would be far too large,
we decided to use a very small, 5-pin, 1.25 mm pitch connector. A simple, custom-built dongle is
used to convert the small 5-pin plug to a standard DB-9 connector.
A standard serial port on a PC typically supports signal rates up to 115.2 kbps. However,
typical RS-232 line drivers can support much higher data rates, up to 1 Mbps. On the PC side,
many USB-based serial port adapters support these higher data rates. Thus, the serial port on
Helios is capable of much higher data rates than a standard PC serial port. In order to support the
higher data rates, the 5-pin serial port on Helios has three ground wires, isolating the TX and RX
data lines. This improves the signal integrity of the cable to allow for more reliable transmission
at higher data rates.

D.4.6

Expansion Connector
The expansion header would allow the base functionality of Helios to be extended through
custom daughter boards. The expansion connector would need to have enough contacts to support
the devices we anticipated connecting to Helios on the daughter board. A typical Helios application
might involve two cameras (e.g., for stereo vision), a wireless transceiver, a digital compass and/or
GPS, two or more servos, and a few general-purpose I/O pins. Given this number of devices,
and the number of I/O signals each device requires, we felt 64 I/O signals was sufficient for the
expansion header on Helios.
In addition to the I/O signals, the header would also need to deliver power and ground
lines to the daughter board. In order to provide good signal integrity across the header, a standard
approach is used that staggers the data lines between power and ground lines so that every I/O
signal on the header runs adjacent to a ground line. This reduces the inductance and cross-sectional
area of the current loops generated by signals propagating across the header (see Section D.5.1).
This improves signal integrity, reduces generated electromagnetic noise, and makes the signals less
susceptible to interference.
As a result of the additional power and ground lines, the expansion header would need
twice as many contacts as I/O signals. In order to fit this many contacts on a single header, the
pitch between contacts would need to be quite small. Such high-density headers are typically
available with contact pitches between 0.5 mm and 1.0 mm and contact counts that are a multiple
of 20 (e.g., 80, 100, or 120). Given the number of I/O signals required for a Helios daughter board, at
least a 120-contact header would be required.
Despite such connectors being relatively common, we found it difficult to find a header
that was readily available in the small quantities needed for Helios. As a result, the number of
options available to us was actually quite small. This restricted availability led us to select the
AMP/Tyco, 0.8 mm pitch, 120-contact, Free Height connectors. The 0.8 mm contact pitch is small
enough to keep the overall length of the header small without being so small that the mating header
on the daughter board cannot be soldered by hand. The default connector chosen for Helios allows
for stacking heights of 9, 10, 11, or 12 mm, depending on the corresponding connector height
used on the daughter board, although heights from 5 to 16 mm can be achieved by installing a
different header on Helios. This allows us to minimize the stacking height of Helios, but also
allows for some flexibility depending on what components need to be placed between Helios and
the daughter board. The length of this connector is 53.8 mm, which allowed us to keep Helios very
small. In fact, the length of this header, in the end, would dictate the overall width of Helios.

D.4.7

Simple I/O
In addition to the more sophisticated I/O provided by the USB and serial ports, Helios needed
to have some basic input and output in the form of buttons, switches, and LEDs (Light-Emitting
Diodes).

LEDs
Helios needed to have at least a power LED, to indicate when the board was receiving
power, and a program LED to indicate whether the FPGA was programmed. These are standard
LEDs on almost any FPGA board. In addition, we decided to add two additional user LEDs, one
red and one green, connected to the FPGA that can be used for any purpose the user sees fit. Additionally, since the USB device is powered through the USB connector, a USB power/connection
LED was added next to the USB connector.
In order to minimize the power consumed by the LEDs and the accompanying resistors,
low-current LEDs were used. These LEDs require far less power to achieve full brightness and
will remain lit even when very low currents are used. Also, to minimize the board area required,
small surface mount LEDs in the 0603 package were used.

Push Buttons
The minimum set of push buttons included a program button, which clears the FPGA configuration and causes it to reload a configuration from its accompanying PROM (Programmable
Read-Only Memory), if available, and a reset button for the CPU. In addition we decided to include one user button, to be used as needed for the application. In reality, since the reset button is
connected to the FPGA, it can also be used as a general-purpose button.
The user button and the reset button also feature a debounce chip to ensure a clean, bounce-free signal from the mechanical buttons. In order to keep the size of the Helios board small,
these buttons would need to be much smaller than typical push buttons. We selected a very small
(4.5 mm × 3.55 mm), tactile, momentary switch for this purpose. This switch is just big enough
to be pushed with a small finger or fingernail.

Switches
The Platform Flash for the FPGA also requires two switches to allow selection of the design
revision to be downloaded to the FPGA. This feature allows multiple FPGA configurations to be
saved on the Platform Flash. These two switches then allow the user to select from up to four
different configurations to be downloaded to the FPGA upon power-up or when the program button
is pushed. In addition, we added one switch to allow the user to select between two different FPGA
configuration modes, which can be either JTAG mode or master serial mode. In JTAG mode, the
FPGA does not attempt to configure itself after power-up, but instead waits to be configured using
the JTAG connection. In master serial mode, the FPGA acts as a master after power up and attempts
to configure itself by reading the attached Platform Flash memory. Other configuration modes are
supported by the FPGA but are not implemented on Helios.
In addition, a switch was added to control the HSWAPEN feature of the FPGA, which
allows the user to enable or disable pre-configuration pull-ups on the I/O pins of the FPGA. Finally,
we added two user switches, to be used as desired for the application.

All of these switches are combined in a single 6-position, sub-miniature DIP, single pole
single throw (SPST) switch. These switches are so small that a pencil tip or other small object is
generally needed to toggle them. This small switch package minimizes the area taken up by the
switches, which would normally take up a large amount of board area.

Simple I/O Placement
Since Helios uses a stacking configuration, all LEDs, buttons, and switches need to be
placed at the edge of the board where they will be visible and accessible, even if a daughter board
is stacked on top of Helios. In order to be able to reach the buttons, right-angle versions were
used so that the buttons can be depressed from the side. Unfortunately, we were not able to find a
right-angle DIP switch in a sufficiently small package to be used on Helios, so a standard switch
orientation was used. This makes it a bit more difficult to toggle the switches, but still relatively
easy with the aid of a pencil tip or other small object. Table D.7 summarizes the simple I/O
interfaces available on Helios.

Table D.7: Simple I/O Interfaces on Helios

LEDs        Power
            FPGA Done
            USB Power
            User (2)
Buttons     FPGA Program
            CPU Reset
            User
Switches    Configuration Mode
            Configuration Revision (2)
            HSWAPEN
            User (2)

D.4.8

Clock Sources
As discussed in Section D.4.5, the USB device requires a 24 MHz clock, which is also
connected to the FPGA. In addition, we added a second clock to be used as the primary
system clock for the FPGA circuitry. This clock can be scaled in a variety of ways using one of
the DCMs (Digital Clock Managers) on the FPGA.
Due to the large size of typical sockets and clock chips that support socket insertion, we
selected smaller, surface mount clocks. These small clocks can only be removed by unsoldering
them and soldering on a new clock. This is an inconvenience in some situations, but in most
situations this can be avoided by scaling the clock frequency using a DCM.
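As a rough illustration of this kind of clock scaling, the sketch below searches for an integer multiply/divide pair that produces a target frequency from the 100 MHz input clock. The function name and the multiplier and divider ranges (2 to 32 and 1 to 32) are assumptions for illustration and should be checked against the FPGA documentation. Real DCMs also restrict the allowed input and output frequency ranges, which this sketch does not model.

```python
# Sketch of frequency synthesis by an integer multiply/divide pair,
# f_out = f_in * M / D. The M and D ranges are assumed for illustration;
# limits on input and output frequency are not modeled.

from fractions import Fraction


def synthesize(f_in_mhz, target_mhz, mult_range=range(2, 33), div_range=range(1, 33)):
    """Return (M, D, f_out_mhz) with f_out = f_in * M / D closest to the target."""
    best = None
    for m in mult_range:
        for d in div_range:
            f_out = Fraction(f_in_mhz) * m / d
            err = abs(f_out - target_mhz)
            # Keep the first pair found with the smallest error.
            if best is None or err < best[0]:
                best = (err, m, d, f_out)
    _, m, d, f_out = best
    return m, d, float(f_out)


# Example: derive a 75 MHz clock from the 100 MHz system clock.
print(synthesize(100, 75))
```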
The default clock we selected is 100 MHz. We have found this to be an excellent choice as
a default clock on the Virtex-4, since meeting timing at this frequency is usually feasible without
having to put forth an unreasonable amount of design effort. Yet this frequency is high enough to
allow for designs with very high throughput.
Additional clocks may also be present in the system. For example, the USB device generates a 48 MHz clock from its 24 MHz clock that is used as the clock for the x16 synchronous
data interface between the USB device and FPGA. Cameras connected to the daughter board also
usually deliver their own clocks. As a result, a typical Helios system will use several different
clocks in different parts of the FPGA, making a good understanding of clock domain management
and inter-domain communication essential for any designer using Helios.

D.4.9

Power Supplies
The design of the power supplies and the power distribution system (PDS) was one of the
most challenging aspects of the design of Helios. This is because power supply design requires specialized knowledge well outside the domain of typical digital designers. Additionally, the FPGA
is a complex device with unusual power requirements, which makes the design of the decoupling
network particularly important. The PDS of Helios uses nearly 200 capacitors to decouple the
system and compensate for inductive effects. An excellent overview of designing the PDS for a
PCB with a Virtex-4 FPGA can be found in [133].
Thanks to our efforts to minimize the number of voltages on Helios, the board requires only
the voltages summarized in Table D.8.

Table D.8: Helios Component Voltages

1.2 V    FPGA Core
1.8 V    Platform Flash
2.5 V    SDRAM
         SRAM
         FPGA I/O Banks (5)
         FPGA VCCAUX
         Expansion Header
         JTAG
         Switches
         Buttons
3.3 V    Flash
         USB
         FPGA I/O Banks (4)
         Expansion Header
         LEDs
         RS-232 Driver
         Clocks

Each of these voltages must be provided by an on-board voltage regulator. The purpose of
the voltage regulator is to take the input voltage, provided through a single power connector on
Helios, and convert it to the needed voltage, and to maintain that voltage, even if the input voltage
varies.
There are basically two types of voltage regulators available: linear and switching. Linear
regulators are typically very small, very simple to use, and inexpensive. In addition, they provide
excellent isolation from the input power supply, meaning that they pass very little noise through to
the devices they are supplying. However, linear regulators are very inefficient. The efficiency of
a voltage regulator is determined by the percentage of input power that is actually output by the
regulator. A regulator that takes in a large amount of power and outputs a much smaller amount
of power would be considered inefficient. The efficiency of a linear regulator can be approximated
using the equation
\[ E = \frac{V_{OUT}}{V_{IN}}. \tag{D.2} \]

For example, suppose a linear regulator is used to power the FPGA core voltage of 1.2 V
and the input supply is a 9 V source. The efficiency of this arrangement is only about 13%. In
other words, the voltage regulator would consume 6.5 times as much power as the FPGA! This is
not only inefficient, but since the power must be dissipated in the form of heat by the regulator,
such an arrangement would require large heat sinks to prevent overheating. The power dissipated
in a linear regulator is approximated by the equation
\[ P = I (V_{IN} - V_{OUT}). \tag{D.3} \]

This power dissipation greatly limits the amount of current that can pass through a linear voltage
regulator and the input to output voltage difference before the regulator overheats. As a result,
linear regulators are typically only used for low current devices where the device voltage is not far
below the input voltage. The exception is applications where power consumption and heat sink
size are of little or secondary concern.
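The 9 V to 1.2 V example can be worked through numerically using Equations (D.2) and (D.3). In the sketch below the function names are hypothetical and the 1 A load current is chosen only for illustration. The ratio of regulator dissipation to load power does not depend on that choice.

```python
# Worked example for a linear regulator using Equations (D.2) and (D.3).
# The 1 A load current is a hypothetical value chosen for illustration.

def linear_efficiency(v_in, v_out):
    """Approximate efficiency of a linear regulator, Equation (D.2)."""
    return v_out / v_in


def linear_dissipation(i_load, v_in, v_out):
    """Approximate power dissipated in a linear regulator, Equation (D.3)."""
    return i_load * (v_in - v_out)


V_IN, V_OUT = 9.0, 1.2
I_LOAD = 1.0  # amperes, hypothetical

eff = linear_efficiency(V_IN, V_OUT)
p_load = I_LOAD * V_OUT                       # power delivered to the FPGA core
p_reg = linear_dissipation(I_LOAD, V_IN, V_OUT)  # power lost as heat

print(f"Efficiency: {eff:.1%}")
print(f"Regulator dissipates {p_reg / p_load:.1f}x the load power")
```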
Since Helios is intended to be power efficient, linear regulators cannot be used, except in
two cases. The first case is the 1.8 V supply of the Platform Flash. The Platform Flash is a very
low-power device that consumes a small amount of power when Helios is first powered on and
the FPGA configures itself. After configuration, the Platform Flash automatically goes into a low-power, standby mode. However, still desiring to keep power consumption low, the 1.8 V regulator
is powered off the 2.5 V supply, giving it an efficiency of about 72%. The second case is the
3.3 V USB voltage regulator. Since the USB subsystem is designed to be powered from the USB
connector’s power, which is supplied by the connected PC, the inefficiency of this regulator does
not affect the amount of power drawn from the main power input of Helios.
In order to maximize efficiency, the remaining three power sources are supplied by switching regulators. Switching regulators work by using a pair of MOSFETs to essentially switch the
supply voltage on and off very rapidly such that the average output voltage is maintained at the
desired voltage level. An inductor and several capacitors are used to smooth the output and deliver
a constant voltage. The disadvantage of this approach is that switching voltage regulators produce
a much noisier power supply. However, such regulators can typically achieve very high efficiencies
between 80% and 95%.
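The switching behavior described above can be related to the output voltage through the duty cycle. The sketch below uses the textbook ideal relationship for a step-down (buck) converter, D = V_OUT / V_IN, which assumes lossless components and continuous inductor current. It is not taken from the Helios design, and the 5 V and 24 V inputs are example values.

```python
# Ideal step-down (buck) duty cycle, D = V_OUT / V_IN.
# Assumes lossless switches and continuous inductor current; real
# converters run at a slightly higher duty cycle to cover losses.

def ideal_buck_duty_cycle(v_in, v_out):
    """Fraction of each switching period that the high-side switch is on."""
    if not 0 < v_out <= v_in:
        raise ValueError("a step-down converter needs 0 < v_out <= v_in")
    return v_out / v_in


for v_out in (1.2, 2.5, 3.3):
    d_low_in = ideal_buck_duty_cycle(5.0, v_out)
    d_high_in = ideal_buck_duty_cycle(24.0, v_out)
    print(f"{v_out} V output: duty cycle {d_high_in:.1%} at 24 V in, "
          f"{d_low_in:.1%} at 5 V in")
```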
One of the challenges of the Helios power supplies was allowing for a wide range of input
voltages. Unless Helios could support a wide range of voltages, many applications would require
additional voltage regulation to bring voltages into the range required by Helios. Due to the
inefficiency of each power supply, the more regulators that are connected in series, the less efficient the
power supply system becomes.
To handle the wide range of input voltages desired, we selected the Linear Technology
LTC1778 Wide Operating Range No RSENSE Step-Down Controller [134]. This controller requires
a number of additional components, including two N-channel MOSFETs, a high-current, low-resistance inductor, several input and output capacitors, as well as several other resistors, capacitors, and diodes that determine the behavior of the controller. The selection of these components
is dictated by a complex set of equations and guidelines. Component selection is further complicated by the need to keep the power supply designs as small as possible. This necessitated the use
of compact, ferrite core inductor coils, low-ESR (Equivalent Series Resistance), specialty polymer aluminum capacitors, and unusually small, high capacitance ceramic capacitors in addition to
standard resistors and capacitors.
Depending on the components selected, the LTC1778 controller can support an input voltage range from 4 V to 36 V, much wider than most voltage regulators. Due to practical design
limitations, and the availability of low-ESR, small, high-voltage input capacitors, we limited the
input voltage to the range of 5 V to 24 V. The main disadvantage to this limitation is that it forbids
the use of a standard 28 V military power rail to power Helios. To date, this limitation has not been
a problem.
The final specifications for the three Helios power supplies are shown in Table D.9. The
2.5 V and 3.3 V supplies are capable of delivering far more power than is needed for Helios. This
additional power capacity is intended for the daughter board and its devices. On the Helios board,
the FPGA is the most power hungry device, consuming as much as 6 W of power before requiring
a heat sink and/or cooling fan. The 1.2 V supply therefore supports the highest amount of current.
As a result, the 1.2 V supply also requires a larger inductor and MOSFETs with higher current
carrying capabilities. All three supported Virtex-4 FX FPGAs are capable of drawing more current
than the 1.2 V supply is capable of delivering. In practice, this is unlikely for the FX20
but quite possible for the FX60. As a result, designers must monitor the power consumption of
their designs in order to avoid exceeding the maximum limits of the 1.2 V supply or overheating
the FPGA to the point that it malfunctions or damages itself.

Table D.9: Helios Power Supply Specifications

                      1.2 V Supply    2.5 V Supply    3.3 V Supply
Min Input Voltage     5 V             5 V             5 V
Max Input Voltage     24 V            24 V            24 V
Max Output Current    6 A             3 A             3 A
Max Power Output      7.2 W           7.5 W           9.9 W

The wide operating range of the three voltage regulators allows all three to be connected
to the same input supply. In contrast to this, many circuit boards have a single voltage regulator
that brings the main power input down to a standard voltage, such as 3.3 V or 5 V. Then the
remaining voltage regulators are fed off this regulated voltage. This approach has the advantage
that it simplifies the design and requirements of the lower voltage supplies, since the input voltage
is fixed and known in advance. However it also results in lower power efficiency overall. For
example, suppose that under a specific load the main voltage regulator is 80% efficient and the
other regulators are 90% efficient. This results in an overall efficiency of 72%, much lower than
the efficiency of the main voltage regulator alone. Helios connects all three voltage regulators to
the same input supply to avoid the inefficiency caused by chaining voltage regulators.
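The cost of chaining regulators follows from multiplying the stage efficiencies. The sketch below reproduces the 80% and 90% example and compares the input power needed in each arrangement. The function names and the 5 W load are hypothetical, and the direct regulator is assumed to reach the same 90% efficiency, which may not hold across the full input range.

```python
# Overall efficiency of regulators connected in series is the product of
# the stage efficiencies. The 5 W load is a hypothetical example value.

from math import prod


def chained_efficiency(stage_efficiencies):
    """Overall efficiency of regulators connected in series."""
    return prod(stage_efficiencies)


def input_power_w(load_power_w, stage_efficiencies):
    """Input power needed to deliver the load power through the given stages."""
    return load_power_w / chained_efficiency(stage_efficiencies)


LOAD_W = 5.0

print(f"Chained efficiency: {chained_efficiency([0.80, 0.90]):.0%}")
print(f"Input power, chained: {input_power_w(LOAD_W, [0.80, 0.90]):.2f} W")
print(f"Input power, direct:  {input_power_w(LOAD_W, [0.90]):.2f} W")
```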
The efficiency of a switching voltage regulator depends on the current load and the input
voltage, as well as the components used in the design of each supply. A small current load or a load
approaching the maximum current both result in less efficient operation. Additionally, the higher
the input voltage, the less efficiently the voltage regulator operates. These relationships are shown
in Figure D.7, which shows the efficiency for a 2.5 V, 10 A maximum current, switching voltage
regulator implemented using the LTC1778.

Figure D.7: Switching regulator efficiency [134]. [Plot of efficiency (%) versus load current (A) for VOUT = 2.5 V, with curves for VIN = 5 V and VIN = 25 V.]

The power supplies are also protected from reversed voltage (i.e., the positive and negative terminals of the power supply being connected backwards). The most common approach to
reverse-voltage protection is to pass the input supply current through a high-current diode. This
is fine when power consumption is not a concern, but the voltage drop across the diode can result
in significant power dissipation when the input current is high. A much more efficient solution
is to use a P-channel power MOSFET, which typically has a much lower drain to source voltage
drop. In Helios, all input current passes through such a P-channel power MOSFET before reaching
the input of the three voltage regulators. If a reverse voltage is applied, the gate of the MOSFET
will have a higher voltage than the drain, turning the transistor off and preventing current flow.
This MOSFET is carefully chosen to allow current to pass through with minimal drain to source
resistive loss and to prevent MOSFET damage when a reversed voltage is applied.
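The difference between the two protection approaches can be estimated by comparing a fixed forward drop with a resistive drop. In the sketch below the 0.5 V diode drop and the 20 mΩ on-resistance are hypothetical values chosen for illustration. They are not the parameters of the parts used on Helios.

```python
# Conduction loss of a series diode versus a series MOSFET.
# FORWARD_DROP_V and RDS_ON_OHM are hypothetical illustration values.

def diode_loss_w(current_a, forward_drop_v):
    """Loss in a diode modeled as a constant forward voltage drop."""
    return current_a * forward_drop_v


def mosfet_loss_w(current_a, rds_on_ohm):
    """Loss in a fully enhanced MOSFET modeled as a resistance."""
    return current_a ** 2 * rds_on_ohm


FORWARD_DROP_V = 0.5
RDS_ON_OHM = 0.020

for current in (0.5, 1.0, 2.0):
    p_diode = diode_loss_w(current, FORWARD_DROP_V)
    p_fet = mosfet_loss_w(current, RDS_ON_OHM)
    print(f"{current:.1f} A: diode {p_diode * 1000:.0f} mW, MOSFET {p_fet * 1000:.0f} mW")
```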
Another important characteristic of the power supply is its ramp-up behavior (i.e., how the
voltage of each regulator rises up to its target voltage). Most devices expect the voltage of a power
supply to rise linearly after being powered up. Additionally the rise time needs to be within a
certain range. If the power supply does not meet the specification set forth by the chip designer
then the chip’s power-on reset (POR) circuitry may not function properly. As a result, the chip
may malfunction after being powered on. In the case of the Xilinx Virtex-4, the power supplies
needed to rise monotonically with a rise time between 20 ms and 50 ms. Unfortunately, our power
supplies did not meet this specification. Similar constraints were violated for the Platform Flash
and USB, each of which used a different power supply.
The solution to this problem is to use supervisory circuits to hold each of these devices
in reset until a fixed amount of time has elapsed and the power supply voltages have stabilized.
These supervisor chips are very small (less than 3 mm × 3 mm) and very inexpensive components.
One is used to ensure proper reset of both the FPGA and Platform Flash while a second is used to
ensure proper reset of the USB device after a USB plug is inserted.

D.4.10 Resistors and Capacitors

In addition to the essential components discussed in the previous sections, Helios also
consists of hundreds of capacitors, resistors, and other simple components. To minimize area,
most resistors and capacitors are chosen in the 0402 package size. These are 1 mm × 0.5 mm
components, one size larger than the smallest available, but not so small that they require special
equipment to replace, or incur excessive purchase and assembly costs.
Most capacitors are present for the purpose of decoupling. High-speed integrated circuits
require an array of capacitors covering a range of capacitance values. The largest number of capacitors is devoted to the smallest capacitance values. A progressively smaller number of capacitors
is required for each larger capacitance value. This scheme is intended to effectively decouple the
power supply near each component for different frequencies of transient current demand. This is
particularly important for the FPGA since different parts of the FPGA can be operated at different
frequencies making its power needs much more variable than a typical integrated circuit. As a
result, FPGAs require a larger number and variety of capacitors than other chips.
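The reason a range of capacitance values is needed can be sketched with the usual series R-L-C model of a real capacitor. Each value is most effective near its self-resonant frequency, where its impedance falls to roughly the ESR. The 10 mΩ ESR and 1 nH parasitic inductance below are assumed, illustrative values, not measurements of the parts used on Helios.

```python
import math


def capacitor_impedance_ohm(freq_hz, cap_f, esr_ohm=0.01, esl_h=1e-9):
    """Impedance magnitude of a capacitor modeled as a series R-L-C circuit.

    esr_ohm and esl_h are assumed parasitic values chosen for illustration.
    """
    omega = 2.0 * math.pi * freq_hz
    reactance = omega * esl_h - 1.0 / (omega * cap_f)
    return math.hypot(esr_ohm, reactance)


def self_resonant_freq_hz(cap_f, esl_h=1e-9):
    """Frequency at which the capacitive and inductive reactances cancel."""
    return 1.0 / (2.0 * math.pi * math.sqrt(esl_h * cap_f))


# Smaller capacitors resonate at higher frequencies, so each value
# covers a different band of transient current demand.
for cap_f in (1e-6, 1e-7, 1e-8):
    f0 = self_resonant_freq_hz(cap_f)
    print(f"{cap_f * 1e6:.2f} uF: self-resonance near {f0 / 1e6:.1f} MHz")
```

In this model a smaller capacitance with the same parasitic inductance resonates at a higher frequency, which is consistent with using many small-value capacitors for high-frequency decoupling and fewer large-value capacitors for lower frequencies.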
Since most of the capacitors have small capacitance values, ceramic capacitors are used.
These are very inexpensive and small capacitors with very low ESR (Equivalent Series Resistance),
low inductance, and good high frequency response. The larger capacitance values often necessitate
a larger package than 0402, such as 0603, 1206, or other package sizes. For the largest capacitance
values (e.g., ≥ 100 µF), tantalum capacitors are used due to their relatively small size. Capacitors
must also be carefully chosen to ensure that they have a sufficient maximum voltage specification
for the power supply they are decoupling.
Many of the resistors are present for the purpose of series termination for high-speed signals. The number of resistors required on Helios is greatly reduced, thanks to the digitally controlled impedance (DCI) feature of the Virtex-4. This feature allows the output impedance of all
I/O pins in a bank of the FPGA to be controlled with just two external resistors. As a result, we can
use two resistors in place of what might otherwise be 62 resistors for a single I/O bank. The other
high-speed components on the board, such as the SDRAM and SRAM, do not have this feature,
and therefore require termination resistors to be present on all drivers. This will be discussed in
more detail in Section D.5.1.

When possible, resistor arrays are used in place of discrete resistors. Resistor arrays are
special resistor packages that actually contain two or more resistors in a single package, allowing
more resistors to be placed in a smaller area. The disadvantage of these resistor arrays is that
they are much more difficult to solder than discrete resistors, increasing the likelihood of a board
assembly flaw.
Many other resistors are present as pull-up or pull-down resistors, which pull a signal high
or low, respectively, when not being otherwise driven, without the risk of a short circuit when
it is being driven. The remaining resistors are present mainly for the purpose of controlling the
behavior of the devices to which they are connected.

D.5 Printed Circuit Board Design

The actual printed circuit board (PCB) design, or layout, is perhaps the most challenging
part of a board design. Layout is particularly difficult in a board like Helios where we are attempting to put a number of high density components in a very small board area while keeping the
board cost to a minimum and ensuring good signal integrity. An introduction to PCB design for
the Virtex-4 FPGA can be found in [133].

D.5.1 Signal Integrity

The term signal integrity, or SI, refers to the quality of an electrical signal. If proper measures
are not taken, the high frequency signals of the Helios board, such as the more than 100 MHz
signals of the SRAM and SDRAM, will become corrupted, can become susceptible to interference
from outside sources, or can be a source of interference to other electronics.
There are many factors that affect signal integrity, far too many to discuss here. For an
overview of designing high-speed digital signals, the reader is encouraged to consult Howard Johnson’s text on the subject [135]. Fortunately, when it comes to digital PCB design, creating signals
with relatively good signal integrity can be boiled down to several important rules of thumb.

Current Loops
One of the most important things a designer must do to ensure good signal integrity is to
provide quality current paths for all signals. Each signal on the PCB forms a complete loop, which
includes the driving chip, the PCB wire carrying the signal, called a trace, the receiving chip,
and the return path for the signal current through the electrical ground of the power distribution
system. This relationship is shown in Figure D.8. Novice designers often do not realize that the
signal integrity of a trace depends not only on the trace but also on the current return path, which
is usually not seen.

Figure D.8: PCB trace current flow. This example assumes the driver output is rising to a high logic level. Current flows in the opposite direction when the driver output is falling.

There are a few potential problems that create poor current paths. First, the power supply
may be connected by a high impedance path. A high impedance path along the power and ground
connections for a chip prevents the power supply from properly powering the chip when its I/O
signals rapidly toggle. To alleviate this problem, decoupling capacitors are placed between power
and ground near the power supply pins of every digital chip. These capacitors effectively become a
local, low-impedance power supply for the chip. A high impedance path to ground can also cause
a voltage drop between the internal ground reference of a chip and the actual ground of the power
supply as current flows from the chip’s ground pins to the ground of the power supply. This can
disrupt proper data communication since all signals are referenced relative to the ground as seen
by the chip.

In addition, since the signal on a trace forms a loop including the ground connection between the two communicating devices, the cross-sectional area of the loop must be kept as small
as possible. A large cross-sectional area leads to a high inductance around the loop, which distorts
signals and increases the likelihood of ringing. A large cross-sectional area of the current loop can
lead to other problems as well. Any time current flows in a loop, a magnetic field is generated. This
changing magnetic field can then affect other nearby signals, which also form loops. This effect is
called mutual inductance. Such signal loops act like small antennas, spewing out electromagnetic
interference (EMI) and making the signals susceptible to interference from outside sources.
The rule of thumb to avoid this problem is to make the ground return path for every signal
as close as possible to the signal. This is achieved through the use of ground planes, as shown
in Figure D.9. PCBs are composed of metal layers stacked between insulating layers. A ground
plane is a metal layer dedicated to the ground of the power supply. In a high-speed digital system,
each signal layer runs adjacent to a ground plane. This simultaneously keeps the current loops as
small as possible while minimizing the inductance of the ground connection, in addition to greatly
simplifying routing of the board.
In a similar fashion, power planes are also used to further minimize power supply inductance. In addition to power and ground planes, designers must also keep the connections from each
chip to the power and ground planes as short and as thick as possible by using very short and wide
traces as well as by placing vias (the vertical connections between layers) as close to the power
and ground pins of the chip as possible. The same rules apply to the connection of decoupling
capacitors to the supplies they decouple.

Figure D.9: PCB trace with ground plane.

One detail that is often forgotten by digital PCB designers is that they must ensure that the
ground plane running under a trace is uninterrupted. Currents flowing on the ground plane always
seek the path of lowest inductance. This means that the return current for a trace will actually
travel in the ground plane directly under the signal trace, even as the trace twists and turns. As a
result, a large gap or slot in the ground plane underneath a trace will cause the return current path
to flow around the gap. This diverted current path creates a larger current loop, violating our rule
for maintaining a minimum cross-sectional area for the current loop. Such gaps are often called
split planes and may be present for a variety of reasons, including isolating an analog power supply
from a digital one. The solution to this problem is to route the signal trace around the underlying
gap in the ground plane, so that the return current is not forced to deviate from the path of the
signal current.

Impedance, Reflections, and Termination
High-speed signals propagating down a PCB trace behave much like waves in a tunnel of
water. A low-speed signal changing its voltage is analogous to slowly filling one end of the tunnel
with water. Doing so slowly changes the level of water along the entire tunnel. A high-speed signal
is analogous to dumping a large amount of water into one end of the tunnel. Doing so creates a
wave that then propagates down to the end of the tunnel where it will likely bounce back, creating
a smaller wave that returns to the source. Such waves on a PCB trace can bounce back and forth
multiple times, corrupting the signal, before the wave finally dissipates due to resistive loss. This
reflection is caused by an impedance mismatch between the trace and receiver as well as between
the driver and trace.
Whether or not a signal will manifest this problem depends on the length of the trace
relative to the rise/fall time of the signal and the propagation delay of the trace, which is a function
of the trace’s inductance and capacitance. A generally accepted rule of thumb is that the length of
the trace, L, should satisfy the relationship [135]
\[
L < \frac{T_{\mathrm{rise}}}{6\,T_{\mathrm{prop}}}. \tag{D.4}
\]

The propagation delay for a trace on a PCB made from FR4 material is typically around
150 ps per inch. On Helios, with SRAM signal frequencies as high as 250 MHz, the rise time of
these signals can be expected to be under 1 ns. Using these numbers and Equation D.4, we find
that such traces need to be shorter than about one inch to avoid problems, a length that is pushing
the limits even for a small board like Helios.
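The arithmetic behind this estimate can be reproduced with a few lines of code. The sketch below simply evaluates Equation D.4 for the 1 ns rise time and 150 ps per inch propagation delay quoted above. The function name is ours and is used only for illustration.

```python
def max_unterminated_length_in(rise_time_s, prop_delay_s_per_in):
    """Longest trace (in inches) satisfying L < Trise / (6 * Tprop)."""
    return rise_time_s / (6.0 * prop_delay_s_per_in)


# Values quoted in the text: 1 ns rise time, 150 ps per inch on FR4.
limit_in = max_unterminated_length_in(1e-9, 150e-12)
print(f"maximum unterminated trace length: {limit_in:.2f} in")
```

The result is about 1.1 inches, which matches the "about one inch" limit stated above. A faster edge shortens the limit proportionally.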
To avoid reflections on PCB traces, several different termination techniques can be employed. Most termination techniques work by altering the output impedance at the driver or the
input impedance at the receiver to match the characteristic impedance of the trace. The simplest
and most power efficient way to terminate a trace is to place a series resistor at the source, as was
shown in Figure D.6(a) on page 292. This termination technique allows signal reflections to occur
at the receiver but absorbs the reflections when they return to the driver. The resistance of these
termination resistors must be chosen such that the output impedance of the driver plus the resistor
value equals the characteristic impedance of the trace. The output impedance of the driver is estimated from the IBIS (I/O Buffer Information Specification) or other I/O models provided by the
chip’s manufacturer.
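The selection rule for a series termination resistor can be sketched as follows. The 50 Ω trace impedance and 17 Ω driver output impedance are assumed example values, not figures from the Helios design or from any particular IBIS model. The reflection coefficient uses the standard transmission-line expression.

```python
def series_termination_ohm(trace_z0_ohm, driver_z_ohm):
    """Series resistor that makes driver impedance plus resistor equal Z0."""
    if driver_z_ohm > trace_z0_ohm:
        raise ValueError("driver impedance already exceeds the trace impedance")
    return trace_z0_ohm - driver_z_ohm


def reflection_coefficient(termination_z_ohm, trace_z0_ohm):
    """Fraction of an arriving wave reflected by the terminating impedance."""
    return (termination_z_ohm - trace_z0_ohm) / (termination_z_ohm + trace_z0_ohm)


z0 = 50.0        # assumed characteristic impedance of the trace
z_driver = 17.0  # assumed driver output impedance (e.g., estimated from an I/O model)
r_series = series_termination_ohm(z0, z_driver)
print(f"series resistor: {r_series:.0f} ohm")
print(f"reflection at source: {reflection_coefficient(z_driver + r_series, z0):.2f}")
```

With the source side matched, the wave reflected from the receiver sees a reflection coefficient of zero when it returns to the driver, so it is absorbed rather than bounced back down the trace.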
In order for this technique to work, the characteristic impedance of all traces must be controlled. This impedance is a function of the trace width, thickness, and height above/below the
adjacent plane layers. Since the height of the trace from adjacent plane layers and the thickness of
the trace are fixed for the entire PCB layer, the designer is responsible for managing the width of
the traces. The needed trace width can be approximated using standard equations, such as those
provided by IPC-2251 [136].
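To give a sense of how trace width is chosen, the sketch below uses a widely quoted closed-form approximation for a surface microstrip. It may differ from the equations in [136] and is only reasonable for moderate width-to-height ratios. The dielectric constant of 4.3 and the example dimensions are assumptions chosen for illustration.

```python
import math


def microstrip_z0_ohm(width_mil, height_mil, thickness_mil, eps_r):
    """Approximate characteristic impedance of a surface microstrip trace.

    Uses the common approximation
        Z0 = 87 / sqrt(eps_r + 1.41) * ln(5.98 * h / (0.8 * w + t)).
    """
    return 87.0 / math.sqrt(eps_r + 1.41) * math.log(
        5.98 * height_mil / (0.8 * width_mil + thickness_mil))


def width_for_z0_mil(target_ohm, height_mil, thickness_mil, eps_r):
    """Trace width giving the target impedance (algebraic inverse of the above)."""
    ratio = math.exp(target_ohm * math.sqrt(eps_r + 1.41) / 87.0)
    return (5.98 * height_mil / ratio - thickness_mil) / 0.8


# Assumed example: 5.4 mil dielectric, 1.4 mil thick copper, eps_r = 4.3.
width = width_for_z0_mil(50.0, height_mil=5.4, thickness_mil=1.4, eps_r=4.3)
print(f"width for 50 ohm: {width:.1f} mil")
```

Because the dielectric height and copper thickness are fixed by the stackup, width is the only free variable, and a wider trace gives a lower impedance. A field solver or the manufacturer's impedance calculator would normally be used to confirm the final value.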

Other Guidelines
A variety of other general guidelines, or rules of thumb, should also be followed when
designing the layout of a PCB. These include, but are not limited to the following.
• Isolate components from each other. This limits the amount of crosstalk and interference
between the devices. It is also wise to group high-speed components together and physically
separate them from analog or noise-sensitive components.

• Isolate analog power supplies from digital ones. Digital switching creates noise in the power
supply system which can be transmitted to analog devices. It is best to isolate the two power
supplies from each other, either by filtering the noisy power supply connection leading to
the analog devices or by using a completely separate voltage regulator. This also includes
isolating the power and ground planes, which may necessitate split planes.
• Do not route traces at right angles, since such angles act like small antennas and create an
impedance mismatch due to the non-constant width of the trace at the corner. Instead use
two 45° angles spaced slightly apart.
• Do not route high-speed traces along the edge of the board, since the incomplete ground
plane underneath will affect the characteristic impedance of the trace.
• Avoid running traces in parallel very close to each other for long distances, since this increases the crosstalk between the traces. Instead, space the traces further apart. A rule of
thumb is that the spacing between traces should be three times the width of a trace. This
rule is not as important if the traces are part of the same synchronous bus, since crosstalk is
caused by signal transitions. Intra-bus crosstalk has time to settle before the bus is sampled
by the receiver.
• Follow the power supply decoupling guidelines for each chip. This generally means placing
a number of X7R and/or X5R ceramic decoupling capacitors as close as possible to each
of the chip’s power supply pins. Each board power supply, as a whole, also needs larger
decoupling capacitors, which can be placed much further from the chips for which they
provide decoupling. Proper decoupling reduces power supply noise and compensates for the
inductance inherent in the power supply that might otherwise lead to brown-outs on a chip.
Low ESR capacitors in small packages are best since they have lower effective resistance
and inductance. The traces connecting these capacitors to power and ground also need to be
as short and wide as possible, to keep the inductance and current loop size to a minimum.
When connecting larger capacitors to the power and ground planes, multiple vias may be
used for the same reasons.
• Ensure that power traces are wide enough for the current they are intended to carry.
• Always keep in mind that a PCB is a three-dimensional object, although it is usually designed
in two dimensions. This means verifying that component heights and dimensions will not
interfere with components on boards stacked above and verifying that there is sufficient
clearance for connectors and other components that require interaction.

D.5.2 Layer Stackup

One of the most important considerations when designing a PCB is called the layer stackup.
This includes the order in which layers are stacked, the thickness of each metal layer, and the
amount of separation between metal layers. For additional fees, PCB manufacturers can provide
customized stackups, where the designer simply specifies the impedance desired for a given trace
width on each layer and the manufacturer fine tunes the stackup to achieve the desired characteristics. For a reduced cost, the designer can also choose from a standard stackup provided by the
manufacturer.
The first decision to be made is how many layers are needed for the PCB in question,
keeping in mind that up to half of the layers will be dedicated to power and ground planes. This is
not only important to be able to route all the signals in the small board space provided but it’s also
an important economic decision. As the number of layers is increased, the cost of each PCB rises
dramatically, as shown in Figure D.10. Therefore, it is important to use the minimum number of
layers possible.
The target FPGA, the Xilinx Virtex-4 FX in the FF672 package, has 672 balls that must
be soldered to the PCB, with about 320 I/O signals that must be routed to various parts of the
PCB. Most of the remaining balls connect to various power and ground voltages. At the time of
this writing, a typical minimum trace width for a PCB manufacturer is around 5 mils (a mil is one
thousandth of an inch). As a result, it is simply not possible to route all of these signals on a single
layer, since only one trace can fit between adjacent balls of the BGA package. Instead, vias are
placed between balls so that most of the balls can connect to other signal layers in the PCB.
As it turns out, it takes four signal layers to escape route the 320 I/O signals. Four signal
layers require at least two ground planes so that each layer has an adjacent ground plane, unless
some of the layers can be dedicated to low-speed signals, which is not practical for a compact
board like Helios. In addition, the FPGA requires three voltages—a 1.2 V core voltage, 2.5 V

Figure D.10: Helios cost per PCB versus the number of metal PCB layers, assuming purchase of ten 90 mm × 65 mm boards. Source: www.4pcb.com. (Plot of PCB cost (USD) versus PCB layers.)

for VCCAUX and the 2.5 V I/O banks connected to the SRAM and SDRAM, and 3.3 V for the
remaining I/O banks. Due to the relative location of these power pins underneath the FPGA, two
power planes are required to properly connect all of these voltages. Therefore, in total, Helios
would require eight metal layers. A nine or ten-layer board could also be used, making the routing
much less difficult but with a 51% increase in per PCB cost. A six-layer board would reduce PCB
cost but would require us to break many of the PCB design rules for good signal integrity described
in Section D.5.1.
To reduce costs, we decided to use the standard 8-layer stackup provided by our manufacturer, which provides an industry standard 0.062 inch overall thickness. Trace widths on the signal
layers were carefully controlled to ensure the proper characteristic impedance for the high-speed
traces. Since the signal layers should be placed adjacent to ground planes, the stackup shown in
Figure D.11 was used. Other stackup orders could be used, each with different trade-offs with
respect to signal integrity and EMI qualification. The stackup shown allows high-speed traces
to be routed on all layers, which we felt would greatly simplify the routing of Helios. The two
main disadvantages of this stackup are that the high-speed signals placed on the top and bottom

layers of the board are more likely to generate EMI and the stackup does not benefit from the increased decoupling capacitance created when the power and ground planes are placed adjacent to
one another.
Top Signal Layer (1 oz)
2 Sheets 1080 Prepreg (5.4 mil)
Ground Plane (0.5 oz)
Core (10 mil)
Inner Signal Layer (0.5 oz)
2 Sheets 1080 Prepreg (5.0 mil)
Power Plane (0.5 oz)
Core (10 mil)
Power Plane (0.5 oz)
2 Sheets 1080 Prepreg (5.0 mil)
Inner Signal Layer (0.5 oz)
Core (10 mil)
Ground Plane (0.5 oz)
2 Sheets 1080 Prepreg (5.4 mil)
Bottom Signal Layer (1 oz)

Figure D.11: Helios PCB layer stackup.
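As a rough consistency check, the layer thicknesses listed in Figure D.11 can be summed and compared with the nominal 0.062 inch board thickness. The sketch below assumes the common conversion of about 1.37 mil of copper thickness per ounce of copper weight. That conversion is an assumption of this example, not a value given by the manufacturer.

```python
# Assumed nominal copper thickness per ounce of copper weight, in mils.
MIL_PER_OZ = 1.37

# Layers from Figure D.11, top to bottom.
# Copper entries give weight in oz; dielectric entries give thickness in mils.
STACKUP = [
    ("Top Signal Layer", "copper", 1.0),
    ("Prepreg", "dielectric", 5.4),
    ("Ground Plane", "copper", 0.5),
    ("Core", "dielectric", 10.0),
    ("Inner Signal Layer", "copper", 0.5),
    ("Prepreg", "dielectric", 5.0),
    ("Power Plane", "copper", 0.5),
    ("Core", "dielectric", 10.0),
    ("Power Plane", "copper", 0.5),
    ("Prepreg", "dielectric", 5.0),
    ("Inner Signal Layer", "copper", 0.5),
    ("Core", "dielectric", 10.0),
    ("Ground Plane", "copper", 0.5),
    ("Prepreg", "dielectric", 5.4),
    ("Bottom Signal Layer", "copper", 1.0),
]


def dielectric_thickness_mil(stackup):
    """Total thickness of all prepreg and core layers."""
    return sum(value for _, kind, value in stackup if kind == "dielectric")


def copper_thickness_mil(stackup):
    """Total thickness of all copper layers, converted from copper weight."""
    return MIL_PER_OZ * sum(value for _, kind, value in stackup if kind == "copper")


def total_thickness_mil(stackup):
    """Sum of dielectric and copper thicknesses."""
    return dielectric_thickness_mil(stackup) + copper_thickness_mil(stackup)


print(f"dielectric: {dielectric_thickness_mil(STACKUP):.1f} mil")
print(f"copper:     {copper_thickness_mil(STACKUP):.2f} mil")
print(f"total:      {total_thickness_mil(STACKUP):.2f} mil")
```

Under this assumption the listed layers add up to a little under 58 mil. The remainder of the nominal 62 mil is presumably accounted for by plating, solder mask, and manufacturing tolerance, although the source does not break this down.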

In addition to the metal and dielectric layers, Helios, like most complex PCBs, also has
solder mask and silkscreen layers on the top and bottom of the board. The solder mask, or solder
resist, is a polyimide film layer that provides a permanent protective coating over the PCB, preventing accidental contact with the metal on the surface of the PCB. This is the coating that gives
most PCBs their green color, although other colors are available. The only parts of the PCB left
exposed are the component pads, which are the metal contacts where the pins of each component
are actually soldered to the board. Since only the points to be soldered are exposed, the solder
mask greatly facilitates the soldering process by preventing solder from being wicked away from
the pad by the connecting trace. This is also critical for wave soldering, a high-volume board
soldering process, where molten solder is allowed to flow over the entire surface of the board but
only adheres to the exposed pads and metal contacts.
Each silkscreen layer is a legend printed on the board to identify locations and components.
This is for ease of use and debugging and is not critical to the function or assembly of the board.
For example, each button, switch, and LED is clearly labeled for convenience. Additionally, each
component is numbered to facilitate identification.

D.5.3 Final Layout and Organization

The final component layout of Helios is shown in Figure D.12, which shows the top side
of the Helios board. All of the main components are placed on the top side since there is no real
benefit to placing large BGA components on the bottom side of the board, unless blind and buried
vias can be used. We reduced the size from the earlier concept drawings (e.g., Figure D.2) after
the size and layout constraints of each component became more clear. In the end, the length of
the expansion header dictated the width of the Helios board. The final dimensions of Helios are
65 mm × 90 mm, or about 2.5 in × 3.5 in, making it about the size of a deck of playing cards.
Several constraints influenced the layout of the board in addition to those already discussed
in Section D.5.1. Perhaps most importantly, the components of the board must be carefully arranged so that components sharing the same voltages are placed in the same region of the board.
This greatly facilitates the design of the power distribution system. In fact, without careful consideration, the routing of power would not be possible with the limited number of power plane layers
available for the distribution of power.
Another important factor that affects the layout is usability. Helios is designed to be expanded through the vertical stacking of daughter boards. This means that buttons, switches, LEDs,
and connectors placed in the middle of the board would be inaccessible. Instead, all buttons,
switches, connectors, and LEDs needed to be at the edge of the board. Additionally, Helios is
intended to be embedded into small systems. As a result, it would often be put into compact areas,
making it more difficult to gain access to these essential interfaces. In order to minimize the difficulty in accessing these various features, we decided that all buttons, switches, and LEDs would be
placed along one short edge of the board, along with the power connector. All three data interfaces
(USB, JTAG, and serial port) would be placed along one long edge of the board. This
meant that only two edges of the board needed to be made accessible to provide full access to all
the board’s interfaces. Unfortunately, this further complicated the layout of the board.
The final consideration is the partitioning of I/O signals among the I/O banks of the FPGA.
All I/O pins on the Virtex-4 FX FPGA in the FF672 package are divided among 9 different banks.

Figure D.12: Helios board layout. (a) Component Locations. (b) Helios Board Prototype.

Each bank has its own power supply, which requires that all pins in an I/O bank share the same I/O
voltage. Additionally, all I/O pins in the same bank that use the DCI feature must be configured to
have the same output impedance.
There are two different sizes of banks on the FPGA. There are 6 small banks (bank numbers
1–4 and 9–10) with 16 I/O pins each and 4 large banks (bank numbers 5–8) with 64 I/O pins each,
giving a total of 352 user I/Os. However, banks 9 and 10 are only available on the FX40 and FX60,

not on the FX20. These banks are not used on Helios so that all three FPGAs could be used, thus
reducing the total number of available I/Os to 320.
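The I/O totals quoted above follow directly from the bank sizes, as the short sketch below shows. The bank numbers and pin counts are those given in the text. The function and constant names are ours.

```python
IOS_PER_SMALL_BANK = 16
IOS_PER_LARGE_BANK = 64
SMALL_BANKS = (1, 2, 3, 4, 9, 10)
LARGE_BANKS = (5, 6, 7, 8)


def user_io_count(unused_banks=()):
    """Total user I/O pins, skipping any banks listed in unused_banks."""
    total = 0
    for bank in SMALL_BANKS + LARGE_BANKS:
        if bank in unused_banks:
            continue
        total += IOS_PER_SMALL_BANK if bank in SMALL_BANKS else IOS_PER_LARGE_BANK
    return total


print(user_io_count())                      # 352 with all banks available
print(user_io_count(unused_banks=(9, 10)))  # 320 without banks 9 and 10
```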
Due to the bank restrictions, each device must be matched with one or more I/O banks
having the correct voltage and impedance capabilities. The partitioning of I/Os among the banks
for Helios is shown in Figure D.13, which shows how the larger components are connected to the
FPGA. In summary, the SRAM and SDRAM are each allotted their own large I/O bank, a third
large bank is reserved for the flash and USB chips, and the fourth large bank is dedicated to the
expansion header. The smaller banks are dedicated to clocks, LEDs, buttons, switches, the serial
port, as well as a few low-speed signals for the SRAM and flash. Most of bank 2 and much of
bank 1 is left unused.

Figure D.13: Helios board organization. (Block diagram of the FPGA I/O banks B0–B10 with their supply voltages and the devices attached to them, including the serial port, Platform Flash, USB, JTAG, clocks, SRAM, SDRAM, flash, and expansion header.)

Viewed from above (Figure D.12), the layout of Helios can roughly be divided into three
regions. On the left we find the expansion header, clocks, flash, and USB, all of which require the
3.3 V supply. In the middle is the FPGA, which has a 1.2 V core voltage. On the right we find the
SDRAM and SRAM, which run on 2.5 V. This partitioning was intentional and the placement of
the three primary voltage regulators corresponds to this partitioning. In addition to the 1.2 V core
voltage of the FPGA, the I/O banks on the left side of the FPGA are powered by the 3.3 V supply

(for communication with the expansion header, flash, and USB) and the I/O banks on the right side
of the FPGA are powered by the 2.5 V supply (for communication with the SRAM and SDRAM).
To accomplish this power distribution, the first power plane is dedicated primarily to the
1.2 V FPGA core voltage, but is also used to provide a separate 3.3 V plane for the USB, a 1.8 V
plane for the Platform Flash, and to route the 2.5 V supply to the expansion header for use by
daughter boards. This is shown in Figure D.14(a). The second power plane is dedicated to the
3.3 V supply on the left and the 2.5 V supply on the right, as shown in Figure D.14(b).

D.6 Production

After the design of Helios was finalized, the actual production of boards could begin. Due
to the high cost of building a custom board, we decided that we would minimize the risk in our
initial build by only building one board that would be carefully tested before committing funds to
building more boards. In retrospect, it might have been wiser to build at least two boards, so as
to not put all the risk into a single board. However, this turned out to be the right decision since
the first board worked and led to a few minor corrections in the next revision. The resulting Helios
board is shown in Figure D.15.
The production of boards, such as Helios, turned out to be much more involved than we
initially anticipated. The production consists of a few seemingly simple steps, including acquiring
all the components, ordering the PCBs, and then the actual assembly of those parts.
The PCBs were manufactured by Advanced Circuits (www.4pcb.com) in Colorado, USA,
which specializes in low-volume prototypes. Due to the complexity of the Helios PCB, we were
required to carefully communicate with the PCB manufacturer to ensure that each PCB would meet
the specifications for which Helios was designed.
Part acquisition is particularly complicated due to the availability of the needed components. As discussed in Section D.2.6, many of the key components needed for Helios are difficult
to find in the low quantities needed. As a result, each Helios build required us to scramble to find
distributors who either stocked the parts or were willing to order them in large quantities and sell us
only a small number. In all, Helios has 327 components with 77 unique part numbers that
must be acquired from several different distributors prior to assembly. Many assemblers recognize the difficulty in acquiring parts and now offer turn-key services, in which the board assembler
acquires all required components at an additional cost.

Figure D.14: Helios power planes. (a) Power Plane 1. (b) Power Plane 2.

Figure D.15: The Helios board.

Due to the types of high-density packages used on Helios, such as the BGA packages,
Helios could not be soldered by hand. Instead it required special equipment, such as placement
machines, reflow ovens, and X-ray machines in order to properly place, solder, and verify the
installation of the components. Board assembly is an extremely error-prone process that we felt
was best left to an experienced assembly company. The boards were assembled by a division
of MEC, called Screaming Circuits, in Oregon, USA. This company specializes in low-volume,
high-density PCB assemblies, such as Helios.
The final cost of each Helios board varies dramatically with the production volume. Although we were seeking to build only a small volume of Helios boards, it was to our benefit to
consolidate all Helios builds into single large orders in order to keep the per board cost down.
The variation in board price is directly related to the variation in PCB cost, assembly cost, and
component cost as volumes increase.

The manufacturing of a PCB requires a significant amount of setup before the first PCB
can be produced. Once the setup for each layer has been performed, each additional board can be
produced for a small additional cost. Additionally, the PCB manufacturing machinery is capable
of producing a PCB panel of up to a specific size. This size is large enough that a single panel can
fit several Helios PCBs. As a result, a single Helios PCB essentially costs the same as several. The
relationship between per-PCB cost and volume is shown in Figure D.16. Ordering a
single PCB costs nearly 16 times as much per board as ordering 20 PCBs.

Figure D.16: Helios per PCB cost versus total quantity ordered for an 8-layer, 90 mm × 65 mm board. Source: www.4pcb.com. (Plot of PCB cost (USD) versus quantity ordered.)

The assembly has similar setup overhead costs. For each assembly job, the components to
be installed on the board must be organized and installed into machinery. Additionally, a solder
stencil must be manufactured for the top and bottom layers of the PCB. These stencils will be
used to accurately apply solder paste to the component pads on each PCB. Once this initial setup is
complete, each additional board can be assembled for a much lower cost. The relationship between

per-board assembly cost and volume is shown in Figure D.17. In the case of assembly, assembling
a single board costs nearly 5 times as much as the per-board cost of assembling 20 boards.
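One way to read these ratios is with a simple model in which each order incurs a fixed setup cost plus a constant incremental cost per board. This model is an assumption made for illustration. The actual quotes need not follow it exactly, particularly for PCBs, where several boards fit on one panel.

```python
def per_board_cost(quantity, setup_cost, unit_cost):
    """Per-board cost when a fixed setup cost is shared across the order."""
    return setup_cost / quantity + unit_cost


def implied_setup_to_unit_ratio(cost_ratio, quantity):
    """Setup cost divided by unit cost implied by an observed price ratio.

    cost_ratio is the single-board cost divided by the per-board cost at
    `quantity`. Solving (S + u) / (S / n + u) = r for S / u gives
    S / u = (r - 1) / (1 - r / n).
    """
    if cost_ratio >= quantity:
        raise ValueError("cost_ratio must be smaller than quantity in this model")
    return (cost_ratio - 1.0) / (1.0 - cost_ratio / quantity)


# Ratios quoted in the text for a single board versus 20 boards.
print(f"PCB fabrication: setup is about {implied_setup_to_unit_ratio(16, 20):.0f}x the unit cost")
print(f"Assembly:        setup is about {implied_setup_to_unit_ratio(5, 20):.1f}x the unit cost")
```

Under this model, the factor of nearly 16 for fabrication corresponds to a setup cost roughly 75 times the incremental cost of one board, while the factor of nearly 5 for assembly corresponds to a setup cost only about 5 times the incremental cost. This is consistent with the observation that fabrication is dominated by setup, whereas assembly has a larger per-board component.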

Figure D.17: Helios assembly cost, per board, versus total quantity ordered. Source: www.screamingcircuits.com. (Plot of per board assembly cost (USD) versus quantity ordered.)

Component prices also benefit from volume. Each component order must be properly
counted and packaged for delivery. The larger the order for a given component, the easier it
is to count the components, relatively speaking, since they are typically already organized into
quantities of various sizes. Also, the packaging becomes relatively less expensive with a large
number of ordered components. Unfortunately, the price benefits of increased component volume
for Helios are not as significant, since price breaks generally do not occur until tens, hundreds, or
even thousands of a component are purchased. For example, consider the cost of a single 256-Mb
SDRAM chip, shown in Figure D.18(a). Keeping in mind that each Helios board only needs one
SDRAM chip, the first price break does not occur until 25 units are ordered, where a 29% unit
price reduction is reached, and price breaks beyond that are not as significant. The relationship
for less expensive components, such as the capacitor with costs shown in Figure D.18(b), is a bit
different. In this case, the first price break occurs when 100 units are ordered, representing a 40%
drop in unit cost.
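
One consequence of the price breaks described above can be made concrete with a small calculation. The sketch below, in Python with a hypothetical function name, assumes a single price break with a flat discount and asks how many units must be needed before buying the full break quantity is cheaper than buying only the needed units at the undiscounted price. It uses the 29% break at 25 units and the 40% break at 100 units quoted above; real distributor pricing has several tiers and is not modeled here.

```python
def break_even_units(break_quantity: int, discount_percent: int) -> int:
    """Smallest number of needed units for which ordering the whole break
    quantity at the discounted price costs less than ordering only the
    needed units at full price.

    With full unit price p, this is the smallest n satisfying
        n * p > break_quantity * p * (100 - discount_percent) / 100.
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if break_quantity < 1 or not 0 < discount_percent < 100:
        raise ValueError("expected break_quantity >= 1 and 0 < discount_percent < 100")
    return break_quantity * (100 - discount_percent) // 100 + 1


sdram_break_even = break_even_units(25, 29)       # SDRAM example from the text
capacitor_break_even = break_even_units(100, 40)  # capacitor example from the text
print(sdram_break_even, capacitor_break_even)
```

Under these assumptions, a build would need at least 18 SDRAM chips, that is, 18 boards at one chip per board, before ordering up to the 25-unit break pays for itself. This is consistent with the observation that small Helios builds see little benefit from component volume pricing.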
The final cost of Helios also depends on the model of FPGA, SDRAM, SRAM, and flash
chip chosen for installation, since Helios supports several different capacities of each. The FPGA
is by far the most expensive component, especially when a large FPGA such as the FX60 is installed. The overall cost of the Helios board is broken down into the various cost components in
Figure D.19. It should be noted that our costs were actually somewhat lower than those shown in
the figure since we received various academic and university discounts.

D.7 Helios Specifications

The final specifications for the Helios board are summarized in Table D.10. Note that
the actual specifications depend on the components installed on the Helios board. Several Helios
boards have been produced with different specifications due to the variety of options available. The
data in Table D.10 reports ranges based on the compatible parts that are available.

D.8 Daughter Boards

In order to enable the many applications that can be targeted by Helios (see Section D.10),
several daughter boards have been designed and produced. The first Helios daughter board was
called the GBV (Ground-Based Vehicle) board. This board was designed specifically for autonomous ground vehicles and is pictured in Figure D.20(b). The board features two camera ports,
four servo ports, a Zigbee (or other) wireless link, a digital compass, and general-purpose I/O
connectors for user-defined purposes.
For aerial vehicle applications, the AVT board (Autonomous Vehicle Toolkit) was developed
and is shown in Figure D.20(a). This board supports most of the same essential features as the
GBV board but uses new camera ports for more advanced image sensors featuring a global shutter.
Additionally, it includes a small FPGA for video preprocessing, a video encoder to convert video
to NTSC for wireless transmission to a ground station, and a miniature 5-in-1 media card port for
the storage of still images or video on a flash memory card.

[Plots: (a) SDRAM Cost (USD) versus Quantity Ordered; (b) Capacitor Cost (1/100th USD) versus Quantity Ordered]

(a) 256-Mb SDRAM (Micron MT48H8M32LFB5-75)

(b) 0.1 uF, 0402 Capacitor (Kemet C0402C104K8RACTU)

Figure D.18: Helios component costs vs. quantity ordered. Source: www.digikey.com.

Table D.10: Helios Specifications Summary

Physical Properties
  Dimensions             65 × 90 × 12.5 mm (2.5 × 3.5 × 0.5 in)
  Weight                 37 g (1.3 oz)
  Stacking Height        5 mm bottom and 9 mm top minimum, up to 16 mm
FPGA
  Models                 Xilinx Virtex-4 FX20, FX40, or FX60
  Package                FF672 (27 × 27 mm, 672 ball, 1 mm pitch BGA)
  Features               See Tables D.2 and D.3
SRAM
  Architecture           ZBT, Pipelined or Flow-Through, 2.5 V
  Capacity               9, 18, 36, or 72 Mb (1 to 8 MB approx.)
  Data Width             36-bit
  Operating Frequency    Up to 250 MHz
SDRAM
  Architecture           Mobile SDR SDRAM, 2.5 V
  Capacity               128, 256, or 512 Mb (16 to 64 MB)
  Data Width             32-bit
  Operating Frequency    Up to 133 MHz (CL = 3)
Flash
  Architecture           Asynchronous, CFI Compatible, NOR Flash
  Capacity               32, 64, or 128 Mb (4 to 16 MB)
  Data Width             16-bit
  Initial Access Time    As low as 75 ns
Platform Flash
  Capacity               8, 16, or 32 Mb
  Data Width             Serial
  Operating Frequency    Up to 33 MHz
  Revisions              Up to 4 configurations
USB Peripheral
  Speed                  Full (12 Mbps) or High (480 Mbps)
  Receptacle             Mini-B
  Interface Width        16-bit
  Operating Frequency    Up to 48 MHz
Expansion Header
  Positions              120
  Pitch                  0.8 mm
  Available I/O          64
  Power                  3.3 V, 2.5 V, Ground
Power Supplies
  Type                   On-board, high-efficiency switching
  Ratings                See Table D.9

[Pie charts of board cost by category]

(a) FPGA (23.1%), Assembly (31.5%), PCB (23.3%), Other Parts (13.7%), RAM (3.7%), Flash (2.9%), USB (1.7%)

(b) FPGA (57.1%), Assembly (17.4%), PCB (12.8%), Other Parts (7.5%), Flash (2.2%), RAM (2.1%), USB (0.9%)

Figure D.19: Helios board costs breakdown, assuming a quantity of six boards. (a) With FX20,
$1087 total board cost. (b) With FX60, $1973 total board cost.
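
To put the percentages in Figure D.19 in dollar terms, the short Python sketch below multiplies each share by the corresponding total board cost. The assignment of each set of percentages to the FX20 or FX60 chart is an assumption made here by grouping the values so that each set sums to about 100%; the variable and function names are illustrative only.

```python
FX20_TOTAL_USD = 1087
FX60_TOTAL_USD = 1973

# Percentage shares as read from Figure D.19 (panel assignment is assumed).
FX20_SHARES = {"FPGA": 23.1, "Assembly": 31.5, "PCB": 23.3, "Other Parts": 13.7,
               "RAM": 3.7, "Flash": 2.9, "USB": 1.7}
FX60_SHARES = {"FPGA": 57.1, "Assembly": 17.4, "PCB": 12.8, "Other Parts": 7.5,
               "Flash": 2.2, "RAM": 2.1, "USB": 0.9}


def to_dollars(total_usd: float, shares: dict) -> dict:
    """Convert percentage shares of a total into whole-dollar amounts."""
    return {name: round(total_usd * pct / 100.0) for name, pct in shares.items()}


fx20_usd = to_dollars(FX20_TOTAL_USD, FX20_SHARES)
fx60_usd = to_dollars(FX60_TOTAL_USD, FX60_SHARES)

for name in FX20_SHARES:
    print(f"{name:12s} FX20 ${fx20_usd[name]:5d}   FX60 ${fx60_usd[name]:5d}")
```

With this assignment, assembly comes to about $342 and $343 and the PCB to about $253 in both cases, so nearly all of the difference between the two totals is the FPGA itself (about $251 versus $1127).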

A third daughter board was also developed for video and image processing research and
is shown in Figure D.20(c). In addition to daughter boards, several image sensor boards have
also been developed to support a variety of CMOS image sensors, such as the 1/4 in, color, VGA
resolution, CMOS sensor shown in Figure D.20(d).
Most of these boards are very simple in nature, consisting of a handful of connectors, a
two or four layer PCB, and a few integrated circuits. This makes the daughter boards much less
expensive to build than Helios and simple enough that they can be assembled and soldered by hand
in-house.

D.9 Intellectual Property

Helios, combined with one or more daughter boards, makes an excellent hardware platform
for machine vision research and the implementation of machine vision systems. However, this
platform, by itself, is not very useful without a substantial amount of infrastructure to facilitate use
of these boards. This infrastructure includes hardware modules to be run on the FPGA as well as
software to be run on the CPU. Together, the custom hardware modules and software running on
Helios implement a working system. The hardware modules that run on the FPGA are commonly
referred to as intellectual property cores, or simply IP cores.
Many of the IP cores needed to create a working system on Helios are freely available as
part of the Xilinx Embedded Development Kit (EDK). This includes system buses, bus bridges,
memory controllers, reset controllers, clock managers, interrupt controllers, communication interfaces, common serial interfaces (UART, I2C, and SPI), and many other standard components.

Figure D.20: Helios daughter boards. (a) AVT Board. (b) GBV Board. (c) Video Board. (d) Camera Board.

In order to support image processing as well as the control of various vehicle platforms, a
large amount of intellectual property has been developed specifically for use with Helios and its
daughter boards. This intellectual property exists in the form of VHDL code for hardware modules
and C code for software modules. Some of the hardware modules developed in the Robotic Vision
Laboratory for use on Helios include the following:
• Camera interfaces, to acquire images and video from standard CMOS sensors.
• Demosaicing modules, to convert raw image sensor data to RGB pixel values using various
interpolation and filtering techniques.
• Color correction module, to correct the image sensor’s output for more accurate color reproduction.
• Video interlacer, to convert from progressive scan to interlaced video.
• Gamma correction module, to correct the signal intensity of raw pixel values or to prepare
pixels for CRT display.
• Color space conversion modules, to convert from RGB to HSV or RGB to YCbCr color
spaces.
• Image derivative module, to compute horizontal and vertical image gradients.
• High-speed USB interface, for reading and writing data over the high-speed USB port.
• Floating-point unit (FPU), for performing floating-point arithmetic on the MicroBlaze or
PowerPC processors.
• DMA controllers, to automate data transfers between custom peripherals (such as the camera
interface) and system memory on the PLB or OPB system buses.
• Servo controllers, to control servo positions on both ground and aerial vehicles.
• Quadrature decoder module, to decode quadrature encoded signals for motor position feedback on wheeled robotic vehicles.
Each of these VHDL cores is highly parameterizable, allowing us to tailor its features,
behavior, and data widths to the application. This flexibility is essential for module reuse and the
minimization of wasted hardware resources. Most hardware cores also have an associated driver
or support software written in C, allowing the module to be controlled or managed by the CPU on
the FPGA.
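
As an illustration of what one of these support cores does, the following Python sketch models the behavior of a quadrature decoder in software. It is not the VHDL core described above: the channel ordering, the direction convention (the state sequence 00, 01, 11, 10 is counted as forward), and all names are assumptions chosen for the example.

```python
# Step table indexed by (previous_state << 2) | current_state, where
# state = (A << 1) | B. Forward order is assumed to be 00 -> 01 -> 11 -> 10.
# Entries are +1 (forward step), -1 (reverse step), or 0 (no change, or an
# invalid transition in which both channels changed at once).
_STEP = [
    0, +1, -1, 0,   # previous state 00
    -1, 0, 0, +1,   # previous state 01
    +1, 0, 0, -1,   # previous state 10
    0, -1, +1, 0,   # previous state 11
]


def decode_quadrature(samples, position=0):
    """Accumulate a position count from successive (A, B) channel samples.

    samples: iterable of (a, b) pairs, each value 0 or 1.
    position: starting count.
    """
    iterator = iter(samples)
    try:
        a, b = next(iterator)
    except StopIteration:
        return position  # no samples, nothing to decode
    previous = (a << 1) | b
    for a, b in iterator:
        current = (a << 1) | b
        position += _STEP[(previous << 2) | current]
        previous = current
    return position


forward_cycle = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
print(decode_quadrature(forward_cycle))
print(decode_quadrature(list(reversed(forward_cycle))))
```

A hardware implementation would typically sample both channels on a clock edge and apply the same table with a small register and counter. The lookup-table form is used here because it maps directly onto that structure.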
The hardware modules listed above serve primarily as support hardware that makes it possible to use Helios in the intended systems. Each application will also generally include one or more
image processing cores as well as software for managing the system. The nature of these image
processing modules and the control software depends on the application.

D.10 Applications

Helios has been used for a variety of applications involving both ground and air vehicles.
One of the first applications involved replicating an existing robotic system which used on-board
machine vision on a small, two-wheel robotic platform measuring 17 cm long × 20 cm wide ×
14.5 cm tall, shown in Figure D.21 [104]. This robot used a feature tracking algorithm to perform
3D reconstruction for the purpose of identifying and avoiding obstacles. The computing platform was switched from our previous FPGA development board, which hosted a Xilinx Virtex-II
XC2V1000 FPGA and used the MicroBlaze soft processor core, to a Helios board with a Virtex-4 FX20 FPGA and the PowerPC. With only minor design changes, the switch to Helios allowed us
to increase image processing throughput 2.26 times while reducing power consumption to 1.27 W,
a reduction of about 72% [137]. Even after these improvements, the design used only about 53%
of the logic resources available on the FX20, the smallest FPGA supported by Helios. Table D.11
summarizes the total power consumption of Helios for this application, including the power consumed by the GBV daughter board and camera, using various system clock frequencies and camera
frame rates. Notice that the total system power consumption for this application is much lower than
the original target of 5 W for Helios.

Figure D.21: Small robotic research platform.

Table D.11: Helios Power Consumption for 3D Reconstruction Application

Power (W)   FPGA Speed (MHz)   CPU Speed (MHz)   Camera Rate (fps)
1.17        75                 75                15
1.27        75                 75                34
1.33        100                300               15
1.39        100                300               34

Several 1/10th scale R/C trucks have also been used as the vehicle platform for Helios, as
shown in Figure D.22(a). These trucks were primarily used as a platform for undergraduate student
learning in the Robot Racers senior project at Brigham Young University. Students used Helios
with these trucks to develop a vision and control system that would allow the trucks to navigate a
color-coded race course (Figure D.22(b)), complete with hair-pin turns, orange cone obstacles, and
jumps [138]. This vehicle platform was also used with Helios, as well as other FPGA development
boards, to study target tracking for the purpose of vehicle following [139].

Figure D.22: Robot Racers senior project. (a) 1/10th scale truck platform with Helios
and GBV daughter board. (b) Senior project race track.

Helios has also been used on aerial vehicles. Helios was combined with a Kestrel Autopilot
on a Zagi air frame to study targeted landing of small UAVs [140]. This same UAV platform was
the target for research into miniaturized, see-and-avoid technologies for UAVs, which use on-board
vision to detect and avoid other aerial vehicles [141], a technology critical to the future use of UAVs
in crowded skies. In addition to fixed-wing UAVs, Helios has also been employed on a quad-rotor,
hovering aircraft that was used in the study of image-based flight stabilization and control [142],
[143].
In addition to unmanned vehicle research, Helios is also currently being used in the development of improved optical flow algorithms suitable for FPGA implementation [144], [145]. Such
an algorithm could later be used in an autonomous vehicle implementation.
These research projects show the diversity of applications for which Helios can be used.
The combination of a high-performance FPGA and a general-purpose embedded processor in a
small, low-power package makes Helios well-suited for a variety of applications requiring high-performance processing in a small space.

D.11 Limitations and Future Improvements

Helios has turned out to be a very capable platform for real-time machine vision and
autonomous vehicle research in the Robotic Vision Laboratory. Unfortunately, generally speaking,
no matter how fast and capable a computer system is, it is never quite fast enough. Having a faster,
more capable platform in the form of Helios has only allowed us to imagine new ways to take
advantage of the increases in performance and resources. The first Helios boards produced were
built with Xilinx Virtex-4 FX20 FPGAs in the -10 speed grade, the smallest and slowest FPGA
supported by Helios. We quickly found ourselves building additional boards with larger and faster
FPGAs, such as the FX60 in the -11 speed grade, in order to take advantage of the increased
performance and capacity.
However, in my experience, at least in an academic setting, the need to move to a larger and
faster FPGA is not motivated so much by the limitations imposed by the smaller FPGA but rather
seems to be motivated primarily by the desire to keep design effort to a minimum. For example,
some less experienced students made the transition from the FX20 to a larger and faster FPGA
as soon as they ran out of resources or failed to meet timing constraints, with little or no thought
being given to reducing the size or improving the timing of their existing design. As a result, larger
and faster FPGAs have become a crutch for these researchers. Instead, much more effort needs
to be made in the early design stages in order to minimize the logic needed and to ensure that the
design uses FPGA resources efficiently. Nevertheless, many designs will require a larger amount
of FPGA resources and will benefit from the faster timing of the more expensive FPGAs.
The next version of Helios (version 2.0) will undoubtedly take advantage of the scaling of
Moore’s law by using newer technology, such as the Xilinx Virtex-5 FXT FPGA. This FPGA is a
65 nm device with many improvements over the Virtex-4 FX. Among these, the PowerPC 405 has
been replaced by a PowerPC 440 processor, which features a superscalar architecture, larger 128-bit bus interfaces, integrated hard DMA controllers, and larger instruction and data caches. Additionally, the processor features an integrated, non-blocking, 128-bit, 5×2 crossbar. This crossbar
allows the system bus to be partitioned and separated into multiple, high-bandwidth data buses
or point-to-point channels, dramatically increasing the overall throughput available on the system
buses.
One challenge of designing a system using Helios, and any other system that combines
hardware and software, is the balancing of the work done on the general-purpose processor and the
work done in custom hardware. At times, it seems most convenient to do some preprocessing in
hardware, then some processing in software, followed by some post-processing in hardware. This
requires data to be moved into and out of the processor’s memory multiple times. This not only requires a high-performance memory, but also a high-performance bus with efficient DMA transfers,
both of which consume significant hardware resources. Due to the overhead associated with bus
transactions, we have encountered some applications where we have approached or exceeded the
capabilities of the 64-bit PLB connecting the PowerPC to the high-performance peripheral cores
and memory interfaces. Careful management of the bus is required in these applications. Fortunately, the performance limitations of the shared, 64-bit PLB will be greatly alleviated by the
increase in bus size to 128 bits, the addition of the crossbar and integrated DMA controllers, as
well as the increased bus partitioning available in the Virtex-5 FXT.
Another limitation of Helios is the number and bandwidth of available external memories.
We have found that many machine vision systems require an external memory at more than one
stage in the system’s image processing pipeline. For example, one such system used the SRAM
chip as image storage for a hardware feature tracking module. The same system also needed a
similarly sized external memory to interlace video for conversion to NTSC. In order to do both, we
require either two separate memories or a single fast memory that can be shared by both hardware
modules. Separate SRAM memories would be the easiest solution, but would require a larger
board and would likely consume more power. A single, shared memory would require a multi-port
memory controller to manage the sharing and would increase the memory latency.
Helios 2.0 could support this situation by switching from a single x36 SRAM to two x18
SRAM chips. This would provide the same overall SRAM bandwidth but would facilitate applications that require more than one SRAM memory. Alternatively, two x36 SRAM chips could
be used. Either way, the board area and power consumption would be increased. Another option
would be to switch from an SDR SRAM to a DDR or QDR memory. These chips offer higher
performance and better power/performance ratios, but would require the design of a compatible
multi-port memory controller. This issue requires more careful consideration and analysis before
a final decision can be made.
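
The bandwidth comparison between these SRAM options can be checked with simple arithmetic. The Python sketch below computes peak raw bandwidth as data width times clock rate times transfers per clock. It assumes the 250 MHz upper limit listed in Table D.10 for every option, including the DDR case, and ignores bus turnaround, parity bits, and controller overhead, so the figures are upper bounds rather than measured throughput. The function name is hypothetical.

```python
def peak_bandwidth_mbps(width_bits: int, clock_mhz: int,
                        chips: int = 1, transfers_per_clock: int = 1) -> int:
    """Peak raw bandwidth in megabits per second for a set of identical chips."""
    return width_bits * clock_mhz * chips * transfers_per_clock


# Options discussed in the text, all at an assumed 250 MHz clock.
single_x36 = peak_bandwidth_mbps(36, 250)                         # current Helios SRAM
dual_x18 = peak_bandwidth_mbps(18, 250, chips=2)                  # two x18 chips
dual_x36 = peak_bandwidth_mbps(36, 250, chips=2)                  # two x36 chips
ddr_x36 = peak_bandwidth_mbps(36, 250, transfers_per_clock=2)     # one DDR x36 chip

for label, value in [("single x36", single_x36), ("dual x18", dual_x18),
                     ("dual x36", dual_x36), ("DDR x36", ddr_x36)]:
    print(f"{label:10s} {value} Mb/s")
```

The two x18 chips match the single x36 chip in aggregate bandwidth, as stated above, but each hardware module would see only half of it. Two x36 chips or a DDR part at the same clock would double the aggregate figure under these assumptions.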
Another improvement to the memory system for Helios 2.0 would be the transition from
Mobile SDRAM to Mobile DDR or DDR2 SDRAM. This would dramatically improve the performance of the high-capacity DRAM without an increase in board size. It would, however, increase
power consumption, but the power/performance ratio would likely be improved.
The switch to higher performance memories would also necessitate the use of a more advanced PCB design tool. For Helios 1.0, the OrCAD Capture and Layout tools, part of Cadence
PSD v15.1, were used. Although considered to be a high-quality PCB development tool, we found
it to be largely outdated for a modern, multi-layer PCB such as Helios. A more sophisticated tool
with native support for multiple power and ground planes, signal length matching, and integrated
signal integrity simulation will be essential for Helios 2.0.
Another potential improvement to Helios would be the use of integrated power modules,
such as the µModule regulators manufactured by Linear Technology. Not available in such small
sizes when Helios 1.0 was developed, these are fully integrated power supply modules ideal for
powering FPGAs and other high-current, digital devices. Use of such modules would greatly
simplify and improve the quality of Helios’ power supplies, while also allowing for higher FPGA
power consumption, if needed.
Helios 1.0 was designed only for operation without a heatsink or fan. The large capacity of
the FX60 has raised concerns that the FPGA could overheat. The power consumption of an FX60
can easily exceed the 6 W limit at which the FPGA can operate correctly at room temperature
without a heatsink or fan. Many small and low profile heatsinks, both with and without fans, are
available. Helios 2.0 should be able to accommodate such a cooling device. The challenge with
Helios is that many Helios applications involve UAVs, which often crash land. A cooling device for
Helios would need to be able to withstand the high forces associated with such an impact without
being dislodged or damaging the board.
Several other less significant limitations have also become apparent. At least one unusual
application required more I/O signals on the expansion header than were provided, although for all
other applications the 64 I/O signals have been sufficient. The lack of additional clock-capable I/O
signals on the expansion header has also made certain designs fail, although simple workarounds
are usually available.
Helios 2.0 could also benefit from improved signal integrity across the expansion header.
Although this has not yet been a problem, some daughter boards might require higher data rates to
the Helios FPGA. High performance headers such as the Q Strip series by Samtec, which were not
readily available at the time Helios 1.0 was designed, would be ideal.
Implementing all or most of these improvements may require a larger board than the current
Helios. Several solutions exist to deal with this problem. One is to increase the number of layers
on the PCB from 8 to 10 layers. Such a change would allow for more internal routing, leaving the
top and bottom layers available for components to be more compactly placed. Blind and buried
vias could also be used to allow for higher component placement densities. Unfortunately, both of
these options significantly increase PCB manufacturing costs. Another option is to further partition
Helios into multiple boards. For example, one board could be dedicated to the FPGA and power
supplies while a second board could contain the memories and communication interfaces. This
would make the board outline of Helios smaller at the expense of increased system height. Such
an arrangement could also increase the flexibility of Helios by allowing a single FPGA board to be
used with a wider variety of daughter boards.
The large number of trade-offs involved in each design decision makes it very difficult to
quantify the value of such design decisions in a meaningful way. This is further complicated by
the fact that the optimal choice for nearly every decision is application dependent. As a result, the
design of a board such as Helios can be viewed as an art more than a science. Yet the value of a
modular, compact, high-performance computing platform based on an FPGA and general-purpose
processors is clear from the variety of applications for which Helios has been employed.

APPENDIX E. GLOSSARY OF ACRONYMS AND ABBREVIATIONS

Acronyms and abbreviations are used heavily throughout this text. Most acronyms and
abbreviations are defined the first time they are used in the text and this knowledge is assumed
thereafter. Additionally, understanding of many common acronyms and abbreviations, such as
standard SI units and basic computer terms, is assumed. In order to facilitate the identification of
important acronyms and abbreviations in the text, the majority of them are defined in this appendix,
including some not defined in the text. Definitions are given in the context of this work. These
acronyms and abbreviations may have different meanings in other contexts or disciplines.
One important distinction to be made in this text is the difference between the abbreviations
b (bit) and B (byte). A byte (B) consists of exactly eight bits (b). Thus, 2 MB is equal to 16 Mb.
Another common point of confusion is the meaning of the SI prefixes kilo- (k), mega- (M),
giga- (G), and so forth. In the context of computer data, these prefixes can mean, respectively, 2^10,
2^20, and 2^30 (binary prefixes) or they can mean 10^3, 10^6, and 10^9 (decimal prefixes) depending
on the context. In this text, when referring to memory chip capacities or the size of data, these
prefixes are used to refer to the powers of 2 (binary prefixes) since memories are addressed by binary
numbers. For example, a 1 Mb (megabit) memory contains 2^20 bits of data, which is the same as
128 kB, or 128 · 2^10 bytes. However, when referring to data transmission rates, these prefixes will
refer to the powers of 10 (decimal prefixes). For example, the 480 Mbps (megabits per second)
transmission rate of USB means that it can transfer 480 million bits per second. Fortunately, either
interpretation usually leads to numbers that are similar in magnitude.
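
The conventions above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration of the bit/byte and binary/decimal prefix rules stated in this appendix; the constant and function names are hypothetical.

```python
BITS_PER_BYTE = 8
KILO_BINARY = 2**10   # "k" for memory sizes in this text
MEGA_BINARY = 2**20   # "M" for memory sizes in this text
MEGA_DECIMAL = 10**6  # "M" for data transmission rates in this text


def memory_megabits_to_bytes(megabits: int) -> int:
    """Convert a memory capacity in Mb (binary prefix) to bytes."""
    return megabits * MEGA_BINARY // BITS_PER_BYTE


def rate_mbps_to_bits_per_second(mbps: int) -> int:
    """Convert a transmission rate in Mbps (decimal prefix) to bits per second."""
    return mbps * MEGA_DECIMAL


one_megabit_in_kb = memory_megabits_to_bytes(1) // KILO_BINARY
sdram_sizes_mb = [memory_megabits_to_bytes(c) // MEGA_BINARY for c in (128, 256, 512)]
usb_high_speed_bps = rate_mbps_to_bits_per_second(480)
binary_to_decimal_mega = MEGA_BINARY / MEGA_DECIMAL

print(one_megabit_in_kb, sdram_sizes_mb, usb_high_speed_bps, binary_to_decimal_mega)
```

The last value shows why the two interpretations give similar magnitudes: at the mega scale the binary prefix is less than 5% larger than the decimal one.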

E.1 Acronyms and Abbreviations

2D: Two dimensional.
3D: Three dimensional.
ALU: Arithmetic and Logic Unit. A circuit which performs basic arithmetic and logic operations,
such as addition, subtraction, and bitwise Boolean operations.
AVT: Autonomous Vehicle Toolkit. Refers to a Helios daughter board that was targeted for use
on aerial vehicles but could also be used on ground vehicles as a replacement to the GBV board.
AW: Adaptive Window. A stereo correlation technique proposed by Kanade and Okutomi [43].
b: Bit, or binary digit. This abbreviation is commonly used to refer to a single bit, as opposed to
a byte (B).
B: Byte. Commonly refers to an octet, or a collection of eight bits, although historically the size
of a byte may vary.
BGA: Ball Grid Array. A package for integrated circuits in which the metal contacts to the chip
are formed by an array of tiny solder balls which are melted to the PCB.
BRAM: Block RAM. Refers to the memory blocks integrated into the fabric of an FPGA.
CAN: Controller Area Network. A standard network protocol and bus standard for communication between microcontrollers and devices, originally intended for automotive applications.
CCD: Charge-Coupled Device. An imaging technology commonly used in electronic cameras. In
embedded applications, CCD image sensors are largely being replaced by CMOS image sensors,
which can take advantage of standard CMOS processes and allow for higher levels of integration.
CIF: Common Intermediate Format. Generally refers to the standard 352×288 image resolution.
CLB: Configurable Logic Block. The basic configurable unit of a Xilinx FPGA. Other vendors
typically use a different name. A Xilinx CLB typically contains a switch matrix, several LUTs, a
matching number of flip-flops, and other logic.
CMOS: Complementary Metal Oxide Semiconductor. The predominant class of integrated circuit
technology used in most digital devices. The same process can also be used to produce image
sensors.
COTS: Commercial, Off-The-Shelf. Refers to technology that is readily available for purchase.
CPU: Central Processing Unit. A common term for a general-purpose, software-programmable
processor.
CRT: Cathode Ray Tube. A class of display devices employing a vacuum tube, electron gun, and
a fluorescent screen, which includes the common non-flat-panel television screen.
DCI: Digitally Controlled Impedance. A technology available on Xilinx FPGAs in which the
output impedance of the I/O drivers can be controlled to match a specific characteristic impedance.
DCM: Digital Clock Manager. A block available on Xilinx FPGAs used to generate new clock
frequencies from input clocks and manage clock distribution.
DCR: Device Control Register. A simple, IBM defined bus for on-chip communication between
the CPU and peripheral devices for the purpose of device configuration. One of the standard buses
supported by the Xilinx EDK.
DDR: Double Data Rate. A digital signaling method in which data is clocked on both the positive
and negative edges of a clock, effectively doubling throughput.
DED: Disparity Estimate Density. The percentage of pixels that were assigned a disparity estimate.
DIMM: Dual In-line Memory Module. A small circuit board with several RAM chips and a
standard interface, commonly used in PCs.
DIP: Dual In-line Package. An electronic device package in which dual rows of contacts are used
to connect the device to a PCB.
DMIPS: Dhrystone MIPS. A benchmark based on the popular Dhrystone benchmark program, a
synthetic, integer-only benchmark for CPU performance. Designed to give a more meaningful
result than MIPS, the DMIPS measurement normalizes the performance of the benchmark relative
to the performance of a 1 MIPS VAX 11/780 computer. See also MIPS.
DRAM: Dynamic RAM. The dominant form of high-density, random access memory. Each bit is
stored as a charge on a tiny integrated capacitor. It is called dynamic because the charge must be
periodically refreshed in order for the capacitors to maintain their data.
DSP: (1) Digital Signal Processor. A software programmable processor optimized for digital
signal processing operations. (2) Digital Signal Processing. The type of processing used to operate
on digital signals, such as digital audio and digital image data.
EDK: Embedded Development Kit. The software suite provided by Xilinx to facilitate the design
of FPGA systems with integrated soft or hard processor cores.
EEPROM: Electrically Erasable Programmable Read-Only Memory. A small, non-volatile
memory that can be electrically erased and reprogrammed.
EMI: Electromagnetic Interference.
ESR: Equivalent Series Resistance. A measurement commonly used for capacitors to quantify
the amount of resistive loss that occurs when current flows into and out of a capacitor.
F: Farad. The SI unit of measure for capacitance.
FIFO: First In, First Out. Generally refers to a digital circuit which buffers data in a FIFO order.
FLOPS: Floating-Point Operations Per Second.
FPGA: Field Programmable Gate Array.
fps: Frames Per Second. Refers to the number of images per second in a video.
FPU: Floating-Point Unit. A digital circuit which performs floating-point arithmetic.
g: Gram. The SI unit of measure for mass.
Gb: Gigabit. Can refer to 2^30 or 10^9 bits, depending on context.
GB: Gigabyte. 2^30 or 1,073,741,824 bytes (B).
GBV: Ground-Based Vehicle. Refers to a Helios daughter board targeted for use on ground-based
vehicles, as opposed to aerial vehicles.
GFLOPS: Giga-FLOPS. One billion FLOPS.
GHz: Gigahertz. 10^9 hertz (Hz).
GPP: General-Purpose Processor.
GPS: Global Positioning System. Generally refers to a device which receives signals from Earth
orbiting satellites in order to calculate its position on the Earth’s surface.
HDL: Hardware Description Language. A computer language for the formal description of electronic circuits.
HSV: Hue, Saturation, Value. A color space in which an image pixel is represented by measures
of the three parameters.
Hz: Hertz. The SI unit of measure for frequency or cycles per second (1/s).
I2C: Inter-Integrated Circuit. A low-speed, shared serial bus invented by Philips that is used to
communicate between integrated circuits.
IBIS: I/O Buffer Information Specification. A simple model that describes the behavior of I/O
buffers on integrated circuits and is usually used to perform signal integrity simulations.
IC: Integrated Circuit. Another name for a microchip.
in: Inch. The Imperial unit of measure for length used in the United States and equaling 25.4 mm.
I/O: Input/Output.
IP: (1) Intellectual Property. Commonly refers to technology that is the creation of the mind and
does not necessarily have a physical representation, such as source code. (2) Internet Protocol. See
TCP/IP.
ISA: (1) Instruction Set Architecture. The ISA of a computer defines the hardware/software interface. In other words, it precisely defines the format and behavior of the supported computer
instructions. (2) Industry Standard Architecture. An 8-bit system bus, later extended to 16 bits,
originally used in PCs. It has been replaced by the 32-bit PCI bus, and its derivatives, but is still
sometimes used in embedded devices, such as those based on PC/104.
JEDEC: Joint Electron Device Engineering Council. Now commonly known as the JEDEC Solid
State Technology Association, JEDEC defines many standards related to semiconductor technology, including packages, logic signaling, device interfaces, and so forth.
JTAG: Joint Test Action Group. Commonly refers to a serial communication bus used for
debugging and testing electronic components.
kb: Kilobit. Can refer to 2^10 or 10^3 bits, depending on the context. Also commonly written as Kb,
Kbit, or kbit.
kB: Kilobyte. 2^10 or 1,024 bytes. Also commonly written as KB, Kbyte, or kbyte.
LCD: Liquid Crystal Display. The most common display type in embedded devices.
LED: Light-Emitting Diode. A small light source made from a semiconductor diode.
LIDAR: Light Detection and Ranging. Similar to radar and sonar, a LIDAR device uses a laser to
obtain information regarding the physical surroundings.
LRCC: Left-Right Consistency Check. A robust consistency check for eliminating erroneous
correlation matches in a stereo vision system.
LUT: Lookup Table. Commonly refers to the small memories used to implement logic on an
FPGA.
m: Meter. The SI unit of measure for length or distance.
MAE: Mean Absolute Error.
MAV: Micro Aerial Vehicle. A very small UAV.
Mb: Megabit. Can refer to 2^20 or 10^6 bits, depending on the context. Also commonly written as
Mbit.
MB: Megabyte. 2^20 or 1,048,576 bytes. Also commonly written as Mbyte.
Mbps: Megabits per second. Commonly refers to 10^6 bits per second.
MHz: Megahertz. 10^6 hertz (Hz).
mil: Milli-inch. 10^-3 inches or one thousandth of an inch.
MIPS: (1) Million Instructions Per Second. A crude measure of CPU performance. One MIPS
on a specific CPU can actually represent a very different level of performance from one MIPS on
a different CPU architecture. (2) MIPS Technologies, Inc., is a company that manufactures RISC
processors. (3) A CPU designed by MIPS Technologies, Inc., using the MIPS architecture.
mm: Millimeter. 10^-3 meters (m).
MOSFET: Metal-Oxide Semiconductor Field Effect Transistor. The type of transistor commonly
used in CMOS logic.
ms: Millisecond. 10^-3 seconds (s).
MSE: Mean Squared Error.
MSW: Multiple Supporting Windows. An improved stereo correlation technique proposed by
Hirschmüller [45]. This name was coined by the author of this dissertation since Hirschmüller
failed to give the method a name.
mW: Milliwatt. 10^-3 watts (W).
NAND: In this work, NAND refers to a particular class of flash memory devices that features
higher density and higher data transfer rates than NOR flash, but suffers from high access latency.
NAND also commonly refers to a logic gate that is the logical complement (NOT) of the AND
gate.
nm: Nanometer. 10^-9 meters (m).
NOP: No Operation. Commonly refers to a computer instruction or clock cycle in which no work
is performed.
NOR: In this work, NOR refers to a particular class of flash memory devices that features low
access latency but lower density than NAND flash. NOR also commonly refers to a logic gate
that is the logical complement (NOT) of the OR gate.
NTSC: National Television System Committee. Usually refers to the standard analog television
signal format used in the United States and several other countries.
OCM: On-Chip Memory. Refers to a secondary memory interface on the PowerPC for direct
connection to on-chip memory, as opposed to using the PLB interfaces for memory accesses.
OPB: On-chip Peripheral Bus. An IBM-defined bus for on-chip communication between peripherals. One of the standard buses supported by the Xilinx EDK.
OS: Operating System. The software component of a computer that manages system resources
and the execution of programs.
OTG: On-The-Go. Refers to a USB standard in which a USB capable device can act as the host
or a peripheral using the same port.
oz: Ounce. The Imperial unit of measure for weight, equivalent to about 28.35 grams (g) on Earth.
PC: Personal Computer. The common name for a general-purpose desktop or laptop computer.
Also often refers specifically to a computer running Microsoft Windows.
PCA: Percentage Correct of Assigned pixels. The percentage of pixels correctly assigned a disparity estimate, out of the pixels that were not discarded by an error filter. This is a term coined by
the author of this dissertation.
PCB: Printed Circuit Board. The circuit board on which integrated circuits, connectors, and other
electronic devices are soldered to make a complete electronic system.
PCP: Percentage of Correct Pixels. The percentage of disparity map pixels that were assigned a
correct disparity estimate. This is a term coined by the author of this dissertation.
PCI: Peripheral Component Interconnect. A 32-bit system bus developed to replace the ISA bus
in PCs. It is currently being succeeded by PCI Express in PCs but can still be found in some
embedded devices, such as those based on PC/104-Plus and PCI-104.
PDA: Personal Digital Assistant. Refers to a class of small, hand-held, computer devices used for
scheduling, calendaring, note taking, and other functions.
PDS: Power Distribution System.
PEP: Percentage of Erroneous Pixels. The percentage of disparity map pixels that were assigned
an erroneous disparity estimate. This is a term coined by the author of this dissertation.
PHY: Physical Interface. Generally refers to the circuit that converts high-speed signals, such
as USB or Ethernet, into digital ones and vice versa. The PHY connects directly to the physical
medium used for data transmission.
PLB: Processor Local Bus. A high-performance, IBM-defined bus for on-chip communication
between a processor and peripheral devices. One of the standard buses supported by the Xilinx
EDK.
POR: Power-On Reset. Refers to the automatic reset of a device that is performed after the device
is powered on.
PROM: Programmable Read-Only Memory. A type of read-only memory that can be initially
programmed to desired values. The term is also used by Xilinx to refer to memory devices used to
store FPGA configurations, even though many of these devices are erasable.
PRP: Percentage of Rejected Pixels. The percentage of disparity map pixels that were invalidated
by an error filter. This is a term coined by the author of this dissertation.
ps: Picosecond. 10^-12 seconds (s).
PSoC: Programmable System on a Chip. A complete computing system implemented on a single
programmable chip, such as an FPGA.
QDR: Quad Data Rate. Refers to an SRAM architecture that employs DDR signaling in addition
to separate read and write ports that can be used simultaneously.
QFN: Quad Flat No leads. An integrated circuit package that has exposed contacts around the
edge of the package that are soldered to the PCB, but has no protruding leads. This makes the
package outline smaller than a package with leads.
RAM: Random Access Memory. A data storage device in which memory locations can be accessed in any order, typically in constant time.
R/C: Radio Control. Commonly refers to R/C cars, trucks, and airplanes enjoyed by hobbyists.
RGB: Red, Green, Blue. A color space in which an image pixel is represented by measures of the
three parameters.
RISC: Reduced Instruction Set Computer. A CPU emphasizing simple instructions that do less,
thus simplifying and potentially accelerating the rate at which instructions can be executed. Most
modern instruction set architectures used in embedded processors, such as the PowerPC, employ a
RISC strategy.
RMS: Root Mean Square. Used to refer to RMS error, one of the most common statistical error
measures.
ROM: Read-Only Memory. A data storage device that can be read but not written. Also sometimes refers to a memory that can be initially programmed (e.g., PROM) or erased using a special
process and rewritten (e.g., EEPROM).
RTOS: Real-Time Operating System. A simple operating system with small memory requirements designed to meet strict timing requirements.
s: Second. The SI unit of measure for time.
SAD: Sum of Absolute Differences. A similarity metric for two pixels in a stereo image pair.
Mathematically, the metric computes the L1 distance between two pixel vectors, one from each
image. Also called rectilinear, city-block, or Manhattan distance. See Section A.3 for a precise
definition.
SBC: Single-Board Computer. A small computer for embedded applications in which all essential
computing components, such as CPU, memory, and I/O, are installed on a single PCB. By this
definition, Helios is also an SBC.
SDK: Software Development Kit. A set of development tools and source code examples that
allow one to create applications for a certain platform.
SDR: Single Data Rate. Used to differentiate between DDR and standard signaling. In other
words, an SDR signal clocks data exclusively on either the rising or the falling edge of the clock.
SDRAM: Synchronous DRAM. A DRAM memory that uses a clock to synchronize data transactions.
SHD: Sum of Hamming Distances. A similarity metric for bit vectors analogous to SAD. The
term SHD was coined by the author of this dissertation to distinguish it from SAD.
SI: (1) Signal Integrity. Refers to the quality of electrical signals in a system, or the analysis and
verification of the quality. (2) Système International d’Unités, or International System of Units.
Defines the standard units of measure.
SMW: Symmetric Multiple Windowing. An improved stereo correlation technique proposed by
Fusiello, Roberto, and Trucco [44].
SoC: System on a Chip. A complete computing system implemented on a single chip.
SO-DIMM: Small Outline DIMM. A smaller version of the DIMM for use in notebooks and other
small computers.
SPI: Serial Peripheral Interface. A low-speed, serial bus invented by Motorola that is used to
communicate between integrated circuits.
SPST: Single pole, single throw. A simple switch in which the on or depressed position makes an
electrical connection between two contacts and the off or released position breaks the connection
between the two contacts. An example is the common light switch.
SRAM: Static RAM. A memory technology that uses bistable circuitry to store each data bit. It is
called static because it does not need to be periodically refreshed, unlike DRAM.
SSAD: Sum of SAD. An extension to the SAD correlation method allowing for the correlation of
more than two cameras.
SSSD: Sum of SSD. An extension to the SSD correlation method allowing for the correlation of
more than two cameras.
TCP/IP: Transmission Control Protocol/Internet Protocol. A standard combination of two network protocols (TCP and IP) that manages flow control and ensures in-order, reliable delivery of
data.
UART: Universal Asynchronous Receiver/Transmitter. The receiver/transmitter unit for a standard serial port (RS-232).
UAV: Unmanned Aerial Vehicle.
UDP/IP: User Datagram Protocol/Internet Protocol. A standard combination of two network protocols (UDP and IP) that allows for the transmission of short datagrams. Unlike TCP, UDP does
not ensure reliable or in-order delivery.
uF: Microfarad. 10^-6 farads (F).
UGV: Unmanned Ground Vehicle.
USB: Universal Serial Bus. A standard interface used for the connection of peripherals to a PC.
V: Volt. An SI derived unit of measure for potential difference or electromotive force.
VGA: Video Graphics Array. Commonly refers to the 640×480 graphics resolution. Historically,
the meaning of VGA is more complex.
VHDL: VHSIC (Very High Speed Integrated Circuit) Hardware Description Language. One of
the most common hardware description languages for describing synthesizable hardware circuits.
VLIW: Very Long Instruction Word. A computer architecture in which each instruction dispatched by the processor is actually a set of several simple instructions to be executed on distinct
functional units.
W: Watt. The SI derived unit of measure for power, equaling one joule of energy per second of
time.
YCbCr: A color space in which each image pixel is represented by a measure of luma (Y), blue
chroma (Cb), and red chroma (Cr).
ZBT: Zero Bus Turnaround. An SRAM architecture in which the SRAM can switch between
reading and writing without any delay.
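
As an illustration of the SAD and SHD entries above, the following minimal Python sketch computes both metrics on small example vectors. It assumes that pixel windows are given as flat lists of integer intensities and that the bit vectors compared by SHD are stored as non-negative Python integers. The function names sad, hamming_distance, and shd are hypothetical and are not taken from this dissertation; the precise SAD definition used in this work is the one given in Section A.3.

```python
def sad(pixels_a, pixels_b):
    """Sum of absolute differences (L1 distance) between two pixel vectors.

    Assumption: both arguments are equal-length sequences of integer
    intensities, one taken from each image of the stereo pair.
    """
    if len(pixels_a) != len(pixels_b):
        raise ValueError("pixel vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))


def hamming_distance(bits_a, bits_b):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(bits_a ^ bits_b).count("1")


def shd(codes_a, codes_b):
    """Sum of Hamming distances between two sequences of bit vectors.

    Assumption: each bit vector is stored as a non-negative integer,
    one code per pixel position in the window.
    """
    if len(codes_a) != len(codes_b):
        raise ValueError("code vectors must have the same length")
    return sum(hamming_distance(a, b) for a, b in zip(codes_a, codes_b))


if __name__ == "__main__":
    # Illustrative values only; these are not data from the dissertation.
    print(sad([10, 20, 30], [12, 18, 30]))
    print(shd([0b1011, 0b1111], [0b0010, 0b1111]))
```

In both cases a smaller value indicates a closer match. The two functions have the same structure and differ only in the per-element distance, which is why SHD is described as analogous to SAD.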

