W&M ScholarWorks
Dissertations, Theses, and Masters Projects

Theses, Dissertations, & Master Projects

2006

Real-time Retinex image enhancement: Algorithm and
architecture optimizations
Glenn Derrick Hines
College of William & Mary - Arts & Sciences

Follow this and additional works at: https://scholarworks.wm.edu/etd
Part of the Computer Sciences Commons

Recommended Citation
Hines, Glenn Derrick, "Real-time Retinex image enhancement: Algorithm and architecture optimizations"
(2006). Dissertations, Theses, and Masters Projects. Paper 1539623490.
https://dx.doi.org/doi:10.21220/s2-zgwv-7r76

This Dissertation is brought to you for free and open access by the Theses, Dissertations, & Master Projects at W&M
ScholarWorks. It has been accepted for inclusion in Dissertations, Theses, and Masters Projects by an authorized
administrator of W&M ScholarWorks. For more information, please contact scholarworks@wm.edu.

REAL-TIME RETINEX IMAGE ENHANCEMENT: ALGORITHM
AND ARCHITECTURE OPTIMIZATIONS

A Dissertation
Presented to
The Faculty of the Department of Computer Science
The College of William & Mary in Virginia

In Partial Fulfillment
Of the Requirements for the Degree of
Doctor of Philosophy

by
Glenn Derrick Hines
2006

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

APPROVAL SHEET
This dissertation is submitted in partial fulfillment of
the requirements for the degree of

Doctor of Philosophy

Glenn Hines

Approved, January 2006

J. Philip Kearns
Dissertation Advisor

Zia-ur Rahman
Dissertation Co-Advisor

Weizhen Mao

Mark Hinders
Department of Applied Science


To my lovely wife Sunita and our adorable children Jordan, Jada and Jamison


Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction

2 Retinex Image Enhancement

3 Digital Signal Processors
  3.1 TMS320C6711
  3.2 TMS320C6713
  3.3 TMS320DM642
  3.4 TMS320C6416

4 Test Environment
  4.1 DSP Evaluation Modules
    4.1.1 Video Capture and Display for C6711 and C6713 EVMs
    4.1.2 Video Capture and Display for DM642 and C6416 EVMs
  4.2 Development Tools
  4.3 Test-Bed Components and Operation
  4.4 Performance Analysis
  4.5 Real-time Parameter Updates
  4.6 Retinex Task Within DSP/BIOS

5 Optimizations and Performance Results
  5.1 Single-Scale Monochrome Retinex Optimizations
    5.1.1 Apply Convolution Equivalence
    5.1.2 Pre-Compute the Kernel
    5.1.3 Baseline Algorithm Performance
    5.1.4 Pre-Compute the Logarithm
    5.1.5 Use DMA to Transfer Columns
    5.1.6 Reduce Gaussian Kernel Computations
    5.1.7 Merge Algorithm Components
    5.1.8 Minimize Data Transfer Overhead
    5.1.9 Use Cache-optimized FFTs
  5.2 Map Optimized SSMR to C6713
  5.3 Map Optimized SSMR to DM642
    5.3.1 Apply Intrinsics
    5.3.2 Modify the Architecture
  5.4 Multi-Spectral Multi-Scale Retinex Optimizations
    5.4.1 Reuse Transformed Input Image
    5.4.2 Reduce Computations
    5.4.3 Buffer Across Spectral Bands
    5.4.4 Allocate Log Values in L2 Memory
    5.4.5 MSR Performance Results

6 Enhanced Vision System Case Study
  6.1 Background
  6.2 Image Processing Functions
  6.3 Additional Requirements
  6.4 Results

7 Future Research
  7.1 Luma-only Retinex
  7.2 Improving Current Performance
  7.3 Processing Larger Format Images
  7.4 Migrating to a Multiprocessor Environment

8 Conclusions

A Multi-Image Registration
  A.1 Background
  A.2 Registration algorithms
    A.2.1 SS algorithm
    A.2.2 MLR algorithm
  A.3 Results
    A.3.1 SS algorithm
    A.3.2 MLR algorithm
    A.3.3 Discussion
  A.4 Summary

B Field Programmable Gate Arrays

C DM642 EVM Flash Programming Guidelines

Bibliography

Vita

ACKNOWLEDGMENTS
The list of people I wish to acknowledge would require another dissertation, but I will
point out a few. First and foremost, I thank my research advisor, Dr. Zia-ur Rahman,
for his guidance, patience, and great ideas. One day the world will know and value your
genius. I also thank Daniel Jobson for his wonderful musings on image processing research,
and Glenn Woodell for always lending a helping hand. The seeds of the Retinex that the
three of you planted have grown into quite a tree. Thank you for allowing me to contribute
to a branch.
I also thank my professors at William and Mary, Doctors Torczon, Bynum, Stockmeyer,
Stathopoulos, Zhang, Noonan, Prosl, Rahman, and Kearns, for all of your enlightening
classroom sessions, and patiently satisfying all of my “What if you tried this?” questions. In
particular, special consideration is given to Professor Kearns for giving me the opportunity
to continue my research in image processing and serving as my advisor.
Grateful acknowledgement is given to Vanessa Godwin. Your advice throughout my
tenure as a student was invaluable. Thanks for “penetrating the bureaucracy” for me.
I also sincerely thank my committee members Weizhen Mao, Andreas Stathopoulos, and
Mark Hinders for all of your comments and suggestions, and for your willingness to take
time out of your schedule for me.
The work contained in this dissertation was supported by funding from NASA Langley
Research Center. Sincere appreciation is extended to my managers at NASA LaRC for
giving me the opportunity to pursue my research dreams. This list spans many years now
but includes Dr. Thomas Shull, Pam Rinsland, Steve Jurczyk, Steve Sandford, Randy Reagan,
and Kathryn Stacy. Thanks also go to Steven Harrah for project support, Cathryn
Murray-Wooddell for working your magic with resources, and George Allison for keeping
the Ph.D. pipeline going at NASA LaRC. Many thanks also go out to my colleagues at
NASA, but especially to my old lunch bunch including Duane, Cy, Danette, Michael, Marilee, Shelley, Felicia, and Lloyd, who is no longer with us. We’ve solved many of the world’s
problems on napkins; now if only we can convince everyone else to listen.
I also wish to thank a few special people: my parents David and Helen Hines for
raising me and my brothers, Ronnie and Brian, in a home that always valued knowledge
and education, and for your good genes! Neville and Dorothy Etwaroo for doing the same
for Sunita! My many extended family members for all of your words of encouragement and
prayers, and my many friends including Roger Bailey, Charles Stump, Shawn Williams,
Steve Green, Kenneth Arrington, Andres Alvarez, Levi Little, and Bruce Hornsby for keeping
me laughing throughout the years.
And although I dedicated this document to my immediate family, I again give my
utmost gratitude to my wife Sunita Etwaroo and our children Jordan Milan, Jada Nalini,
and Jamison Glenn for all of your love and support. You are my air.


List of Tables

3.1 DSP Specifications
5.1 Initial performance results from the first implementation of the SSMR.
5.2 Performance measurements after using logarithm tables and combining α and β.
5.3 Performance results after using 2D DMA data transfers.
5.4 Performance results after using 2D DMA data transfers.
5.5 Performance results after merging algorithm stages. Since the forward and
    inverse column execution times are effectively merged together, the time to
    process columns is now in item “processcols”.
5.6 Final SSMR performance results using the C6711 DSP.
5.7 Measured Retinex performance on DM642 and C6416 processors. The 133
    and 200 refer to the clock speed of the EMIF bus. Measurement units are in
    both milliseconds, and frames per second in parentheses.
5.8 Comparison of final SSMR performance using the C6711 and the C6416 DSPs.
5.9 C6416 CPU Loading for different Retinex configurations.
6.1 Sensor Specifications
7.1 FFT Benchmarks for C6711, DM642 and C6416.
7.2 FFT Processing Time Benchmarks using C6711 and DM642 for various sized
    images.
7.3 FFT storage requirements and transfer times (based on row oriented data)
    for various sized images. Storage is based on complex image data stored as
    integers. Transfer times are based on a 64-bit EMIF bus clocked at 133 MHz.
A.1 Sensor Specifications
A.2 Updated Sensor Specifications
A.3 Visible to SWIR MLR Coefficients
A.4 LWIR to visible SWIR MLR Coefficients

List of Figures

2.1 The top row of images from left to right have simulated tungsten, fluorescent,
    and sunlight illumination sources. The bottom row has the same images after
    Retinex processing. The effects of the different illumination sources is nearly
    completely removed.
2.2 Many image processing algorithms would either saturate the bright regions
    or clip the dark regions of the image on the left. The Retinex processed
    image on the right appears almost uniformly illuminated without exhibiting
    these effects.
2.3 On the left is a low contrast, dimly lit grayscale digital image; on the right is
    the single-scale Retinex processed image. Single-scale processing increases
    the contrast and sharpness.
3.1 Primary DSP components include the CPU, L1 Data Cache, L1 Program
    Cache, L2 memory (SRAM/Cache) and EDMA Controller.
3.2 General outline of 2-level internal memory architecture of C67x processors.
    The dashed boxes are user addressable memory.
3.3 Configuration modes for the C6711 L2 memory.
3.4 Block diagram of primary DM642 components. The DM642 has special
    instruction extensions to accelerate video applications.
3.5 Block diagram of primary C6416 Components. Note the larger L2 memory
    and 64-bit EMIF bus.
4.1 Picture of DM642 EVM board. Numerous components are on the EVM
    circuit board to support testing the DSP for a wide variety of applications.
    We primarily use the peripherals associated with video capture and display.
4.2 IDC video capture subsystem
4.3 IDC video display subsystem
4.4 DM642 EVM block diagram
4.5 C6416 EVM block diagram
4.6 Block diagram of the test-bed. The Host PC only provides setup information
    to the EVM; after initiation, the DSP executes independently.
5.1 Capture Video Frame with input from camera on the left, and Retinex output
    on the right. Retinex parameters are α = 175, β = 135, and σ = 80; note
    that we are nearly reaching the noise limit of the camera.
5.2 Retinex performance in time (bottom axis) and frames per second (top axis)
    to process 1 spectral band of image data on DM642 with 133 MHz EMIF
    (dotted line), DM642 with 200 MHz EMIF (dashed line), and C6416 (full line).
5.3 Retinex performance in time (bottom axis) and frames per second (top axis)
    to process 2 spectral bands of image data on DM642 with 133 MHz EMIF
    (dotted line), DM642 with 200 MHz EMIF (dashed line), and C6416 (full line).
5.4 Retinex performance in time (bottom axis) and frames per second (top axis)
    to process 3 spectral bands of image data on DM642 with 133 MHz EMIF
    (dotted line), DM642 with 200 MHz EMIF (dashed line), and C6416 (full line).
5.5 First snapshot taken 40 seconds into the video recorded at NASA LaRC.
    The frame as captured by the camera is on the left and the real-time Retinex
    processed frame is on the right.
5.6 Second snapshot taken 6 minutes and 28 seconds into the video. Colors are
    nearly completely indeterminable and objects are difficult to distinguish in
    the unprocessed image. Colors and objects are still clear in the processed
    frame.
5.7 Third snapshot taken 14 minutes 28 seconds into the video. The only
    distinguishable object in the unprocessed frame is the tail-lights on the vehicle.
    Although noisy, the real-time Retinex processed image still clearly shows
    most of the major objects in the first snapshot including spheres, tree lines,
    and parked vehicles.
6.1 The EVS LWIR, SWIR, and visible-band cameras mounted to a baseplate,
    and the enclosure shell. Inaccurate bore-sighting can cause image registration
    problems.
6.2 EVS camera enclosure mounted forward-looking underneath the NASA 757.
6.3 The EVS acquires data during the entire flight but take-off and landing phases
    are critical. The simulated shaded area depicts the field of view (FOV) of
    the cameras.
6.4 Examples of the imagery generated by each camera in good weather
    conditions. The images from cameras must be registered, enhanced, fused and
    displayed to the pilot in real-time.
6.5 Image processing architecture and functions of the EVS. Analog NTSC
    camera outputs are currently processed. The SWIR data is used as the baseline
    for registration since it has the smallest field of view.
6.6 DM642 EVM, signal splitter boards, and power supply in flight box.
6.7 Flight box in flight pallet on NASA 757.
6.8 A frame from the EVS SWIR camera before processing. The faint vertical
    lines were part of the input image and probably caused by subsampling in
    the video distribution system.
6.9 A frame from the EVS LWIR camera before processing. The LWIR camera
    output is actually rotated 180° from what is shown.
6.10 SWIR frame after enhancement.
6.11 LWIR frame after enhancement and registration to the SWIR image.
6.12 Enhanced, registered and fused output image.
7.1 Data flow diagram of MSR tasks
A.1 Original SWIR
A.2 Original LWIR
A.3 Original Visible
A.4 Cropped SWIR
A.5 SS Reg. LWIR
A.6 SS Reg. visible
A.7 SWIR and SS Registered LWIR
A.8 SS Registered LWIR and visible
A.9 Repeated SWIR
A.10 MLR Reg. visible
A.11 SWIR and MLR Reg. visible
A.12 Repeated MLR Registered visible
A.13 MLR Reg. LWIR
A.14 MLR Reg. visible and LWIR
A.15 Orig. SWIR at Time 26:14:28
A.16 Orig. LWIR at Time 26:14:28
A.17 Orig. visible at Time 26:14:28
A.18 MLR Registered visible at Time 26:14:18
A.19 SWIR and MLR Registered visible
A.20 MLR Registered LWIR at Time 26:14:18
A.21 MLR Registered visible and LWIR
B.1 High-level block diagram of a typical FPGA Architecture

ABSTRACT

The field of digital image processing encompasses the study of algorithms applied to
two-dimensional digital images, such as photographs, or three-dimensional signals, such as
digital video. Digital image processing algorithms are generally divided into several distinct
branches including image analysis, synthesis, segmentation, compression, restoration, and
enhancement. One particular image enhancement algorithm that is rapidly gaining widespread
acceptance as a near optimal solution for providing good visual representations of
scenes is the Retinex.
The Retinex algorithm performs a non-linear transform that improves the brightness,
contrast and sharpness of an image. It simultaneously provides dynamic range compression,
color constancy, and color rendition. It has been successfully applied to still imagery captured
from a wide variety of sources including medical radiometry, forensic investigations,
and consumer photography. Many potential users require a real-time implementation of the
algorithm. However, prior to this research effort, no real-time version of the algorithm had
ever been achieved.
In this dissertation, we research and provide solutions to the issues associated with performing
real-time Retinex image enhancement. We design, develop, test, and evaluate the
algorithm and architecture optimizations that we developed to enable the implementation
of the real-time Retinex specifically targeting specialized, embedded digital signal processors
(DSPs). This includes optimization and mapping of the algorithm to different DSPs,
and configuration of these architectures to support real-time processing.
First, we developed and implemented the single-scale monochrome Retinex on a Texas
Instruments TMS320C6711 floating-point DSP and attained 21 frames per second (fps)
performance. This design was then transferred to the faster TMS320C6713 floating-point
DSP and ran at 28 fps. Then we modified our design for the fixed-point TMS320DM642
DSP and achieved an execution rate of 70 fps. Finally, we migrated this design to the fixed-point
TMS320C6416 DSP. After making several additional optimizations and exploiting the
enhanced architecture of the TMS320C6416, we achieved 108 fps and 20 fps performance for
the single-scale, monochrome Retinex and three-scale, color Retinex, respectively. We also
applied a version of our real-time Retinex in an Enhanced Vision System. This provides a
general basis for using the algorithm in other applications.



Chapter 1

Introduction
Digital image processing encompasses the research and application of signal processing
techniques applied to two-dimensional digital images, or three-dimensional signals such as
digital video. The field originates from the confluence of large-scale digital computation and
the requirement to improve the imagery generated by the U.S. space program in the mid-1960s [20].
Over the last 40 years computation technologies have experienced phenomenal
growth and digital image processing has benefited from this progress to become a tool that
is used in a wide variety of applications. There are now several branches of digital image
processing, each representing different aspects of the field. These branches include image
analysis, segmentation, compression, synthesis, restoration, and enhancement [20, 30]. One
particular image enhancement algorithm that is rapidly gaining widespread acceptance as
a near optimal solution for providing good visual representations of scenes is the Retinex.
The Retinex performs a computationally intensive, non-linear spatial/spectral transform
that synthesizes strong local contrast enhancement and color constancy [33]. It is used
to improve the brightness, contrast and sharpness of an image. It has been successfully
applied to still imagery captured from a broad range of sources including aviation safety,
medical radiometry, forensic investigations, military operations, homeland security, and
consumer photography [103, 55]. It is offered in the commercially available software package
PhotoFlair by TruView [99]. Several users require a real-time, embedded implementation
of the Retinex, but prior to this research effort, no real-time version of the algorithm
had ever been achieved. Real-time¹ is defined here as continuously capturing, processing
and displaying 15-30, 256 x 256 sized images² (frames) per second. Embedded implies a
system or component that is, in general, relatively small, inexpensive, and consumes very
little power [19].
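
To make the real-time definition above concrete, the following Python sketch works out the per-frame time budget and the raw pixel rate it implies. It is illustrative arithmetic only: the function names are hypothetical and do not come from the dissertation, and the only inputs taken from the text are the 256 x 256 frame size, the 8-bit pixels, and the 15-30 frames per second range.

```python
# Illustrative arithmetic only; function names are hypothetical.

def frame_budget_ms(fps):
    """Time available to capture, process, and display one frame (milliseconds)."""
    return 1000.0 / fps


def pixel_rate(width, height, fps, bands=1):
    """Pixels per second to be handled; equals bytes per second for 8-bit pixels."""
    return width * height * fps * bands


if __name__ == "__main__":
    for fps in (15, 30):
        print(f"{fps} fps: {frame_budget_ms(fps):.1f} ms per frame, "
              f"{pixel_rate(256, 256, fps)} bytes/s for one 8-bit band")
```

Under these assumptions the processor has roughly 33 to 67 ms per frame, and a single 8-bit band arrives at roughly 1 to 2 million bytes per second before any intermediate data is counted.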
One reason that a real-time version of the Retinex had not been achieved is because
the Retinex is inherently computationally intensive due to the large volume of data that
must be stored, processed, and transferred between processor and memory. The algorithm
also entails performing multiple, large convolutions and requires orthogonal data accesses
that exacerbate the problem. Another reason is the inefficiency of most general-purpose
computing platforms for real-time Retinex processing, as well as for many other digital
image processing algorithms. Today’s general-purpose processors, such as 2.5 GHz Pentium
4s, possess sufficient computation power to provide reasonable processing rates for Retinex
processing of small, still images. However, in general, they do not have the proper architecture,
operating system, or development tools to effectively meet the time constraints
required for real-time Retinex processing. In addition, many applications limit the processor
selection to components that can be embedded into a system. Many general-purpose
processors consume too much power or are too expensive to be used for these types of
applications.

¹ A real-time system is one that satisfies explicit bounded response-time constraints to avoid failure [89].
² All image sizes, such as 256 x 256, in this dissertation are expressed using 8-bit pixels.
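
The cost of the "multiple, large convolutions" mentioned above can be illustrated with a minimal sketch of a single-scale Retinex. This is not the author's implementation: it assumes the formulation commonly given in the Retinex literature (the logarithm of each pixel minus the logarithm of a Gaussian-weighted surround), a square image of strictly positive values, a wrap-around (circular) convolution, and a Gaussian of the form exp(-r²/2σ²). All function names are hypothetical.

```python
import math


def gaussian_surround(n, sigma):
    """Return an n x n Gaussian surround, normalized to sum to 1.

    The kernel is centred on index (0, 0) and wraps around the edges,
    so it can be used directly in a circular convolution.
    """
    kernel = [[0.0] * n for _ in range(n)]
    total = 0.0
    for y in range(n):
        for x in range(n):
            dy = min(y, n - y)   # wrap-around distance from row 0
            dx = min(x, n - x)   # wrap-around distance from column 0
            value = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            kernel[y][x] = value
            total += value
    return [[v / total for v in row] for row in kernel]


def single_scale_retinex(image, sigma):
    """log(pixel) - log(Gaussian-weighted surround) for a square image.

    `image` is a list of n rows of n strictly positive values.
    """
    n = len(image)
    kernel = gaussian_surround(n, sigma)
    output = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            surround = 0.0
            for j in range(n):          # direct circular convolution:
                for i in range(n):      # n*n multiply-accumulates per pixel
                    surround += kernel[j][i] * image[(y - j) % n][(x - i) % n]
            output[y][x] = math.log(image[y][x]) - math.log(surround)
    return output


def direct_mac_count(n):
    """Multiply-accumulates for one n x n frame with a full-size kernel."""
    return n ** 4
```

Because the output is a log-ratio, a uniform image maps to zero everywhere and multiplying every pixel by a constant leaves the result unchanged. With a kernel as large as the image, direct convolution of one 256 x 256 frame needs 256⁴ (about 4.3 billion) multiply-accumulates, which is far too many for a frame budget measured in tens of milliseconds. The section titles in Chapter 5 (convolution equivalence, cache-optimized FFTs) suggest that the dissertation addresses this with frequency-domain methods rather than the direct loop shown here.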
Several specialized, high-performance hardware architectures and technologies are suitable
for this task. Application specific integrated circuits (ASICs) [62] are one-of-a-kind
custom devices targeted towards a specific task and provide excellent performance at the
expense of long development times and high cost. Field programmable gate arrays (FPGAs) [58, 51]
are an attractive alternative that offers relative ease of programming, high
performance and reconfigurability to support custom applications. Digital signal processors
(DSPs) [4] are inexpensive, easy to program (usually in common high level languages such
as C), and offer good performance. DSPs are optimized for processing signals in real-time
and offer some limited flexibility in architecture configuration. Several other esoteric technologies,
such as array processors, are also available [36, 35]. However, for quick, low cost
development, DSPs are a suitable and sufficient design choice.
In this dissertation, we examine and provide solutions for the issues associated with
performing real-time Retinex image enhancement. We design, develop, test and evaluate
the algorithm and architecture optimizations required to enable the implementation of
the real-time Retinex specifically targeted for specialized, embedded DSPs. This includes
optimization and mapping of the algorithm to different DSPs and configuration of these
architectures to support real-time processing. We also develop and apply a particular
instance of our research efforts for the real-time Retinex into an Enhanced Vision System [98].
This provides a general basis for using the algorithm in other applications or missions.
First, we developed and implemented the single-scale monochrome Retinex executing on
a Texas Instruments TMS320C6711 floating-point DSP and attained 21 frames per second
(fps) performance [24]. This design was later transferred to the slightly faster TMS320C6713
floating-point DSP and ran at 28 fps [25]. We then modified our design targeting the
fixed-point TMS320DM642 DSP and initially achieved an execution rate of 34 fps [25]. Further
refinements and optimizations improved our performance to nearly 70 fps. This design was
implemented as part of an Enhanced Vision System (EVS) and demonstrated during EVS
flight tests in August and September of 2005. Inputs from two single-band cameras were
Retinex enhanced, registered, and fused. The system operated at over 34 fps. Finally, we
migrated our design to a TMS320C6416 fixed-point DSP. After making several additional
optimizations and exploiting the enhanced architecture of the TMS320C6416 we obtained
108 fps performance for the single-scale, single-band (monochrome) Retinex and 20 fps
performance for the three-scale, three-band (color) Retinex.
Several different user communities will benefit from this enabling technology. The
Aviation Safety Program Office at NASA LaRC will continue to support applying the real-time
Retinex in future technology demonstrations on the NASA LaRC ARIES 757 (NASA 757)
research aircraft. The Transportation Security Administration is interested in using the
Retinex in applications to improve Homeland Security. The U.S. Army has provided
funding to study using the real-time Retinex as part of a system to find improvised explosive
devices (IEDs) from unmanned aerial vehicles (UAVs). The real-time Retinex also has been
identified for potential use in future NASA space programs including lunar and planetary
exploration missions and autonomous landing systems.
In Chapter 2 of this dissertation, we discuss the mathematics behind the Retinex
algorithm. In Chapter 3 we give an overview of the architectures of our chosen DSP hardware.
In Chapter 4 we describe our test environment, and the software tools used to develop,
implement and measure the performance of the real-time Retinex. Chapter 5 is the heart of
this dissertation. In it we discuss the optimization techniques we developed and applied to
achieve real-time Retinex performance. In Chapter 6 we describe the EVS, and discuss how
particular instances of the real-time Retinex were used in this context. In Chapter 7 we
discuss future Retinex research issues and their potential solutions. This includes discussions
of distributing the core structures developed for the DSP platforms into a multiprocessor
environment, and the algorithm and architecture modifications required to process larger
format images. Finally, in Chapter 8 we give our conclusions to this research.

Chapter 2

Retinex Image Enhancement
The Retinex is a general-purpose image enhancement algorithm that is used to produce
good visual representations of scenes. The algorithm is derived from the last version of
Edward Land’s Retinex model [37] of the innate ability of human vision to perceive vivid
color and detail across widely varying lighting conditions. In addition, this perception is
relatively independent of the spectral characteristics of the illuminant. Jobson et al.
extended and improved Land’s Retinex into a general-purpose enhancement algorithm that
simultaneously provides dynamic range compression, color constancy, and color and
lightness rendition. The first version of their work, the single-scale Retinex (SSR), provided
good performance, but traded off dynamic range compression for color rendition [33]. They
improved their design by using multiple scales within the Retinex (the multi-scale Retinex,
MSR) to address this tradeoff, and additionally added a method of color restoration to improve
color rendition when gray-world violations occur within an image [32]. Other methods, such as
post-processing using a white balance technique [56], have also been added. These additions
extend the potential utility of the Retinex, but they also increase the computational
requirements of the algorithm. We concentrate on the SSR and MSR versions of the algorithm.

Figure 2.1: The top row of images from left to right have simulated tungsten, fluorescent, and
sunlight illumination sources. The bottom row has the same images after Retinex processing. The
effects of the different illumination sources are nearly completely removed.

Figure 2.1 is an example that shows the color constancy property of the Retinex. The
top row of images have simulated tungsten, fluorescent, and sunlight illumination sources
from left to right respectively, and the bottom row is the image after Retinex enhancement.
The Retinex processing has almost totally removed the effect of different illuminants on the
scene. Figure 2.2 is a good visual illustration of the dynamic range compression property.
Retinex processing of the image on the left dramatically brings out the details in the dark
regions of the image without saturating the bright regions. Both of these examples are
processed using the color version of the MSR. Figure 2.3 shows an example of monochrome
SSR processing. The contrast and sharpness of the original is improved significantly.

Figure 2.2: Many image processing algorithms would either saturate the bright regions or clip the
dark regions of the image on the left. The Retinex processed image on the right appears almost
uniformly illuminated without exhibiting these effects.
The Retinex is a member of the class of center/surround functions which are similar to
well known difference-of-Gaussian (DOG) functions [27, 54]. For the Retinex, the center is
one pixel wide and its magnitude is the pixel value, and the surround is a Gaussian. The
single-scale Retinex is given by

$$R_i(x_1, x_2) = \log(I_i(x_1, x_2)) - \log(I_i(x_1, x_2) * F(x_1, x_2)), \quad i = 1, \ldots, S \tag{2.1}$$

where $I_i$ and $R_i$ are the $i$th spectral band of the input and output image, respectively. For
a grayscale image $S = 1$ and for a standard color image $S = 3$. The log is the natural
logarithm function and $*$ represents convolution. $F$ is a Gaussian surround (or kernel)
function defined by

$$F(x_1, x_2) = \kappa \exp[-(x_1^2 + x_2^2)/\sigma^2] \tag{2.2}$$

where $\sigma$ controls the spatial extent of the surround, and
$\kappa = 1 / (\sum_{x_1} \sum_{x_2} F(x_1, x_2))$ is a normalization factor.

Figure 2.3: On the left is a low contrast, dimly lit grayscale digital image; on the right is the
single-scale Retinex processed image. Single-scale processing increases the contrast and sharpness.

Canonical gain, $\alpha$, and offset, $\beta$, values are applied to convert the
Retinex output into the user display domain, so the final form of the single-scale Retinex is

$$R_i(x_1, x_2) = \alpha (\log(I_i(x_1, x_2)) - \log(I_i(x_1, x_2) * F(x_1, x_2))) - \beta, \quad i = 1, \ldots, S \tag{2.3}$$

Values for $\alpha$, $\beta$, and $\sigma$ are application dependent and determined empirically. For example,
in normal room light conditions values of 200, -120, 80 respectively produce good results.
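
The single-scale Retinex above can be illustrated with a minimal pure-Python sketch. It is not the optimized DSP implementation developed in this dissertation: the function names (`gaussian_surround`, `circular_convolve`, `single_scale_retinex`) are hypothetical, the surround is truncated to an odd `m` x `m` window, pixel values are assumed strictly positive so that the logarithm is defined, and the convolution is the direct circular form in the spatial domain.

```python
import math

def gaussian_surround(m, sigma):
    """Return an m x m Gaussian surround whose entries sum to 1 (Eq. 2.2)."""
    c = m // 2  # index of the kernel center
    f = [[math.exp(-((r - c) ** 2 + (s - c) ** 2) / sigma ** 2)
          for s in range(m)] for r in range(m)]
    kappa = 1.0 / sum(map(sum, f))  # normalization factor
    return [[kappa * v for v in row] for row in f]

def circular_convolve(img, f):
    """Direct spatial-domain circular convolution of img with kernel f."""
    rows, cols = len(img), len(img[0])
    m = len(f)
    c = m // 2
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0.0
            for r in range(m):
                for s in range(m):
                    # Indices wrap around the image edges (circular).
                    acc += f[r][s] * img[(y - (r - c)) % rows][(x - (s - c)) % cols]
            out[y][x] = acc
    return out

def single_scale_retinex(img, sigma, m, alpha=1.0, beta=0.0):
    """Eq. 2.3 for one spectral band: alpha * (log I - log(I * F)) - beta."""
    surround = circular_convolve(img, gaussian_surround(m, sigma))
    return [[alpha * (math.log(p) - math.log(q)) - beta
             for p, q in zip(img_row, sur_row)]
            for img_row, sur_row in zip(img, surround)]
```

On a perfectly flat image the two logarithms cancel, so every output pixel equals $-\beta$; this is a convenient sanity check for any implementation.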
The multi-scale Retinex is defined as the weighted sum of $K$ SSR outputs, where $K$ is
the number of scales. Thus the MSR is given by

$$R_i(x_1, x_2) = \sum_{k=1}^{K} W_k (\log(I_i(x_1, x_2)) - \log(I_i(x_1, x_2) * F_k(x_1, x_2))) \tag{2.4}$$

where the $F_k$ are now defined as

$$F_k(x_1, x_2) = \kappa_k \exp[-(x_1^2 + x_2^2)/\sigma_k^2] \tag{2.5}$$

The $W_k$ are the weighting factors and the $\kappa_k$ are the normalization factors associated with
each scale. Jobson et al. [32] have shown, empirically, that three scales with reasonable local
to global coverage, and equal weights provide good performance for most images. Again a
canonical gain, $\alpha$, and offset, $\beta$, are applied; thus the final form of the MSR is

$$R_i(x_1, x_2) = \alpha \sum_{k=1}^{K} W_k (\log(I_i(x_1, x_2)) - \log(I_i(x_1, x_2) * F_k(x_1, x_2))) - \beta \tag{2.6}$$
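
As a hedged illustration of the multi-scale form, the sketch below combines $K$ surround-filtered copies of one spectral band into the MSR output. The function name `multi_scale_retinex` and its argument layout are assumptions made for this example; the filtered images $I_i * F_k$ are assumed to have been computed elsewhere, and all values are assumed strictly positive.

```python
import math

def multi_scale_retinex(image, blurred, weights, alpha=1.0, beta=0.0):
    """Combine K scales for one band.

    image   -- 2-D list of strictly positive pixel values I(x1, x2)
    blurred -- list of K 2-D lists; blurred[k] holds I * F_k (computed elsewhere)
    weights -- list of K weighting factors W_k
    """
    if len(blurred) != len(weights):
        raise ValueError("need one weight per scale")
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            log_i = math.log(image[y][x])  # log(I) is shared by all K scales
            total = sum(w * (log_i - math.log(b[y][x]))
                        for w, b in zip(weights, blurred))
            out[y][x] = alpha * total - beta  # gain and offset applied once
    return out
```

Because $\log(I_i)$ does not depend on the scale, this arrangement evaluates it once per pixel rather than once per scale; whether that saving matters on a particular target is a separate question.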

The derivation of the computational complexity of the Retinex is straightforward. Assume
that the input image dimension size is $N \times N$, the extent of the surround, $F$, is
$M \times M$, circular convolution is performed in the spatial domain, and ignore the operations
involving $\alpha$, $\beta$, $W_k$ and the computations required to generate $F_k$. We show in Section 5.1.2
that these are all valid assumptions. Then for the single-scale monochrome Retinex, there
are $M^2$ multiplies and $M^2 - 1$ additions for every pixel. There are also $2N^2$ logarithm
operations (two logarithms for each pixel) and $N^2$ subtractions. Thus, the running time
of the algorithm is driven by the convolution operation and the complexity is $O(N^2 M^2)$.
As the extent of $F$ approaches the size of the image, i.e. $M \to N$, the complexity becomes
$O(N^4)$. For the one scale, multi-spectral case, the monochrome algorithm is performed $S$
times, once for each spectral band. The complexity remains the same, $O(N^2 M^2)$. For the
multi-scale, multi-spectral case, the convolution and the other arithmetic operations are
performed $K$ times, once for each scale. This is subsequently repeated $S$ times, once for
each spectral band. Additionally, as discussed in Section 5.4, for any multi-spectral case,
functions may be required to divide the spectrum into its individual component parts for
processing, and to combine the processed components back together again. However, the
complexity still remains the same, $O(N^2 M^2)$. Methods to reduce the running time of the
algorithm are discussed in Chapter 5.
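
The operation counts in this derivation can be tabulated with a small helper. This is only a sketch of the arithmetic stated above, under the same assumptions (direct spatial convolution, gain, offset, weights and kernel generation ignored); the function name `retinex_op_counts` is hypothetical.

```python
def retinex_op_counts(n, m, scales=1, bands=1):
    """Operation counts for an N x N image with an M x M surround.

    Per the derivation, one pass costs M^2 multiplies and M^2 - 1 additions
    per pixel, plus two logarithms and one subtraction per pixel. The pass
    is repeated once per scale (K) and once per spectral band (S).
    """
    pixels = n * n
    passes = scales * bands
    return {
        "multiplies": passes * pixels * m * m,
        "additions": passes * pixels * (m * m - 1),
        "logarithms": passes * 2 * pixels,
        "subtractions": passes * pixels,
    }
```

Calling `retinex_op_counts(256, 256)` shows the $M \to N$ limit: the multiply count becomes $N^4$, which is why reducing the cost of the convolution dominates the optimization effort.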

Chapter 3

Digital Signal Processors
For our research we have selected four state-of-the-art Texas Instruments (TI) DSPs for
implementation and performance evaluation of the real-time Retinex (RTR). TI processors
were chosen because of their flexible and powerful architecture, good support tools,
availability of the DSPs to the researchers, low cost of evaluation boards, and our past familiarity
with using TI processors. Many other DSPs, such as Analog Devices SHARC processors,
would also provide reasonable hardware platforms for implementation. All of the TI DSPs
that were chosen are based on an advanced very-long-instruction-word (VLIW) [71]
architecture. This type of architecture achieves high performance by exploiting instruction-level
parallelism. Multiple execution units operate in parallel to execute multiple instructions
during a single clock cycle. Our four target DSPs are the TMS320C6711, TMS320C6713,
TMS320DM642, and TMS320C6416. In this chapter we discuss the relevant details of each
of these processors.

[Figure 3.1 block diagram: C671x DSP core (instruction fetch, instruction dispatch, instruction decode, register files, functional units); L1P cache (4 KB, direct mapped); L1D cache (4 KB, 2-way set associative); L2 memory (64 KB); EDMA controller; 32-bit EMIF connecting to SRAM, SBSRAM and ROM; timers; McBSPs; HPI; interrupt selector.]
Figure 3.1: Primary DSP components include the CPU, L1 Data Cache, L1 Program Cache, L2
memory (SRAM/Cache) and EDMA Controller.
3.1 TMS320C6711

Our first target, the TMS320C6711B (C6711) DSP, is a 32-bit floating-point processor
that offers up to 1200 million instructions per second (MIPS)/900 million floating-point
operations per second (MFLOPS) performance at a clock rate of 150 MHz (6.67 ns cycle
time) [73]. As shown in Figure 3.1 the processor is divided into three main components:
the CPU (or core), memory, and peripherals.
The CPU has eight independent functional units and a 256-bit wide instruction word
that allows up to eight 32-bit instructions to be supplied to the units during every clock
cycle. The functional units are mapped into two sets where each set contains four units and
a register file. In total the eight functional units provide four fixed/floating-point arithmetic
logic units (ALUs), two fixed-point ALUs, and two fixed/floating-point multipliers. Two
multiply-and-accumulates (MACs) per cycle can be performed for a total of up to 300 million
MACs (MMACs) per second. Each of the two register files contains sixteen 32-bit registers
for a total of 32 general-purpose registers. Six of the functional units have access to the
register file on the opposite side via a cross path. Like a MIPS processor, the CPU uses a
load/store architecture, where all instructions operate on registers. There are dual 64-bit
load data paths and dual 32-bit store data paths.
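
The peak figures quoted for the C6711 follow directly from the clock rate and the per-cycle issue widths. The sketch below reproduces that arithmetic; the function name `peak_rates` is hypothetical, and the value of six floating-point operations per cycle is an inference from the six floating-point-capable units (four ALUs and two multipliers) rather than a figure stated in the text.

```python
def peak_rates(clock_mhz, instr_per_cycle, flops_per_cycle, macs_per_cycle):
    """Peak throughput implied by a clock rate and per-cycle issue widths."""
    return {
        "cycle_ns": 1000.0 / clock_mhz,           # cycle time in nanoseconds
        "mips": clock_mhz * instr_per_cycle,      # million instructions per second
        "mflops": clock_mhz * flops_per_cycle,    # million floating-point ops per second
        "mmacs": clock_mhz * macs_per_cycle,      # million MACs per second
    }

# C6711: 150 MHz, 8 instructions, 6 floating-point operations (assumed), 2 MACs per cycle
c6711 = peak_rates(150, 8, 6, 2)
```

These are upper bounds that assume every functional unit is busy on every cycle; sustained rates depend on how well the code can be scheduled and on memory stalls.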
The DSP has a two-level memory architecture for both program and data [88]. Figure 3.2
is a general outline of the architecture. This hierarchical architecture is used to reduce the
average memory access time by exploiting the temporal or spatial locality of data [87]. The
Level 1 data cache (L1D) is a 32-Kbit 2-way set associative cache that services data accesses
from the CPU. It has a 32-Byte line size and 64 sets. The L1D is implemented with a single
bank of dual-ported 64-bit memory and can service up to two data accesses from the CPU
on every cycle. The L1D is a read-allocate cache, but does not write-allocate¹. A 32-bit
by 4-entry write buffer between the L1D and the L2 memories is used to capture write
misses. The Level 1 program cache (L1P) is a 32-Kbit, direct-mapped, read-allocate cache
that services program fetches from the CPU. It has a 64-Byte line size and 64 sets.

¹ A read/write-allocate cache allocates space (i.e. selects a location in the cache) on a read/write miss
according to the cache allocation policy.

[Figure 3.2 diagram: the C67x CPU connects to the L1 program cache (4 KB) and the L1 data cache (4 KB); a write buffer sits between the L1 data cache and the configurable on-chip L2 memory (SRAM and cache portions), which connects to external memory.]
Figure 3.2: General outline of 2-level internal memory architecture of C67x processors. The dashed
boxes are user addressable memory.

The Level 2 (L2) memory space is 64-KBytes that can be configured as all SRAM, all
cache, or combinations of the two in 16-KByte increments. This memory services requests
from the L1P, L1D, enhanced direct memory access (EDMA), or internal cache operations,
with request priority from highest to lowest as listed. It is divided into four 64-bit banks
that operate at the CPU’s clock rate, 150 MHz, but pipeline accesses over two cycles. Any
portion of L2 configured as cache (L2 Cache) is organized as 128 sets with 128-Byte line
size. The associativity varies from 1-way when the cache capacity is 16-KBytes, up
to 4-way at 64-KBytes. The different configuration modes are shown in Figure 3.3. The
operation of L2 Cache is similar to that of both the L1P and L1D caches. On a cache hit
the L2 cache services the request directly. The L2 Cache is a writeback² cache so external
memory is not updated until the line is either evicted or written back using cache control
registers. Unlike the L1D, the L2 Cache is read-allocate and write-allocate. A least-recently
used policy (LRU) is used for line selection.
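
The set counts quoted for these caches are consistent with the usual relation sets = capacity / (line size x associativity). The sketch below checks that relation; the helper names `cache_sets` and `set_index` are hypothetical, and the modulo set-index mapping is the textbook scheme, assumed here for illustration rather than taken from the device documentation.

```python
def cache_sets(capacity_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache (ways=1 is direct mapped)."""
    sets, remainder = divmod(capacity_bytes, line_bytes * ways)
    if remainder:
        raise ValueError("capacity must be a multiple of line size times ways")
    return sets

def set_index(address, line_bytes, sets):
    """Textbook mapping (assumed): the line number modulo the number of sets."""
    return (address // line_bytes) % sets

KB = 1024
l1d_sets = cache_sets(4 * KB, 32, 2)   # 32-Kbit, 2-way, 32-Byte lines
l1p_sets = cache_sets(4 * KB, 64, 1)   # 32-Kbit, direct mapped, 64-Byte lines
# L2 cache modes: 16, 32, 48 or 64 KBytes with 1 to 4 ways and 128-Byte lines
l2_sets = [cache_sets(16 * KB * ways, 128, ways) for ways in (1, 2, 3, 4)]
```

Under the assumed mapping, two addresses that are 2048 bytes apart (64 sets x 32 bytes) fall into the same L1D set, which is the kind of conflict that strided image accesses can trigger.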
Several peripherals are located within the processor. There is a multichannel EDMA
controller that supports up to 16 channels of data transfers. There is a host port interface
(HPI) that allows a host processor to directly address the CPU’s memory space. There is
also a 32-bit external memory interface (EMIF) that provides an interface to external devices
such as synchronous dynamic random access memory (SDRAM) and read-only memories
(ROMs) [78].

² Writeback is the process of writing data that has been modified from a valid, but now dirty cache line to
lower-level memory. Write hits to a writeback cache are not immediately forwarded to lower-level memory.

[Figure 3.3 diagram: five L2 configurations: 64 KB mapped memory; 48 KB mapped memory with 16 KB 1-way cache; 32 KB mapped memory with 32 KB 2-way cache; 16 KB mapped memory with 48 KB 3-way cache; 64 KB 4-way cache.]
Figure 3.3: Configuration modes for the C6711 L2 memory.

3.2 TMS320C6713

Our second target, the TMS320C6713 (C6713), is a 32-bit floating-point processor that
performs up to 1800 MIPS/1350 MFLOPS at a clock rate of 225 MHz (4.4 ns instruction
cycle time) [84]. The architecture of the C6713 is very similar to the C6711, and code
operating on one device directly ports over to the other [92]. The most relevant differences
in the two devices are listed below.
• The C6713 operates at 225 MHz while the C6711 only operates at 150 MHz.
• The C6713 has a larger internal memory. The L1 caches are the same, but the C6713
has an additional 192-KBytes of SRAM in L2 that only functions as mapped memory.
• The C6713 has a software-configurable Phase-Locked Loop (PLL) controller that can
be used to select different clock frequencies for the DSP core, peripherals and the
EMIF [94]. Speeding up EMIF transfers can enable faster throughput.

[Figure 3.4 block diagram: DM642 DSP core (instruction fetch, instruction dispatch, instruction decode, register files, functional units); L1P cache (16 KB, direct mapped); L1D cache (16 KB, 2-way set associative); L2 memory (256 KB); EDMA controller; 64-bit EMIF connecting to SRAM, SBSRAM and ROM; timers; McBSPs; HPI; three video ports.]
Figure 3.4: Block diagram of primary DM642 components. The DM642 has special instruction
extensions to accelerate video applications.

3.3 TMS320DM642

Our third target is the TMS320DM642 (DM642). The DM642 is a 32-bit fixed-point processor
that performs up to 4800 MIPS at a clock rate of 600 MHz (1.67 ns instruction cycle
time) [86]. A block diagram of the processor is shown in Figure 3.4. The DM642 also has
eight independent functional units consisting of six ALUs and two enhanced multipliers.
In addition to standard multiplies, the multiply units include hardware that can perform
bit-count, rotates, and bidirectional variable shifts. Four 32-bit MACs per cycle can be
performed for a total of 2400 MMACs per second, or eight 8-bit MACs per cycle for a total
of 4800 MMACs. There are new instruction extensions to accelerate video and imaging
applications, and to improve the parallelism of the architecture [79]. This includes support
for packed 8-bit and 64-bit data types, and instructions that perform non-aligned loads and
stores of words or double words.
The DM642 also has a two-level cache [95]. The L1P is a 16-KByte direct-mapped
cache with 32-Byte line size and 512 sets. Multiple cache misses are pipelined. The L1D
is 16-KBytes deep and is 2-way set associative with a 64-Byte line size and 128 sets. It is
implemented as eight 32-bit wide banks of single-ported memory, as opposed to the single
bank of dual-ported memory of the C671x devices. Each single-ported bank allows only
one access per cycle. The L1D is a read-allocate only cache where new lines are allocated
for L1D read misses but not write misses. The L1D implements an LRU line allocation policy
for read misses and pipelines multiple misses. A 64-bit by 4-entry write buffer between L1D
and L2 memory captures data from write misses. This buffer is an enhanced version of the
one in the C671x in that the L2 can process a new request from the write buffer every
cycle, as opposed to every 2 cycles on the C671x, provided that the L2 bank is not busy.
Additionally, the DM642 write buffer allows merging of write requests, thus effectively
increasing the write buffer capacity, reducing the stall penalty, and reducing the overall
number of write operations the L2 must process.
The L2 memory is 256-KBytes that can be configured as local SRAM, cache or
combinations of the two. This memory services cache misses from the L1P, the L1D, the EDMA
controller and internal cache operations with request priority from highest to lowest as
listed. It is divided into eight 64-bit banks that operate at the CPU’s clock rate, 600 MHz,
but pipeline accesses over two cycles. Four L2 Cache configuration modes are supported:
32-KByte capacity organized as 64 sets, 64-KByte capacity as 128 sets, 128-KByte capacity
as 256 sets, and 256-KByte capacity as 512 sets. L2 Cache is always 4-way set associative
with 128-Byte line sizes and operates as a write-back cache. A cache line is allocated for
both read and write misses, and an LRU policy is used for line selection.
The DM642 also has many of the same peripherals as the C671x devices with several
extensions and additions including a 64-bit EMIF and three configurable video port
peripherals [89]. The video ports provide a glue-less interface to common video decoder and
encoder devices. Each video port can be configured for either video capture or display, and
each port supports up to two channels with a 5120-Byte buffer that is shared between the
two channels.

3.4 TMS320C6416

Our fourth target is the TMS320C6416 (C6416). The C6416 is a 32-bit fixed-point processor
that performs up to 8000 MIPS at a clock rate of 1000 MHz (1 ns instruction cycle time) [97].
A block diagram of the processor is shown in Figure 3.5. The C6416 has eight independent
functional units consisting of six ALUs and two enhanced multipliers capable of performing
four 16-bit x 16-bit multiplies every clock cycle with add/subtract operations. Four 32-bit
MACs per cycle can be performed for a total of 4000 MMACs per second, or eight 8-bit
MACs per cycle for a total of 8000 MMACs. The C6416 also includes support for packed
8-bit and 64-bit data types, and allows for non-aligned loads and stores of words/double
words [79]. There are two register files, each containing 32 32-bit registers for a total of
64 general-purpose registers. All eight of the functional units have access to the opposite
register file and the dual load and store data paths are 64 bits wide.

[Figure 3.5 block diagram: C6416 DSP core (instruction fetch, instruction dispatch, instruction decode, register files, functional units); L1P cache (16 KB, direct mapped); L1D cache (16 KB, 2-way set associative); L2 memory (1024 KB); EDMA controller; 64-bit EMIF connecting to SRAM, SBSRAM and ROM; 16-bit EMIF; McBSPs; HPI; PCI.]
Figure 3.5: Block diagram of primary C6416 components. Note the larger L2 memory and 64-bit
EMIF bus.

The C6416 also has a two-level cache [97]. The L1P and L1D are the same size and
operate the same as the respective memories on the DM642. The L2 memory has been
increased to 1024-KBytes and can be configured as all mapped memory or combinations
of cache (up to 256-KBytes) and mapped memory. Any portion of L2 memory partitioned
as cache has the same modes as on the DM642. The C6416 has two EMIFs: one 64 bits
wide and one 16 bits wide. The total external addressable memory space is 1280-MBytes.
Table 3.1 summarizes the pertinent parameters of the DSPs.

DSP     Type          Frequency (MHz)   L1 (KBytes)   L2 (KBytes)   EMIF (Width)   EMIF Clk (MHz)
C6711   Floating-pt   150               8             64            1 32-bit       100
C6713   Floating-pt   225               8             256           1 32-bit       90
DM642   Fixed-pt      720               32            256           1 64-bit       133
C6416   Fixed-pt      1000              32            1024          1 32-bit       100
                                                                    1 64-bit       100

Table 3.1: DSP Specifications


Chapter 4

Test Environment
We now describe the platforms that support each DSP and the general hardware and
software test environment. This environment will be used to test, analyze and evaluate our
optimization techniques discussed in Chapter 5.

4.1 DSP Evaluation Modules

Each DSP is embedded on a different printed circuit board for test and evaluation. The
circuit boards are called EVMs (evaluation modules). Figure 4.1 shows the EVM for the
DM642. The other EVMs look similar to this. As can be seen in the figure, each EVM has
several components and interfaces to support the associated DSP. We will briefly describe
the EVMs for each of our selected DSPs, only defining the parts relevant to our
discussion. We will then describe the tools used for software development, optimization, and
performance analysis.
The C6711 EVM has 16-MBytes of SDRAM clocked at 100 MHz that is used as external
memory for the chip. There are 128-KBytes of flash memory which is usually used to
hold application code and parameters when power is disconnected from the board.
Communication to a host PC, primarily for downloading code and gathering statistics, is
through a parallel port. An embedded Joint Test Action Group (JTAG) controller is used
for emulation and debugging [28]. The board also has an expansion connector to support
adding additional memory, peripherals, or daughter-cards [77].

Figure 4.1: Picture of DM642 EVM board. Numerous components are on the EVM circuit board
to support testing the DSP for a wide variety of applications. We primarily use the peripherals
associated with video capture and display.

The C6713 EVM has 8-MBytes of SDRAM clocked at a default rate of 90 MHz and
512-KBytes of flash memory. Communication to a host PC is performed through a Universal
Serial Bus (USB) port. An embedded USB JTAG controller is provided for debugging [66].
The EVM also has an Intel LXT971 Ethernet port for data transfers to an external device.
The DM642 EVM has 32-MBytes of SDRAM clocked at 133 MHz, 4-MBytes of flash
memory, an Intel LXT971 Ethernet interface, and a standard JTAG connector for external
emulation [67]. The C6416 EVM has 256-MBytes of SDRAM on the 64-bit EMIF bus
and 8-MBytes on the 32-bit wide EMIF bus. Both busses are clocked at 100 MHz. The
board also has 4-MBytes of flash memory, and a dedicated JTAG connector for external
emulation [3].

4.1.1 Video Capture and Display for C6711 and C6713 EVMs

For the C6711 and C6713 EVMs, video capture, display, and data formatting are performed
by an imaging daughter-card (IDC) [76] that connects to each board’s expansion connectors.
The main components of the IDC are a TI TVP5022 digital video decoder chip [74], a TI
TVP3026 RAMDAC digital video encoder chip [70], a Xilinx FPGA for control, buffer
management and interface logic, and 2-MBytes of SDRAM for capture frame memory. The
IDC also has a female Radio Corporation of America (RCA) connector that is used to
receive video, and a standard 15-pin female video graphics array (VGA) connector that is
used to supply red, green, blue (RGB) [69] video output to a monitor.
Figure 4.2 is a block diagram of the video capture subsystem [72]. A video input signal
from an NTSC (or Phase Alternating Line (PAL)) source is digitized by the TVP5022
decoder chip into a standard Y'CbCr 4:2:2 format¹. Y'CbCr is a color space used
to represent digital component video where color is represented by a luma component (Y'),
and two chroma components (Cb and Cr). The 4:2:2 notation² designates the ratio of Y',
Cb and Cr signals where Cb and Cr are co-sited and subsampled at half the horizontal
resolution of Y' [53].

¹ The ITU-R BT.601 standard defines the Y'CbCr color space and the 4:2:2 sampling organization and
resolutions. The BT.656 standard defines the serial and parallel interfaces for transmitting Y'CbCr 4:2:2
digital video [29, 100].
² The number 4 originates from a multiplier of the BT.601 chosen baseline frequency of 3.375 MHz and
corresponds to a sampling rate of 13.5 MHz, a standard frequency for digitizing NTSC or PAL; similarly the
2s correspond to 6.75 MHz [100]. Other common subsampling ratios include 4:4:4, 4:1:1 and 4:2:0 (where
the chroma components are sited interstitially).
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

24

ID C

Video
Source

TVP5022

FPGA

IDC SRAM
Active Last Active User
Y1
Y1
Y1
C bl
C bl
C bl
C rl
C rl
C rl
Field 1
Field 1
Field 1
Y2
Cb2
Cr2
Field 2

Y2
Gb2
Cr2
Field 2

Y2
Cb2
Cr2
Field 2
EMIF

Ext Int 5

EVM

C6711/13
DSP

F ig u r e 4.2: ID C video c ap tu re subsystem

The 8-bit wide Y'CbCr pixel stream, interleaved as Cb, Y', Cr, Y', ..., is fed into
the FPGA. The FPGA separates and stores the stream into capture frame memory buffers
as two separate fields (odd and even) in three separate blocks (Y', Cb, Cr) as shown in
Figure 4.2. The TVP5022 chip also controls all video input timing including a vertical
synchronization signal that generates a CPU interrupt once per frame, and a blanking
signal that indicates the presence of data on the pixel bus to the FPGA.
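
The component separation performed by the FPGA can be sketched in software as follows. This is an illustration only, not the FPGA logic: the function name `deinterleave_422` is hypothetical, the input is assumed to be one line of 8-bit samples in the Cb, Y', Cr, Y' order described above, and the additional split into odd and even fields is omitted.

```python
def deinterleave_422(stream):
    """Split an interleaved Cb, Y', Cr, Y' byte stream into three planes.

    Returns (y, cb, cr). In 4:2:2 sampling every group of four bytes
    carries two luma samples and one sample of each chroma component.
    """
    if len(stream) % 4:
        raise ValueError("stream length must be a multiple of 4 bytes")
    cb = bytes(stream[0::4])  # bytes 0, 4, 8, ...
    y = bytes(stream[1::2])   # every odd-numbered byte
    cr = bytes(stream[2::4])  # bytes 2, 6, 10, ...
    return y, cb, cr
```

For the monochrome Retinex only the luma plane is needed, so storing Y' in its own contiguous block lets the DSP read it without striding over the chroma samples.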
The capture frame memory buffers are memory-mapped into the DSP address space
as read-only and are accessed via the EMIF. A triple buffering scheme is used to allow an
application to obtain a new buffer of the most recently captured data without waiting. The
“active” buffer is currently receiving data from the TVP5022. The “last active” buffer is
the last buffer that was filled by the TVP5022. The “user” buffer is owned and read by the
user application. If the application can maintain a full 30 fps processing rate, the buffers are
physically walked through in a circular sequence by the FPGA and user application. If the
user application attempts to access the buffers faster than 30 Hz, then duplicate frames will
be returned. If the application executes slower, then captured frames will be overwritten.

[Figure 4.3 block diagram: the C6711/13 DSP on the EVM connects through the EMIF to three display buffers in SDRAM (Display Buffer 0: Active, Display Buffer 1: Next Active, Display Buffer 2: User) and to the IDC display FIFO; the TVP3026 drives the monitor; VSYNC is routed to external interrupt 6 and HSYNC to external interrupt 7.]
Figure 4.3: IDC video display subsystem

Figure 4.3 is a block diagram of the video display subsystem [72]. Video display is
limited to a maximum size of 800 x 600 pixels with 8 bits per pixel for grayscale or 16 bits per
pixel for RGB 565 color³. A total output frame display buffer size of 2.88 MBytes (800 x
600 x 16 bits for 3 buffers) is allocated and linked into the DSP's external memory space.
Timing signals for video readout include a vertical synchronization (VSYNC) signal and a
horizontal synchronization (HSYNC) signal. The VSYNC signal triggers a CPU interrupt
and the associated interrupt service routine posts a display semaphore which is used to wait
for new frames. The HSYNC signal triggers an EDMA event to copy one line of display data
from the display buffer to the IDC display first-in-first-out (FIFO) buffer. The TVP3026
RAMDAC chip then transmits this line to the output port.
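
The display buffer arithmetic and the RGB 565 pixel format can be checked with a short Python sketch. The function names are hypothetical, and the bit order (red in the most significant bits) is the conventional RGB 565 layout, which is an assumption here because the text does not state it.

```python
def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B into one 16-bit word laid out as RRRRRGGGGGGBBBBB."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def display_buffer_bytes(width=800, height=600, bits_per_pixel=16, buffers=3):
    """Total bytes needed for all display buffers."""
    return width * height * (bits_per_pixel // 8) * buffers


print(hex(pack_rgb565(255, 0, 0)))
print(display_buffer_bytes() / 1e6, "MBytes")
```
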
Analogous to the video capture system, a triple buffering scheme is used for data transfers.
The “user” buffer is owned by the user application. The “next active” buffer will be
returned on the next buffer request. The “active” buffer is being used for EDMA transfers.
If the application attempts to access buffers too fast, frames will be dropped. If access is
too slow, frames will be displayed repeatedly.

³ RGB 565 represents color values using 5 bits for red, 6 bits for green and 5 bits for blue.
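
The capture-side behavior of triple buffering (duplicate frames when the reader is fast, overwritten frames when it is slow) can be sketched as follows. This is a simplified model, assuming the three roles are swapped by index rather than walked in the physical circular sequence described for the IDC; the class and method names are hypothetical.

```python
class TripleBuffer:
    """Toy model of the active / last active / user capture buffers."""

    def __init__(self):
        self.active, self.last_active, self.user = 0, 1, 2
        self.frames = [None, None, None]
        self.fresh = False  # True when last_active holds an unread frame

    def frame_captured(self, frame):
        """Producer side: finish filling the active buffer and move on."""
        self.frames[self.active] = frame
        # The buffer just filled becomes "last active"; the stale one is reused.
        self.active, self.last_active = self.last_active, self.active
        self.fresh = True

    def get_frame(self):
        """Consumer side: return the most recent frame without waiting."""
        if self.fresh:
            self.user, self.last_active = self.last_active, self.user
            self.fresh = False
        return self.frames[self.user]


tb = TripleBuffer()
tb.frame_captured("f1")
first = tb.get_frame()    # newest frame
repeat = tb.get_frame()   # reader too fast: the same frame is returned again
tb.frame_captured("f2")
tb.frame_captured("f3")   # reader too slow: f2 is overwritten before it is read
latest = tb.get_frame()
```
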

4.1.2 Video Capture and Display for DM642 and C6416 EVMs

The DM642 has three on-chip video ports. On the EVM two of the ports are configured
as capture ports (video ports 0 and 1) and one is configured as a display port (video port
2). The capture ports interface to TI TVP5146 [96] and TVP5150A [91] video decoders.
The TVP5146 supports composite⁴ or Y/C format⁵ inputs, and the TVP5150A supports
composite inputs only on the EVM. The output of the display port is routed through an
FPGA (for functions such as on-screen display or overlays) to a Philips SAA7105 video
encoder. The SAA7105 drives either NTSC/PAL composite video, S-video, RGB, or
high-definition component video. Figure 4.4 is a block diagram of the system. Analog input video
is digitized into planar Y'CbCr 4:2:2 component video and buffered in external memory
similar to the method used for the IDC.
A block diagram of the C6416 EVM is shown in Figure 4.5. Analog video is digitized by
a Conexant BT835 decoder into a Y'CbCr 4:2:2 format and stored by the FPGA into the
capture FIFO buffer. Instead of being written in planar form as on the C6711 EVM, the
captured data is stored in Cr, Y', Cb, ... interleaved order. The FIFO is memory-mapped
into the address space of the DSP and accessed via the EMIF. Similarly, output data to be
displayed is stored in Y'CbCr 4:2:2 format and written using an EDMA channel into the
display FIFO by the DSP. The pixel stream is then transferred to a Conexant BT864 for
digital-to-analog conversion (DAC) and NTSC/PAL encoding.

⁴ Composite video combines luma, chroma and sync signals into a single waveform carried on a single wire
pair.
⁵ Y/C has the luma and chroma components carried on separate signal wire pairs to reduce signal crosstalk.
Y/C is often incorrectly referred to as S-video, a magnetic tape modulation format.

[Figure 4.4 block diagram. Labels: Video Input 1 (Composite, S-Video), TVP5146 Video Decoder, Video Port 1, Video Input 2 (Composite), TVP5150A Video Decoder, Video Port 2, DM642, SRAM, Video Port 3, FPGA, SAA7105 Video Encoder, Video Output (Composite, RGB, S-Video, HD).]
Figure 4.4: DM642 EVM block diagram

4.2 Development Tools

Several software development tools are used on all of the EVMs, including a C compiler,
assembly optimizer, and a debugger for visibility into source code execution. These tools are
incorporated into TI's Code Composer Studio (CCS). Other rapid prototyping software tools
used include a chip support library (CSL) [81] to configure and control on-chip peripherals,
an image data manager (for the IDC) for DMA abstraction, and a C-callable DSP library
(DSPLib) [90] that contains a collection of highly optimized functions such as the
well-known Fast Fourier Transforms (FFT) [7, 49, 64]. A scalable real-time operating system
(OS) kernel called DSP/BIOS (basic input output system) is used to provide preemptive
multi-threading, hardware abstraction and real-time analysis [80].


[Figure 4.5 block diagram. Labels: Video Input (Composite, S-Video), BT835 ADC, Capture FIFO, C6416, FPGA, SRAM, Display FIFO, BT864 DAC, Video Output (Composite, S-Video), C6416 EVM.]
Figure 4.5: C6416 EVM block diagram

Compiler options are used to control speculative loading, auto in-lining thresholds, data
alignment/placement information, and advanced loop optimizations [82]. Significant
performance improvements can be gained by using target-specific instructions called intrinsics [93].
Intrinsics are special functions that allow certain assembly statements to be easily
embedded in application code. For example, to find the maximum value of two variables x1 and
x2 we simply use the optimized in-line intrinsic function call for max2: max2(x1, x2).

4.3 Test-Bed Components and Operation

A test-bed is used to implement and analyze the real-time Retinex algorithm and to support
testing the algorithm within the context of the EVS for our case study. The baseline test-bed
is composed of
• a standard NTSC video source (for example a video camera, DVD player or VCR),
• a monitor that accepts a composite video input to display the processed output,
• a host personal computer (PC) running CCS for code development and analysis,
• a JTAG emulator for communication and debugging, and
• the target DSP on an EVM as discussed in Section 4.1.

[Figure 4.6 block diagram. Labels, C671X EVM panel: Video Source, IDC, C671X DSP, SDRAM, Monitor, JTAG Emulator, Host. Labels, DM642/C6416 EVM panel: Video Source, Decoders, DM642/C6416, SDRAM, Encoder, Monitor, JTAG Emulator, Host.]
Figure 4.6: Block diagram of the test-bed. The host PC only provides setup information to the
EVM; after initiation, the DSP executes independently.

Figure 4.6 shows general outlines of the test-bed using the C6711 and C6713 EVMs with
IDCs, and the DM642 and C6416 EVMs. The host PC is not part of the image processing
chain.
General operation of the test-bed is as follows. C code to perform the Retinex is written
on the PC using the CCS software. This code is compiled, assembled and linked into a
common object file format (COFF) executable and is downloaded into the DSP on the EVM.
Execution of the algorithm is then triggered from the PC. From this point on, the EVM operates
independently of the PC. For performance analysis, (1) video frames
are captured from the source, (2) a 256 x 256 pixel portion of the captured frame
buffer is Retinex processed, and (3) the output product is displayed.


4.4 Performance Analysis

The execution time of the Retinex is measured by using the real-time analysis tools within
DSP/BIOS. These tools are composed of instrumentation code that is integrated into the
target application. The code is executed at run time, and the events of interest are stored
in memory on the target. This information is transferred to the host PC for display, further
processing, or post-execution analysis. All instrumentation operations have fixed, short
execution times, and communication between the target and host is performed in the
background using a low priority idle thread, thus minimizing the impact on performance and
program behavior.
The instrumentation modules can be called explicitly by the application through
application programmer interfaces (APIs) or implicitly through the calls used internally by
DSP/BIOS [80]. Explicit instrumentation API modules include a statistics (STS) object
manager and a trace (TRC) manager. STS objects store statistics about data variables or
system performance, including count, maximum, total, and average values, in real
time. The TRC module provides a means to enable or disable data acquisition in real time
through querying a set of bits.
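
The kind of running statistics described for an STS object can be illustrated with a small Python stand-in. This is a sketch of the idea only; the class `StatAccumulator` and its members are hypothetical and do not reproduce the DSP/BIOS STS API.

```python
class StatAccumulator:
    """Hypothetical stand-in for a statistics object: count, max, total, average."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.maximum = None

    def add(self, value):
        """Fold one new sample into the running statistics."""
        self.count += 1
        self.total += value
        if self.maximum is None or value > self.maximum:
            self.maximum = value

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0


stats = StatAccumulator()
for sample in (3, 7, 5):
    stats.add(sample)
```

Only three values are stored no matter how many samples arrive, which keeps the cost of each update fixed and short.
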
Implicit instrumentation is built into DSP/BIOS and allows the user to display several
values including CPU loading. CPU loading is defined as the percentage of instruction
cycles that the CPU spends performing application related work: running interrupts,
tasks, periodic functions, performing I/O to the host, or running any other user routine.
For the remaining time, the CPU is considered idle. CPU load is expressed by

\[ \mathrm{CPU}_{\mathrm{load}} = \frac{c_w}{c_w + c_i} \times 100 \tag{4.1} \]

where $c_w$ and $c_i$ are work and idle instruction cycles, respectively. CPU loading can be
viewed graphically in a window with continuous updates if there are enough idle cycles
to transfer this statistic to the host. Otherwise the values can be obtained after halting the
target and retrieving the stored loading values.
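
As a worked instance of Equation 4.1, the following Python sketch computes the load from a pair of cycle counts. The function name and the sample counts are illustrative assumptions, not measured values.

```python
def cpu_load(work_cycles, idle_cycles):
    """CPU load in percent: work cycles over total cycles, times 100."""
    total = work_cycles + idle_cycles
    if total == 0:
        raise ValueError("no cycles counted")
    return 100.0 * work_cycles / total


# Example: 75 work cycles out of every 100 cycles gives a 75% load.
load = cpu_load(75, 25)
```
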

4.5 Real-time Parameter Updates

A useful capability for testing the Retinex algorithm is the ability to update parameters in
real time. TI provides a mechanism to interact with an application in real time called
real-time data exchange (RTDX) [80]. RTDX plug-ins provide a means to transfer data
between a host computer and DSP devices via the JTAG interface with minimal interference
with the target application. A small RTDX library runs on the target DSP while another
runs on the host. An application executing on the target makes function calls to the
RTDX target library's API to send or receive data. The host library, working within CCS,
provides a component object model (COM)⁶ API for communication. Any object linking
and embedding (OLE)⁷ automation client on the host can be used for display or analysis.
We developed our own OLE client using Visual Basic to update the Retinex gain ($\alpha$),
offset ($\beta$), and the standard deviation of the Gaussian surround ($\sigma$).

⁶ COM is a Microsoft developed technology that allows communication between software components.
⁷ OLE is a Microsoft developed standard that enables the creation of an object in one application that can be
linked or embedded in a second application.


4.6 Retinex Task Within DSP/BIOS

Our code for the Retinex is written to execute as a task within the DSP/BIOS environment.
This allows explicit use of the real-time analysis tools. In general, two tasks, “main”
and “video processing”, are scheduled. First, “main” performs a few initializations, such
as setting up the chip support library, configuring the cache, and opening an EDMA
channel, and then returns. The “video processing” task is then set to run automatically by
the DSP/BIOS scheduler. The video processing task consists of the following steps:

• set up several video parameters such as capture and display frame sizes,
• receive a frame from the capture frame buffer,
• call (and wait on) the Retinex processing function,
• display the Retinex output and optionally display the unprocessed frame,
• exchange capture and display buffers, and then return to read another frame.

STS objects are coded within the “video processing” task to determine the overall
execution time of the Retinex processing function. Several STS objects are also placed within
the Retinex processing function to determine internal performance characteristics. This
helps to isolate the primary time consumers or “tall-poles” within the algorithm. STS API
calls to set the time on an STS object, and then to check the change in time after execution
of some portion of code, require approximately 18 and 21 instructions respectively. These
values can be removed for a more accurate measure of performance.


Chapter 5

Optimizations and Performance Results
We now describe the optimization techniques we developed and applied to implement the
real-time Retinex. This discussion is the core of our research. Our discussion will focus
on the major algorithm and architecture optimizations that significantly improved
performance. Additionally, each optimization was developed on the basis that it would not
cause any perceptible loss in image quality.
Our baseline algorithm and architecture targets are the single-scale monochrome version
of the Retinex (SSMR) and the C6711 DSP on the C6711 EVM in our test-bed. The
SSMR is the simplest form of the Retinex and the C6711 has the lowest performance
of the processors in this study. However, both allow us to establish our core algorithm
and architecture techniques and provide a basis for future optimizations, extensions, and
adaptation to other platforms. One change in the architecture at this point is to configure
the L2 memory as 32 KBytes of cache and 32 KBytes of SRAM. The 32 KBytes of SRAM
are sufficient to store all the required variables in our first implementation.

5.1 Single-Scale Monochrome Retinex Optimizations

5.1.1 Apply Convolution Equivalence

A fundamental component of the Retinex computation is to convolve the input image with
a Gaussian kernel. Good single-scale Retinex renditions are obtained with a large kernel
($\sigma > 80$), so performing this operation in the spatial domain is extremely time consuming.
The first, and most obvious, optimization then is to use the well-known equivalence between
convolution in the spatial domain and multiplication in the spatial-frequency domain [7, 20]

\[ f(x, y) * g(x, y) \Leftrightarrow F(\mu, \nu)\, G(\mu, \nu) \tag{5.1} \]

where $F$ and $G$ are the spatial frequency domain representations of $f$ and $g$ respectively.
We apply this concept to convolve an input image with a Gaussian kernel by employing the
2-dimensional $M \times N$ forward and inverse Discrete Fourier Transforms (DFTs) [20] defined
by

\[ F(\mu, \nu) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \exp[-j 2\pi (\mu x / M + \nu y / N)] \tag{5.2} \]

and

\[ f(x, y) = \sum_{\mu=0}^{M-1} \sum_{\nu=0}^{N-1} F(\mu, \nu) \exp[j 2\pi (\mu x / M + \nu y / N)] \tag{5.3} \]

respectively, to rewrite the SSMR equation as:

\[ R(x_1, x_2) = \alpha \left( \log(I(x_1, x_2)) - \log[\mathcal{F}^{-1}(I(\mu, \nu) F(\mu, \nu))] \right) - \beta \tag{5.4} \]

Here $I(\mu, \nu)$ and $F(\mu, \nu)$ represent the DFTs of an input image $I(x_1, x_2)$ and a Gaussian
kernel $F(x_1, x_2)$, respectively, and $\mathcal{F}^{-1}$ represents the inverse DFT.
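
The equivalence in Equation 5.1 can be checked numerically with a small 1-dimensional Python sketch. It uses a naive DFT rather than an FFT, and it places the scale factor on the inverse transform, which differs from the convention of Equation 5.2; both choices are assumptions made to keep the example short. The helper names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT; the 1/n scale factor is applied on the inverse transform."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_convolve(f, g):
    """Direct circular convolution in the spatial domain."""
    n = len(f)
    return [sum(f[m] * g[(k - m) % n] for m in range(n)) for k in range(n)]


f = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 0.25, 0.0, 0.25]
direct = circular_convolve(f, g)
product = [a * b for a, b in zip(dft(f), dft(g))]
via_dft = [v.real for v in dft(product, inverse=True)]
max_err = max(abs(a - b) for a, b in zip(direct, via_dft))
```

The direct form costs on the order of n² multiplications per output array, which is what makes the spatial-domain approach impractical for large kernels.
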

Exploiting the separability of the DFT and the computational efficiency of the FFT,
we compute 2-dimensional transforms by applying 1-dimensional FFTs to first the rows
and then the columns of the image. The computational complexity of the FFT for the
1-dimensional case is $O((N/2) \log N)$ where $N$ is the size of the complex input [20]. Thus
the computational complexity of the 2-dimensional case (where the input image dimensions
are $N \times N$) is reduced to $O(N^2 \log N)$. The FFTs are computed using the optimized TI
DSPLib. This library restricts the number of input points to a power of two, so we have
chosen to process a 256 x 256 portion of each input frame to closely match the resolution
of the cameras used in our case study as discussed in Chapter 6.
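
The row-then-column decomposition can be illustrated in Python. In this sketch a naive 1-dimensional DFT stands in for the DSPLib FFT, the transform is unscaled, and the function names are hypothetical; the point is only that transforming rows and then columns matches the direct 2-dimensional sum.

```python
import cmath


def dft1d(x):
    """Naive, unscaled 1-dimensional DFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def dft2d_rows_then_cols(img):
    """2-D DFT built from 1-D transforms of the rows, then of the columns."""
    m, n = len(img), len(img[0])
    rows = [dft1d(r) for r in img]
    cols = [dft1d([rows[y][u] for y in range(m)]) for u in range(n)]
    return [[cols[u][v] for u in range(n)] for v in range(m)]


def dft2d_direct(img):
    """2-D DFT evaluated directly from the double sum."""
    m, n = len(img), len(img[0])
    return [[sum(img[y][x] * cmath.exp(-2j * cmath.pi * (u * x / n + v * y / m))
                 for y in range(m) for x in range(n))
             for u in range(n)] for v in range(m)]


img = [[1.0, 2.0, 0.0, 1.0], [0.0, 3.0, 1.0, 2.0], [4.0, 0.0, 2.0, 1.0]]
a = dft2d_rows_then_cols(img)
b = dft2d_direct(img)
max_err = max(abs(a[v][u] - b[v][u]) for v in range(3) for u in range(4))
```
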
The specific FFT algorithm used is the floating-point radix-2 FFT [90]. TI benchmarks
the number of cycles to compute this operation by

\[ C = (2 n \log_2 n) + 42 \tag{5.5} \]

where $C$ is the number of cycles, $\log_2$ is the base 2 logarithm, and $n$ is the length of the
complex input array [90]. For a 256-point FFT this corresponds to 4138 execution cycles,
thus the C6711 operating at 150 MHz performs this operation in 27.6 microseconds (µs)
under ideal benchmark conditions. To forward transform the 256 rows of a 256 x 256
image requires ≈ 7 milliseconds (ms). All of the 256 columns of the transformed image
must then be forward transformed and later, both the rows and columns must be inverse
transformed (IFFT), resulting in a total of 1024 256-point forward and inverse transforms
for the input image. The Gaussian kernel must also be forward transformed, resulting in an
additional 512 FFTs, so the total number of transforms is 1,536. Prior to implementation,
we felt that of all the calculations performed within the algorithm, performing the 1,536
FFTs would consume the majority of the execution time. However, experimental evidence
showed otherwise as we discuss in Section 5.1.3.
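
The benchmark arithmetic above can be reproduced with a few lines of Python. The last figure, the ideal time for all 1,536 transforms, is derived here from Equation 5.5 and is not a measurement from the text.

```python
import math

CLOCK_HZ = 150e6  # C6711 clock rate used in the text


def fft_cycles(n):
    """Benchmark cycle count for an n-point radix-2 FFT (Equation 5.5)."""
    return 2 * n * int(math.log2(n)) + 42


cycles = fft_cycles(256)
t_fft_us = cycles / CLOCK_HZ * 1e6            # one 256-point FFT, microseconds
t_rows_ms = 256 * cycles / CLOCK_HZ * 1e3     # 256 row FFTs, milliseconds
t_all_ms = 1536 * cycles / CLOCK_HZ * 1e3     # all 1,536 transforms, milliseconds
print(cycles, round(t_fft_us, 1), round(t_rows_ms, 2), round(t_all_ms, 2))
```

Under these ideal conditions the transforms alone would account for only a few tens of milliseconds per frame.
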

5.1.2 Pre-Compute the Kernel

To reduce the number of FFTs performed we developed our first optimization for the
algorithm. As is commonly done in practice we pre-compute and store the coefficients (or
“twiddle-factors” [61]) used to calculate the FFT/IFFT. Our basic idea then was to use a
similar technique for the Gaussian surround functions. For the SSMR there is only one scale
so we only had to generate one surround function. Two key concepts were implemented that
not only reduced the number of FFTs, but also significantly reduced the amount of memory
that must be used by the algorithm. First, the Gaussian kernel is directly generated and
applied in the spatial frequency domain, thus eliminating the requirement to perform the
FFT of the kernel. Second, the Gaussian is separable and circularly symmetric [63], and is
its own (scaled) Fourier transform, so it can be expressed as the product of two 1-dimensional
functions and can be decomposed into horizontal and vertical projections along these
dimensions. Circular symmetry implies that the two projections are the same, and the left
half of either projection is the same as the right half flipped about the halfway point. Thus
we only need to keep a single 128-point array of surround values to multiply with the spatial
frequency domain image data. In practice we used a 256-point array to simplify indexing.
Using this array instead of the full spatial frequency domain representation of the kernel
saves ≈ 0.5 MBytes.
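
A minimal Python sketch of the idea follows. It assumes the frequency-domain values are taken from the continuous Fourier transform of a unit-area Gaussian, exp(-2π²σ²k²/n²), and that the full kernel would otherwise be stored as complex single-precision values; the function names are hypothetical and the sketch is not the DSP implementation.

```python
import math


def gaussian_freq_1d(n, sigma):
    """1-D frequency-domain Gaussian for an n-point DFT, indexed like DFT bins."""
    half = [math.exp(-2.0 * (math.pi * sigma * k / n) ** 2) for k in range(n // 2 + 1)]
    # Bins n/2+1 .. n-1 are negative frequencies: mirror of bins n/2-1 .. 1.
    return half + half[-2:0:-1]


def apply_kernel(spectrum, g):
    """Multiply an n x n spectrum in place by the separable kernel g[v] * g[u]."""
    for v, row in enumerate(spectrum):
        for u in range(len(row)):
            row[u] *= g[v] * g[u]


g = gaussian_freq_1d(256, 80.0)
full_kernel_bytes = 256 * 256 * 2 * 4   # assumed complex single-precision storage
array_bytes = 256 * 4                   # one 256-point single-precision array
```

The single array serves both the horizontal and the vertical direction, so no 2-dimensional kernel is ever stored or transformed.
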


5.1.3 Baseline Algorithm Performance

Using our ideas for the Gaussian kernel we implemented the first DSP version of the Retinex.
Table 5.1 summarizes the actual measured execution time of the overall algorithm and
selected components within the algorithm. These times were obtained by placing STS
objects, discussed in Section 4.4, within the algorithm.

Item              Time (ms)
retinex           1333.42 (0.75 fps)
fwdprocessrows    476.11
fftrows           9.76
logorig           461.72
fwdprocesscols    170.77
multkernel        13.46
invprocesscols    157.83
invprocessrows    528.71
rtxeq             507.80

Table 5.1: Initial performance results from the first implementation of the SSMR.

The “retinex” item is the total time to perform the SSMR for one frame. The time
to “fwdprocessrows” is the summation of (1) reading a row of image data from external
memory into local memory, (2) preparing a complex input array for the FFT, (3) performing
the FFT on the data, (4) storing the transformed row data back in external memory for
processing at a later stage of the algorithm, and (5) calculating the logarithm of each pixel
in the row and storing it in external memory. The row FFTs are computed as the first stage
of transforming the image data from the spatial domain into the spatial frequency domain.
The time to perform just the FFTs of the rows (256 256-point FFTs) is the “fftrows” item in
the table. The 9.76 ms time is relatively close to the 7 ms benchmark.
The logarithm computations on the input image are also performed at this point since
the input image pixel is already in the cache for the FFT. Like the FFT data, the results
are used later in the computation of the SSMR so the values are stored in external memory.
These calculations, represented as the “logorig” item in the table, take a very long
time, 461.72 ms. This time is much larger than originally anticipated. We discuss a method
that we developed and applied to reduce this time in Section 5.1.4.
Similar to “fwdprocessrows”, the “fwdprocesscols” time is the summation of (1)
reading a column of image data (that has already been row transformed) from external
memory, (2) performing an FFT on the data, completing the 2-dimensional image transform,
(3) multiplying the now spatial frequency domain image data with the kernel, and (4) storing
the processed image data back into external memory for further processing at a later time.
The multiplication of the spatial frequency domain image data with the kernel also takes
a considerable amount of execution time: 13.46 ms, shown as “multkernel” in the table.
We discuss a method that we developed and implemented to significantly reduce this in
Section 5.1.6. The “invprocesscols” and the “invprocessrows” times are the summations
of (1) reading a column/row from external memory, (2) performing an inverse FFT on the
column/row, and (3) storing the column/row in external memory. The “invprocessrows”
item also includes the time to calculate the last stage of the algorithm: the final equation
to generate each output pixel value after all preliminary values have been calculated. The
time for the “rtxeq” item represents this value. The time to compute this stage is also very
long because it contains the second calculation of the logarithm function, applied to the
convolved image data.


5.1.4 Pre-Compute the Logarithm

Directly executing the logarithm function is an expensive operation. The C6711 run-time
support library benchmarks 952 execution cycles for a double-precision (64-bit) natural
logarithm calculation and 152 execution cycles for a single-precision (32-bit) calculation [85].
Thus with a clock speed of 150 MHz, each double-precision log operation requires 6.35 µs.
This operation is performed for every pixel, so the total benchmark time is 415.93 ms, which
corresponds closely to the value obtained¹. Our initial implementation used this double-precision
function call. However, using the single-precision function does not sacrifice image quality.
Changing to the single-precision function reduced the “log_orig” time from 461.72 ms to
69.05 ms, and the “rtxeq” time from 507.80 ms to 92.82 ms. This reduced the total Retinex
execution time, “retinex”, from 1333.42 ms (0.75 fps) to 525.83 ms (1.90 fps). This is a
substantial decrease in the execution time of the algorithm, but the logarithm computation
is still a significant portion of the total time.

¹ The slight discrepancy is due to minor additional operations, such as data type conversions, that are
performed within the measurement interval, and loop indexing and STS object overhead.

To further eliminate this bottleneck we used the fact that the input to the logarithm is
limited to integer values in the range of 0 to 255, and formulated the idea of pre-computing
the logarithm values and storing the values in look-up tables (called log tables). We
generated another optimization by embedding the Retinex parameters $\alpha$ and $\beta$ into the log
tables. In observing the SSMR equation from Chapter 2 (repeated here for convenience),

\[ R_i(x_1, x_2) = \alpha \left( \log(I_i(x_1, x_2)) - \log(I_i(x_1, x_2) * F(x_1, x_2)) \right) - \beta \tag{5.6} \]


we can distribute $\alpha$ and group $\beta$ with the first term to produce

\[ R_i(x_1, x_2) = \underbrace{(\alpha \log(I_i(x_1, x_2)) - \beta)}_{P_i(x_1, x_2)} - \underbrace{(\alpha \log(I_i(x_1, x_2) * F(x_1, x_2)))}_{Q_i(x_1, x_2)} \tag{5.7} \]

where

\[ P_i(x_1, x_2) = (\alpha \log(I_i(x_1, x_2)) - \beta) \quad \text{for } I_i(x_1, x_2) \in \{1, \ldots, 255\} \tag{5.8} \]

and

\[ Q_i(x_1, x_2) = (\alpha \log(I_i(x_1, x_2) * F(x_1, x_2))) \quad \text{for } (I_i(x_1, x_2) * F(x_1, x_2)) \in \{1, \ldots, 255\}. \tag{5.9} \]

If $I_i(x_1, x_2) = 0$ then we assign $P_i(x_1, x_2) = -\beta$, and if $(I_i(x_1, x_2) * F(x_1, x_2)) = 0$ then we
assign $Q_i(x_1, x_2) = 0$. We can generate two log tables: the first one for $P_i(x_1, x_2)$ and the
second one for $Q_i(x_1, x_2)$. The tables require 1 KByte each, so the additional memory for
two tables instead of one is insignificant. The simple regrouping and embedding of $\alpha$ and $\beta$
eliminates one multiplication and one addition per pixel per band ($i$ in the equations above)
and could save up to 131,072 execution cycles per band depending upon the order of
implementation². The most important reduction though is just from using table look-up.

² If properly ordered, the multiply-accumulate function of the DSP can perform this operation in 65,536
execution cycles per band.

The measurement results are shown in Table 5.2. The time to perform the logarithms is
now 18 times less than when using direct single-precision logarithm calculations. The total
execution time is now 385.05 ms, which corresponds to 2.59 fps. This is still well below
our minimum target value of 15 fps for real-time processing.
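
The two log tables of Equations 5.8 and 5.9 can be sketched in Python as follows. The sketch assumes the natural logarithm and assumes the convolved value has already been quantized to an integer in 0 to 255; the function names and the sample parameter values are hypothetical.

```python
import math


def build_log_tables(alpha, beta):
    """Build the P and Q look-up tables with alpha and beta folded in."""
    # Index 0 uses the special cases from the text: P = -beta, Q = 0.
    p = [-beta] + [alpha * math.log(v) - beta for v in range(1, 256)]
    q = [0.0] + [alpha * math.log(v) for v in range(1, 256)]
    return p, q


def retinex_pixel(p, q, intensity, surround):
    """Output value from two table look-ups and a single subtraction."""
    return p[intensity] - q[surround]


p, q = build_log_tables(alpha=2.0, beta=0.5)
value = retinex_pixel(p, q, intensity=200, surround=100)
direct = 2.0 * (math.log(200) - math.log(100)) - 0.5
```

Per pixel, the two logarithm calls, the multiplication by $\alpha$, and the subtraction of $\beta$ collapse into two indexed reads and one subtraction.
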

Item              Time (ms)
retinex           385.05 (2.59 fps)
fwdprocessrows    21.0
fftrows           9.94
logorig           3.67
fwdprocesscols    170.85
multkernel        12.39
invprocesscols    157.65
invprocessrows    35.52
rtxeq             14.46

Table 5.2: Performance measurements after using logarithm tables and combining $\alpha$ and $\beta$.

As can be seen from Table 5.2, there is a large discrepancy in the time it takes to process a
row versus a column: the “fwdprocesscols” time is eight times that of the “fwdprocessrows”
time. If the principal cost of computations were the FFT, the time to perform both of these
operations should be roughly the same. We determined that the row and column times
are substantially different because the processing is not driven by FFT computations, but
rather by data transfers. To quantify this, additional STS objects were added to directly
measure the column read and write times. To read a complex 256-point integer column
from external memory and to write it back required 148.3 ms. This represents over 93% of
the “fwdprocesscols” time.
The primary cause of the discrepancy between row and column execution time can be
determined by examining the memory requirements of the algorithm and the DSP
architecture. The most efficient data processing operations occur when the processor has very fast
access to the data, i.e., when the data is located in the cache or in L2 memory. While we
do not have direct write access to the L1P or L1D caches, we do have access to, and some
control over, the next fastest access location: L2 memory. The C6711 has a 64-KByte L2
memory that can be configured as cache, SRAM, or a combination of the two as discussed
in Section 3.1. Optimum performance can be obtained if all of the transformed image data
is located in the L2 memory, but unfortunately, 64 KBytes is nowhere near the required
capacity: the input image itself is 64 KBytes. Additionally, the DSPLib FFT routines require
input and output data in complex format, i.e., each point must have a real and an imaginary
(zero for our input purposes) component, which doubles the storage size. Also, the data is in
floating point (four byte) format, so the actual memory required to store a transformed 256
x 256 image is 512 KBytes. Thus the image data must be kept in and fetched from external
memory.
Operating directly on data located in external memory incurs a large performance
penalty, so performing the FFT efficiently on a row of an image requires reading all
the contiguous pixels of the row from external memory into a buffer located in L2 memory.
The first pixel read of a row is accompanied by reading 3 additional pixel points into
the 32-Byte line of the L1D cache. Accessing the first pixel causes L1D cache and L2
memory misses, but accessing the next three pixels in the row returns a cache hit and the
data is retrieved in one clock cycle. To process a column requires accessing non-contiguous
pixels with a stride equal to the number of columns. So, transferring a column
of image data from external memory generates an L1D and L2 memory miss for each pixel,
thus severely degrading performance. Additionally, we cannot take advantage of any
temporal locality for the data since we are only using the data once at this point within the
algorithm. In order to improve the L2 memory transfer time for column-wise image data
we must change the mechanism for access.
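
The difference between row and column access can be illustrated with a toy Python model. It assumes 8 bytes per point (a complex single-precision value) and a cache that remembers only the most recently fetched 32-byte line; it is not a model of the actual C6711 cache hierarchy, and the names are hypothetical.

```python
def count_line_misses(addresses, line_bytes=32):
    """Count misses for a cache that holds only the most recently fetched line."""
    misses, current = 0, None
    for address in addresses:
        line = address // line_bytes
        if line != current:
            misses += 1
            current = line
    return misses


N = 256
ELEM = 8  # assumed bytes per complex single-precision point
row_addresses = [x * ELEM for x in range(N)]        # contiguous points
col_addresses = [y * N * ELEM for y in range(N)]    # one full row apart

row_misses = count_line_misses(row_addresses)
col_misses = count_line_misses(col_addresses)
```

In this model a row access misses once for every four points, while a column access misses on every point.
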

5.1.5 Use DMA to Transfer Columns

Our next idea was to use the EDMA controller in the C6711 to handle data transfers
between L2 memory and external memory. This saves processor cycles used to transfer the
data, and, since the transfer can be performed in the background, this enables overlapping
processor execution with data transfers if coordinated correctly. The chip support library
for the C6711 provides the capability to perform 2-dimensional transfers by specifying the
number of bytes per line, the number of lines, and the number of bytes between the start
of one line and the next. If we set these parameters to transfer a column of image data, we
can exploit the efficiency of this transfer to speed up column processing of the image.
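
How the three transfer parameters select a column can be shown with a software model in Python. The function `transfer_2d` is a hypothetical stand-in that mimics the addressing only; it is not the chip support library API and performs no background transfer.

```python
def transfer_2d(src, start, bytes_per_line, num_lines, line_pitch):
    """Gather num_lines chunks of bytes_per_line bytes, line_pitch bytes apart."""
    out = bytearray()
    for i in range(num_lines):
        offset = start + i * line_pitch
        out += src[offset:offset + bytes_per_line]
    return bytes(out)


# A 4 x 4 image of 2-byte elements stored row by row (32 bytes in total).
ELEM, COLS, ROWS = 2, 4, 4
image = bytes(range(ROWS * COLS * ELEM))

# Column 1: each "line" is one element, and lines are one image row apart.
column = transfer_2d(image, start=1 * ELEM, bytes_per_line=ELEM,
                     num_lines=ROWS, line_pitch=COLS * ELEM)
```

Setting the line length to one element and the pitch to one image row turns a strided column into a single contiguous destination buffer.
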

Item              Time (ms)
retinex           134.44 (7.44 fps)
fwdprocessrows    19.05
fftrows           9.90
logorig           3.27
fwdprocesscols    39.24
multkernel        9.84
invprocesscols    28.73
invprocessrows    47.86
rtxeq             16.21

Table 5.3: Performance results after using 2D DMA data transfers.

The improvements gained by using this method are shown in Table 5.3. The total time to
transfer and perform processing on the columns is now only 67.97 ms as compared to 328.5
ms earlier, thus reducing the total SSMR execution time to 134.44 ms (7.44 fps). Note
that the “multkernel” execution time is reduced because the processor does not have to wait
for data to arrive from external memory to begin execution. However, the “invprocessrows”
time has increased. This occurs because the next processing stage must now wait until
the last column transfer is complete to begin execution. In the prior implementation this
function was part of the “invprocesscols” but was composed of execution cycles to transfer
data rather than wait cycles. We discuss methods that we developed to eliminate this and
other wait cycles in Section 5.1.8.

5.1.6 Reduce Gaussian Kernel Computations

A property of the Gaussian function that we can exploit to significantly improve performance
is that the tails of the function rapidly decrease to zero for large $\sigma$. This implies that a large
percentage of values in the 256-point Gaussian kernel array will be zero. If we preset (to
zero) the buffer that will hold the convolution result, the loop to process the convolution can
be terminated early with proper indexing and checks for the first zero value in the surround
array. Table 5.4 shows the result of implementing this optimization. The time to multiply
the kernel is reduced from ~9 ms to 150 μs with $\sigma = 80$. We should note that this time is
dependent upon the extent of the surround, and the performance will degrade, ultimately
back to 9 ms, as narrower surrounds³ are chosen.
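The early-termination idea can be sketched as follows. This is a minimal Python illustration, not the DSP code: it assumes (hypothetically) that the kernel array is stored so that its non-zero values come first and that every value after the first zero is also zero. The name `multiply_kernel` is introduced here only for the example.

```python
def multiply_kernel(spectrum, kernel):
    """Multiply element-wise, stopping at the first zero kernel value."""
    result = [0j] * len(spectrum)      # result buffer preset to zero
    for i, k in enumerate(kernel):
        if k == 0:                     # assumed: everything past this point is zero too
            break
        result[i] = spectrum[i] * k
    return result


spectrum = [1 + 1j, 2, 3, 4]
kernel = [1.0, 0.5, 0.0, 0.0]
filtered = multiply_kernel(spectrum, kernel)
```

Because the buffer already holds zeros, the skipped entries need no work at all, so the cost of the loop scales with the number of non-zero kernel values rather than with the array length.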
³A narrow surround in the spatial domain is wide in the spatial frequency domain and vice versa.

Item                Time (ms)
retinex             125.04 (8.00 fps)
fwdprocessrows      18.99
fftrows             9.91
logorig             3.29
fwdprocesscols      29.33
multkernel          0.15
invprocesscols      28.96
invprocessrows      47.76
rtxeq               16.29

Table 5.4: Performance results after using 2D DMA data transfers.

We also discovered that performance can be improved by changing the way one initializes
the complex array. To generate the complex input array for the FFT we must interleave a
real (image data) value with an imaginary (zero) value. Ordinarily one would simply zero
out the array by using some function call and then fill in every even indexed array value
with the real components. We found that it is more efficient to write the real component
and then immediately write zero into the next array value. This occurs because we only
have to load and access the input array in the L1D cache once instead of twice plus function
call overhead for the first method.
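The two ways of building the interleaved complex array can be contrasted in a short sketch. The Python below only shows the access pattern; the cache effect described in the text is a property of the DSP and is not reproduced by this code, and both function names are hypothetical.

```python
def interleave_two_pass(row):
    """Zero the whole array first, then fill every even index with the real part."""
    buf = [0.0] * (2 * len(row))       # first pass over the array
    for i, value in enumerate(row):    # second pass over the array
        buf[2 * i] = float(value)
    return buf


def interleave_one_pass(row):
    """Write each real value and immediately follow it with a zero imaginary value."""
    buf = []
    for value in row:
        buf.append(float(value))       # real component
        buf.append(0.0)                # imaginary component
    return buf


pixels = [3, 5, 7]
```

Both functions produce the same array; the second touches each output location once.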

5.1.7 Merge Algorithm Components

The next significant performance increase was obtained by identifying redundant transformation
cycles in the algorithm. In our original implementation we performed the following
sequence of operations:

• For all rows: read in a row, FFT, and write the result to external memory.

• For all columns: read in a column, FFT, and write the result to external memory.

• For all columns: read in a column, convolve with the Gaussian kernel, and write the
result to external memory.

• For all columns: read in a column, IFFT, and write the result to external memory.

• For all rows: read in a row, IFFT, and write the result to external memory.

The remainder of the SSMR calculation is then performed. We can take advantage of the
independence of each column of image data by merging some of the preceding steps and
thus eliminating several stages of data transfers. As soon as we have performed the FFT of
a column, we can continue processing this column, multiplying it with the kernel, and then
immediately perform an IFFT of the column. The processing stages then become:

• For all rows: read in a row, FFT, and write the result to external memory.

• For all columns: read in a column, FFT, multiply with the Gaussian kernel, IFFT,
and write the result to external memory.

• For all rows: read in a row, IFFT, and write the result to external memory.

This saves four read and write transfers to external memory. Table 5.5 shows the results
of implementing this optimization. The “fwdprocesscols” and “invprocesscols” items are
now merged into the “processcols” item. Additional optimizations were also performed to
reduce the “rtxeq” time. These include moving all tables into L2 memory and performing
a 1-dimensional DMA transfer for the final output values. The total execution time of the
algorithm is down to 83.06 ms (12.04 fps). This is now approaching real-time performance.
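The saving from merging the column stages can be checked with a small model that counts external-memory transfers. This is a sketch under stated assumptions: a naive O(n²) DFT stands in for the optimized FFT routines, each column read or write is counted as one transfer, and all function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT, standing in for the optimized FFT routines."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def process_columns_separately(columns, kernel_columns):
    """Three passes; every pass reads and writes each column (2 transfers)."""
    transfers = 0
    data = list(columns)
    for stage in ("fft", "multiply", "ifft"):
        for c, col in enumerate(data):
            transfers += 2             # one read from and one write to external memory
            if stage == "fft":
                data[c] = dft(col)
            elif stage == "multiply":
                data[c] = [v * k for v, k in zip(col, kernel_columns[c])]
            else:
                data[c] = dft(col, inverse=True)
    return data, transfers


def process_columns_merged(columns, kernel_columns):
    """One pass; each column is read once, fully processed, and written once."""
    transfers = 0
    out = []
    for col, kern in zip(columns, kernel_columns):
        transfers += 2
        spectrum = dft(col)
        filtered = [v * k for v, k in zip(spectrum, kern)]
        out.append(dft(filtered, inverse=True))
    return out, transfers


columns = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, -1.0]]
kernel_columns = [[1.0, 0.5, 0.25, 0.5], [1.0, 0.5, 0.25, 0.5]]
separate, t_separate = process_columns_separately(columns, kernel_columns)
merged, t_merged = process_columns_merged(columns, kernel_columns)
```

In this model the separate version performs six transfers per column and the merged version two, a difference of four per column, while both produce the same filtered data.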

Item                Time (ms)
retinex             83.06 (12.04 fps)
fwdprocessrows      17.42
fftrows             9.8
logorig             3.25
processcols         41.14
multkernel          0.16
invprocessrows      34.11
rtxeq               5.3

Table 5.5: Performance results after merging algorithm stages. Since the forward and inverse
column execution times are effectively merged together, the time to process columns is now in item
“processcols”.

5.1.8 Minimize Data Transfer Overhead

We then focused on formulating and applying a method to minimize the overhead of transferring
data between external and internal memory. Instead of using processor cycles to
perform this function, we used the DMA capability within the processor to perform all
external-to-internal memory transfers. We were already using this function to perform
2-dimensional column transfers as mentioned in Section 5.1.5 and 1-dimensional array transfers
for the final output values of the algorithm as discussed in Section 5.1.7. We now add
additional DMA transfers for the row data transfers of FFT data and for the logarithm
of the input image data. Storing the logarithm of the input data requires 256-KBytes, far
larger than the memory available in the L2 memory, so these values must be kept in external
memory.

Performing DMA transfers and waiting for completion obviously reduces the effectiveness
of using DMA. To avoid this we implemented a double buffering scheme to move from a
data I/O-limited algorithm to an execution cycle-limited algorithm. As noted earlier, DMA
allows data transfers to occur independently or in the background of any processor activity.
We developed an algorithm and implemented a series of buffers so that as we process one
buffer, we simultaneously transfer in the next data to be processed. This double buffering
scheme was used for all DMA transfers and removed the requirement to wait for any DMA
transfer. Without having to wait, reading in more than one unit of transfer (e.g. two rows
or two columns) did not improve performance.
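The double buffering pattern described above can be sketched in Python, with a worker thread playing the role of the DMA controller. This is only an analogy under stated assumptions: `dma_read` and `process` are hypothetical stand-ins, and a real implementation would use the EDMA hardware and fixed on-chip buffers rather than threads and lists.

```python
from concurrent.futures import ThreadPoolExecutor


def dma_read(external, index):
    """Stand-in for a background DMA transfer of one row into on-chip memory."""
    return list(external[index])


def process(row):
    """Stand-in for the per-row computation (here simply a sum)."""
    return sum(row)


def run_double_buffered(external):
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(dma_read, external, 0)      # prime the first buffer
        for i in range(len(external)):
            current = pending.result()                   # waits only if the transfer is unfinished
            if i + 1 < len(external):
                pending = dma.submit(dma_read, external, i + 1)  # start the next transfer
            results.append(process(current))             # overlaps with the transfer in flight
    return results


rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
totals = run_double_buffered(rows)
```

As long as processing one buffer takes at least as long as transferring the next, the wait on `pending.result()` returns immediately and the loop becomes compute-bound, which matches the observation that fetching more than one unit ahead gives no further benefit.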


5.1.9 Use Cache-optimized FFTs

After all of the previous optimizations, we returned to trying to improve the FFT. We
identified and applied a more efficient form of the FFT algorithm, a cache-optimized (SP x
SP) algorithm that allows the use of mixed radix FFTs that can be calculated in multiple
passes. A 256-point FFT only needs one pass and can be effectively calculated using the
cache-optimized FFT in radix-4 mode. The benchmark equations for the cache-optimized
FFT suggested that we could obtain better performance from this version versus the radix-2
form. The number of cycles C to compute the FFT with this algorithm is given by:

$$ C = 3\lceil \log_4(n) - 1 \rceil n + 21\lceil \log_4(n) - 1 \rceil + 2n + 44 \tag{5.10} $$

where C is the number of cycles, $\log_4$ is the base 4 logarithm, and n is the length of the
complex input array. For a 256-point FFT C = 2923 cycles, or 19.5 μs, corresponding to a
30% increase in FFT performance. To forward transform the 256 rows of a 256 x
256 image takes approximately 5 ms.
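The figures quoted for Equation 5.10 can be reproduced with a few lines of arithmetic. The sketch below assumes the 150 MHz C6711 clock that the text uses when comparing clock speeds (225/150); the function names are hypothetical.

```python
def ceil_log4(n):
    """Smallest integer k with 4**k >= n, computed without floating point."""
    k, power = 0, 1
    while power < n:
        power *= 4
        k += 1
    return k


def fft_cycles(n):
    """Benchmark cycle count of Equation 5.10 for an n-point complex FFT."""
    stages = ceil_log4(n) - 1          # ceil(log4(n) - 1)
    return 3 * stages * n + 21 * stages + 2 * n + 44


CLOCK_MHZ = 150                        # assumed C6711 clock rate
cycles = fft_cycles(256)
microseconds_per_fft = cycles / CLOCK_MHZ
milliseconds_for_256_rows = 256 * microseconds_per_fft / 1000
```

For n = 256 this gives 2923 cycles, about 19.5 μs per transform and about 5 ms for 256 rows, in agreement with the values in the text.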

Implementing the double buffering scheme and changing the FFT algorithm for the
C6711 allowed us to achieve our final C6711 SSMR execution time of 48.33 ms or 20.7 fps.
Table 5.6 shows the timing for the individual components of the algorithm. A sample output
image frame from a video taken of a bookcase is displayed in Figure 5.1. The input image
is shown on the left while the Retinex enhanced image is shown on the right. The enhanced
image has greatly improved contrast and sharpness. Details that are indistinguishable in
the original are easily noticed in the enhanced image.

Item                Time (ms)
retinex             48.33 (20.7 fps)
fwdprocessrows      12.83
fftrows             6.98
logorig             3.20
processcols         20.55
multkernel          0.16
invprocessrows      14.92
rtxeq               5.59

Table 5.6: Final SSMR performance results using the C6711 DSP.

5.2 Map Optimized SSMR to C6713

To improve and compare performance we mapped the same optimized SSMR code developed
for the C6711 onto the C6713. Considering the similarity in architectures this should provide
a near linear increase in performance corresponding to the increase in clock speeds between
the devices. Thus performance should improve by 50% (225/150) and the expected frame
rate should be close to 31 fps. The larger L2 memory on the C6713 is not used because
all of the memory allocated in the current implementation fits in 64-KBytes, and the extra
192-KBytes of L2 SRAM on the C6713 are not large enough to move any of the significant
data structures into on-chip memory.

After porting the code to the C6713, the algorithm ran successfully and we obtained a
frame rate of only 28 fps. The 35% increase is sub-linear. This occurs because the C6713
EVM has a slower EMIF clock that controls the transfer rate to external memory. The
C6711 EVM uses a 100 MHz EMIF clock while the C6713 EVM uses a 90 MHz clock. This
reduces the external data transfer rate to the extent that the processor must now wait for
DMA transfers to complete.

Figure 5.1: Captured video frame with input from the camera on the left, and Retinex output on the
right. Retinex parameters are $\alpha = 175$, $\beta = 135$, and $\sigma = 80$; note that we are nearly reaching
the noise limit of the camera.

5.3 Map Optimized SSMR to DM642

Although either of the C671X platforms would perform adequately for many applications,
it is obvious that neither has the performance capability to meet real-time multi-spectral,
multi-scale Retinex processing requirements. So next, we ported the SSMR algorithm to the
DM642 platform. Although the DM642 uses different image capture and display drivers,
DMA mechanisms, and FFT algorithms than the C6711/C6713, the core structures and
methods developed and implemented on the C6711 remained the same. Directly comparing
DM642 MIPS with the C6711 shows a potential four-fold increase in performance. However,
other factors such as extra computations to handle fixed-point arithmetic, and different
processor-specific instructions, libraries, and EMIF bus speeds affect performance. These
modifications do not allow a direct comparison with C67X performance but we should
anticipate approximately 70 to 90 fps performance.


Fixed-point arithmetic limits the dynamic range of the DM642 to $2^{31} - 1$. This is
sufficient for some portions of the algorithm. For example, the input to a 256-point radix-4
FFT is processed in 4 stages where each stage gives 2 bits of growth. Our 8-bit image data
input will then only grow to a maximum of 16 bits for one forward transform. Since we
generate a 2-dimensional Fourier transform, a second 256-point FFT is also performed. This
increases the growth to 32 bits which still fits in a standard integer data type. However,
the now spatial frequency domain image data is then multiplied with a kernel. The largest
numbers from the FFT operation are on the order of $10^{8}$. The smallest numbers from the
normalized spatial frequency Gaussian kernel are truncated⁴ at $10^{-6}$. Thus we must process
values on the order of $10^{14}$ which, without scaling, is well beyond the capability of 32-bit,
fixed-point representation.

To perform scaling we invoke a few simple arithmetic conventions. For example, to
multiply an integer number I by 0.6913 (approximately log(2)) one could perform

$$ R = ((I \times 6913) + 5{,}000) / 10{,}000 \tag{5.11} $$

where 5,000 is added to perform rounding. If I = 46, floating-point multiplication yields
31.7998 while our fixed-point method yields R = 32. Because a right shift by one bit is
equivalent to division by 2, we can improve the efficiency of this operation by dividing by a
number that is a power of 2. Using 8192 ($2^{13}$) in our previous example our new multiplier
becomes 0.6913 * 8192 = 5663.1296. We could choose 5663 or 5664 depending upon which
value is more accurate. Choosing 5663, our scaling equation above becomes

$$ R = ((I \times 5663) + 4{,}096) / 8192 \tag{5.12} $$

and if I = 46 again, then R = 32. When scaling one must also be careful with proper
selection of the storage classes used to hold intermediate results. Recall from Section 5.1.2
that we exploit the symmetry of the Gaussian to save memory space, so to compute the
kernel values in the spatial frequency domain we multiply two of the properly indexed
array values together. We scale each individual array value by $2^{19}$ in order to retain as
much resolution as possible, so the final spatial frequency domain kernel values are on the
order of $2^{38}$. Multiplying by the maximum spatial frequency domain image values (~$2^{24}$)
results in values on the order of $2^{62}$. Fortunately the TI compiler supports 64-bit signed
and unsigned integer (long-long) data types. An alternative, but less efficient, method
to minimize the size of internal values is to generate the inverse of the spatial frequency
domain kernel values and use division instead of multiplication. The division operation is
implemented on the DM642 by repeatedly issuing a conditional subtract (SUBC)
instruction. After carefully balancing scaling and truncation tradeoffs, a fixed-point version
of the algorithm was implemented with the log values scaled by $2^{20}$ and the Gaussian kernel
table values scaled by $2^{19}$. These values maximize the retained precision without causing
overflow in intermediate or final output calculations.

⁴Significant digits beyond $10^{-6}$ are truncated without affecting image quality.
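The two scaling conventions of Equations 5.11 and 5.12, and the need for a 64-bit intermediate, can be checked with integer arithmetic. The sketch below uses Python integers, which do not overflow, so the bit-length check is only an illustration of what a fixed-width implementation would have to accommodate; the function names are hypothetical.

```python
def scale_decimal(i):
    """Equation 5.11: multiply by 0.6913 using a decimal scale factor of 10,000."""
    return (i * 6913 + 5000) // 10000


def scale_power_of_two(i):
    """Equation 5.12: multiply by 0.6913 using a scale factor of 2**13 and a shift."""
    return (i * 5663 + 4096) >> 13


# Magnitudes quoted in the text: kernel values near 2**38, image values near 2**24.
product = (1 << 38) * (1 << 24)
bits_needed = product.bit_length()     # magnitude bits, excluding the sign bit
```

Both forms give 32 for I = 46, and the product needs 63 magnitude bits, which exceeds a 32-bit integer but still fits in a signed 64-bit type.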

5.3.1 Apply Intrinsics

Another algorithm optimization implemented at this stage was to use intrinsics, originally
mentioned in Section 4.2, at strategic points within our code. For example, to clamp final
Retinex output data to values between 0 and 255, we used two intrinsics, min2 which returns
the lesser of two inputs and max2 which returns the greater of two inputs, and formed an
instruction similar to min2(max2(output_value, 0), 255). Measuring the performance⁵ of this
instruction using STS objects results in 8.8 ns per pixel (2.24 μs per 256 x 256 image).
As a comparison, to clamp the Retinex output using a standard if-then-else expression (if
output_value < 0 output_value = 0, else if output_value > 255 output_value = 255) requires
27.4 ns per pixel (7.01 μs per image). The instruction using intrinsics is over 3 times faster.
After implementing the proper scaling operations and embedding intrinsics, we achieved an
execution time of 17.89 ms (55.89 fps) for the SSMR on the DM642. This is still below our
anticipated 70 to 90 fps.
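The two clamping forms compared above compute the same result, which the following sketch confirms. Python's built-in `min` and `max` are used here only as stand-ins for the DSP intrinsics; the timing difference reported in the text comes from the hardware instructions and is not reproduced by this code.

```python
def clamp_minmax(value):
    """Branch-free form, analogous to min2(max2(output_value, 0), 255)."""
    return min(max(value, 0), 255)


def clamp_branch(value):
    """Conventional if-then-else form."""
    if value < 0:
        return 0
    elif value > 255:
        return 255
    return value


samples = [-40, 0, 17, 255, 300]
clamped = [clamp_minmax(v) for v in samples]
```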

5.3.2 Modify the Architecture

We determined that I/O was again limiting performance. The faster DM642 processor,
even performing the additional scaling calculations, executes the algorithm quicker, thus at
various points in the code the processor now has to wait for DMA transfers to complete. We
eliminated this by making an architectural change on the DM642 EVM. The default EMIF
bus rate is 133 MHz. We were able to increase the EMIF bus rate to the chip maximum of 200
MHz, effectively over-clocking the SDRAM, by strapping the appropriate resistors onto the
DM642 EVM module and changing memory access timing parameters. Implementing this
modification increased SSMR performance to 69.15 fps, effectively meeting our anticipated
performance.

⁵This measurement was performed on the C6416 processor, but the ratio remains the same for the other
processors.

5.4 Multi-Spectral Multi-Scale Retinex Optimizations

The performance achieved for the SSMR on the DM642 platform provided a baseline to
pursue real-time multi-spectral, multi-scale Retinex (MSR) performance. Expanding from
a single scale to multiple (three) scales primarily involves two additional computational
requirements: (1) performing the additional convolutions and (2) weighting and combining
the convolution results. We implemented the same technique previously developed for the
SSMR except we pre-compute a series of Gaussian kernels directly in the spatial frequency
domain and store the values in tables. The range of $\sigma$ was constrained to values
from 5 to 260 in steps of 5. Each scale would then use a pointer to the appropriate table of
the associated $\sigma$ value. Since $\sigma$ is static for each scale, the pointers are set prior to calling
the Retinex function. The total size of all the Gaussian tables is now 52-KBytes. We could
not keep this number of tables in memory on the C671X processors.

5.4.1 Reuse Transformed Input Image

Since the same input image data is convolved with each kernel, the optimum stage to
perform this function is as each column is read from external memory and transformed.
The sequence of operations at the convolution stage then becomes

• Read a column, FFT, multiply with kernel 1, IFFT, and DMA result to external
memory.

• Multiply the same column with kernel 2, IFFT, and DMA result to external memory.

• Multiply the same column with kernel 3, IFFT, and DMA result to external memory.

This not only reuses the same spatial frequency domain image data, but also allows DMA
transfers to be overlapped with processor activity. Double buffering is still implemented on
column reads to ensure that column data is always present in local memory for processing.
We improved the buffering scheme by accumulating the first spatial frequency domain
column earlier during the first stage of row processing. The amount of memory needed to
hold the convolved image data is now 1.5-MBytes, three times the previous requirement
of 512-KBytes. After the convolution stage, the image data for each scale must then be
transferred back into local memory, inverse transformed, weighted, and combined with the
other scales. Again, we use DMA to retrieve the data back into local memory, and double
buffering to perform this transfer in the background.
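The reuse of one forward transform across several kernels can be sketched as below. This is an illustration under stated assumptions, not the DSP implementation: a naive DFT stands in for the FFT library, the kernels are arbitrary example values given directly in the frequency domain, and the function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT, standing in for the optimized FFT routines."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def convolve_column_multiscale(column, kernels):
    """Transform the column once, then apply every frequency-domain kernel to it."""
    spectrum = dft(column)             # forward transform computed a single time
    return [dft([s * k for s, k in zip(spectrum, kernel)], inverse=True)
            for kernel in kernels]


column = [1.0, 2.0, 3.0, 4.0]
kernels = [
    [1.0, 1.0, 1.0, 1.0],              # all-pass: returns the column unchanged
    [1.0, 0.0, 0.0, 0.0],              # keeps only the mean value
    [1.0, 0.5, 0.0, 0.5],              # an intermediate low-pass example
]
outputs = convolve_column_multiscale(column, kernels)
```

Each additional scale costs one multiply and one inverse transform, but no additional column read or forward transform.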

5.4.2 Reduce Computations

One major optimization idea we developed and applied for weighting and combining the
scales is to rearrange the Retinex equation to reduce the number of operations that must
be performed. It has been shown that using equal weighting factors provides good Retinex
enhancement in many conditions [32]. We exploit this fact by distributing the weighting
factors in the Retinex equation

$$ R_i(x_1, x_2) = \sum_{k=1}^{K} W_k \left( \log(I_i(x_1, x_2)) - \log(I_i(x_1, x_2) * F_k(x_1, x_2)) \right) \tag{5.13} $$

$$ = \sum_{k=1}^{K} W_k \log(I_i(x_1, x_2)) - \sum_{k=1}^{K} W_k \log(I_i(x_1, x_2) * F_k(x_1, x_2)) \tag{5.14} $$

and noting that if the $W_k$ are equal ($W_k = W$) and $\sum_{k=1}^{K} W_k = 1$ then our equation
becomes

$$ R_i(x_1, x_2) = \log(I_i(x_1, x_2)) - W \sum_{k=1}^{K} \log(I_i(x_1, x_2) * F_k(x_1, x_2)) \tag{5.15} $$

This saves two logarithm computations, two subtractions, and two multiplications per pixel.
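The equivalence of the general form (Equation 5.13) and the equal-weight form (Equation 5.15) can be checked numerically for a single pixel. In this sketch the values in `surrounds` are hypothetical stand-ins for the convolved image values $(I_i * F_k)(x_1, x_2)$ at three scales, and the function names are introduced only for the example.

```python
import math


def msr_general(pixel, surrounds, weights):
    """Equation 5.13: one log of the pixel, one subtraction, and one multiply per scale."""
    return sum(w * (math.log(pixel) - math.log(s))
               for w, s in zip(weights, surrounds))


def msr_equal_weights(pixel, surrounds):
    """Equation 5.15: a single log of the pixel and a single weighting multiply."""
    w = 1.0 / len(surrounds)
    return math.log(pixel) - w * sum(math.log(s) for s in surrounds)


pixel = 120.0
surrounds = [100.0, 90.0, 80.0]        # assumed surround values for K = 3 scales
general = msr_general(pixel, surrounds, [1 / 3, 1 / 3, 1 / 3])
simplified = msr_equal_weights(pixel, surrounds)
```

For K = 3 the simplified form evaluates the logarithm of the pixel once instead of three times, which accounts for the two saved logarithms, subtractions, and multiplications.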
An additional reduction in calculations is gained by combining the proper weighting
factors into the tables already used for the two pre-computed log tables discussed in Section 5.1.4.
To pre-compute the second log table (the log table combined with /I only) values,
if two or three scales are used, then simply divide these values by the associated number of
scales, 2 or 3, respectively.

The next requirement is to add to the multi-scale algorithm the capability to process
in real-time multiple (three) spectral bands, i.e. color video. This addition is not quite as
simple as just executing the same multi-scale algorithm on each band, particularly when
embedding optimizations to improve performance. First, to perform color processing the
image data should be in the RGB color space. The video decoders and encoders on the
EVMs only work in the Y'CbCr color space. For monochromatic processing we only have to
extract the luma component from the Y'CbCr input stream. As discussed in Section 4.1.2,
on the C671X and DM642 EVMs the Y'CbCr data is stored in planar format so only a
pointer is required to address the Y' component. On the C6416 EVM the Y'CbCr data is
stored in interleaved format so the Y' component must be extracted from the frame data.
This is easily accomplished by using the 2-dimensional DMA calls discussed in Section 5.1.5.
However, whether the image data is in planar or interleaved format, the Y' data does not
need to be converted as it does if color processing is to be performed.


For color processing, the Y'CbCr input data must be converted [53] into the RGB 888
color space⁶ and the processed RGB data must be converted back into the Y'CbCr color
space for output into the video encoders. The following equations from Poynton [53] are
used to convert between Y'CbCr and gamma-corrected⁷ 8-bit computer⁸ RGB (R'G'B'):

$$ R' = 1.1644(Y' - 16) + 1.5960(C_R - 128) \tag{5.16} $$

$$ G' = 1.1644(Y' - 16) - 0.3918(C_B - 128) - 0.8129(C_R - 128) \tag{5.17} $$

$$ B' = 1.1644(Y' - 16) + 2.0172(C_B - 128) \tag{5.18} $$

Then, converting into fixed-point format using a scaling factor of $2^{13}$, the conversion equations
above become

$$ R' = (9539(Y' - 16) + 13075(C_R - 128) + 4096) \gg 13 \tag{5.19} $$

$$ G' = (9539(Y' - 16) - 3209(C_B - 128) - 6660(C_R - 128) + 4096) \gg 13 \tag{5.20} $$

$$ B' = (9539(Y' - 16) + 16525(C_B - 128) + 4096) \gg 13. \tag{5.21} $$

To encode 8-bit Y'CbCr from R'G'B' we use the following equations:

$$ Y' = 0.2568R' + 0.5041G' + 0.0979B' + 16 \tag{5.22} $$

$$ C_B = -0.1482R' - 0.2910G' + 0.4392B' + 128 \tag{5.23} $$

$$ C_R = 0.4392R' - 0.3678G' - 0.0714B' + 128. \tag{5.24} $$

Again, converting into fixed-point format using a scaling factor of $2^{13}$ yields

$$ Y' = ((2104R' + 4130G' + 802B' + 4096) \gg 13) + 16 \tag{5.25} $$

$$ C_B = ((-1214R' - 2384G' + 3598B' + 4096) \gg 13) + 128 \tag{5.26} $$

$$ C_R = ((3598R' - 3013G' - 585B' + 4096) \gg 13) + 128. \tag{5.27} $$

All RGB and Y'CbCr values should be clamped between 0 and 255, and 16 and 235 respectively.
In practice we simplify these equations by eliminating the redundant calculations.

⁶In RGB 888, each pixel is represented by an 8-bit red, green, and blue component.
⁷Gamma-correction refers to the non-linear transfer function applied to RGB values in most imaging
systems. This is used to mimic perceptual response [53].
⁸Computer RGB uses the full 8-bit range with black at code 0 and white at code 255.
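The fixed-point conversions in Equations 5.19 to 5.21 and 5.25 to 5.27 can be written directly as integer code. The sketch below uses the coefficients and clamping ranges given in the text; the function names are hypothetical, and the shared luma term is computed once as one example of removing a redundant calculation.

```python
def clamp(value, low, high):
    return min(max(value, low), high)


def ycbcr_to_rgb(y, cb, cr):
    """Equations 5.19-5.21: 8-bit Y'CbCr to 8-bit R'G'B' with a 2**13 scale factor."""
    luma = 9539 * (y - 16)             # shared by all three channels
    r = (luma + 13075 * (cr - 128) + 4096) >> 13
    g = (luma - 3209 * (cb - 128) - 6660 * (cr - 128) + 4096) >> 13
    b = (luma + 16525 * (cb - 128) + 4096) >> 13
    return tuple(clamp(c, 0, 255) for c in (r, g, b))


def rgb_to_ycbcr(r, g, b):
    """Equations 5.25-5.27: 8-bit R'G'B' to 8-bit Y'CbCr with a 2**13 scale factor."""
    y = ((2104 * r + 4130 * g + 802 * b + 4096) >> 13) + 16
    cb = ((-1214 * r - 2384 * g + 3598 * b + 4096) >> 13) + 128
    cr = ((3598 * r - 3013 * g - 585 * b + 4096) >> 13) + 128
    return tuple(clamp(c, 16, 235) for c in (y, cb, cr))


white_rgb = ycbcr_to_rgb(235, 128, 128)
black_rgb = ycbcr_to_rgb(16, 128, 128)
white_ycbcr = rgb_to_ycbcr(255, 255, 255)
```

Video white (Y' = 235 with neutral chroma) maps to RGB (255, 255, 255) and back, and video black maps to (0, 0, 0), which is a quick sanity check on the scaled coefficients.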

5.4.3 Buffer Across Spectral Bands

Another technique we developed to maintain our I/O performance is to modify our row
double-buffering scheme to buffer data across spectral bands. This modification is done
only on the row output processing stage since we need data simultaneously from all three
bands during this stage. When processing a row of data for the red spectral band, instead
of performing a DMA of the next row of red spectral data, we DMA the next row of green
spectral data. Similarly, when processing the green band, we DMA the next blue band, and
when processing the blue band, we DMA the next red band. So our buffering sequence becomes

• DMA the red band.

• Loop start: DMA the next green band; process the red band.

• DMA the next blue band; process the green band.

• DMA the next red band; process the blue band; combine bands; end loop.

After all three channels for a row are processed, i.e. the blue band is complete, the
bands are combined and converted to Y'CbCr. This optimization continues to maximize
the processing load by keeping data transfers in the background.
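The buffering sequence listed above can be written out as an explicit schedule. The sketch below only enumerates which transfer is started alongside which processing step; it performs no real transfers, and the function name and the tuple representation are assumptions made for the example.

```python
BANDS = ("red", "green", "blue")


def band_schedule(num_rows):
    """Return (dma_target, process_target) steps; each target is (row, band) or None."""
    steps = [((0, "red"), None)]       # prime: DMA the red band of the first row
    for row in range(num_rows):
        for b, band in enumerate(BANDS):
            if b + 1 < len(BANDS):
                next_transfer = (row, BANDS[b + 1])   # next band of the same row
            elif row + 1 < num_rows:
                next_transfer = (row + 1, BANDS[0])   # red band of the next row
            else:
                next_transfer = None                  # nothing left to fetch
            steps.append((next_transfer, (row, band)))
    return steps


schedule = band_schedule(2)
```

Every processing step except the last is paired with a transfer for the step that follows it, so the data needed next is always in flight while the current band is being processed.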

5.4.4 Allocate Log Values in L2 Memory

The additional transfer buffers and tables used in the optimizations discussed so far are
statically allocated in L2 memory. All of these data structures easily fit in the 256-KBytes
of L2 memory on the DM642 with a nominal allocation of ~175-KBytes used in our
implementation. However, the DM642 L2 memory is still not large enough to hold all of the
processed image data at any stage of the algorithm. As mentioned in Section 3.4, the C6416
is not only faster but has a larger L2 memory of size 1-MByte. We exploit this feature by
keeping all of the logarithm of the original image data, 768-KBytes, in L2 memory. This
uses nearly all of the L2 memory with a total allocation of 1,011,904 Bytes, but by keeping this
data local we eliminate all of the associated DMA transfers and thus improve performance.

5.4.5 MSR Performance Results

To measure the performance of the MSR we used both the DM642, with EMIF bus speeds
of 133 MHz and 200 MHz, and the C6416 processors on their respective EVMs in our test
bed outlined in Chapter 5. The graphs in Figures 5.2, 5.3, and 5.4 show the performance
obtained on the processors for the Retinex with 1 to 3 scales and 1 to 3 spectral bands. The
vertical lines are the cutoff points for real-time performance based on 15 fps and 30 fps. The
same data is shown in tabular form in Table 5.7. Execution time is shown in milliseconds,
and in frames per second in parentheses. The values for the Gaussian surrounds, $\sigma$, are 5
for 1 scale, 5 and 80 for 2 scales, and 5, 80 and 200 for 3 scales. The gain $\alpha$ and offset $\beta$
values are 250 and -100 respectively.

For 1 spectral band, implementations of the algorithm on both processors meet the 15
fps and 30 fps real-time requirements for all scales. For 2 spectral bands, implementations
on both processors again meet the 15 fps target. With 2 or 3 scales, only the C6416 meets
the 30 fps target. The DM642 with a 200 MHz EMIF only meets this target for 1 scale.
For 3 spectral bands, only the implementation on the C6416 with 1 scale meets the 30 fps
target. Performance for 3 bands, 3 scales on the C6416 is 20.25 fps. For the 200 MHz EMIF DM642,
3 band 3 scale performance is at 13 fps, just missing the 15 fps target. Interestingly,
although all implementations on each processor performed linearly, the slopes progressively
decrease from the plots for the C6416 to the 200 MHz DM642, and to the 133 MHz DM642
respectively on all three graphs. This may be due to the fact that more data is kept local to
the processor for the C6416. When there are more Retinex computations, there is more data
to be transferred, and so the algorithm becomes more I/O driven, degrading performance
at a faster rate than if it were more controlled by processing cycles as it is for the C6416.

For comparison purposes we placed STS objects in the code on the C6416 to measure
the execution time of the different stages of the algorithm as we did earlier for the C6711.
Table 5.8 shows the best single-scale, monochrome Retinex performance on the C6711 and
on the C6416 DSPs. Note the significant decrease in the time required to process the FFT.
The specific FFT used from the DSPLib for the C6416 is the mixed-radix 16x32-bit FFT⁹.

⁹The 16x32 refers to the bit width of the coefficients, and the input and output data, respectively.

Figure 5.2: Retinex performance in time (bottom axis) and frames per second (top axis) to process
1 spectral band of image data on DM642 with 133 MHz EMIF (dotted line), DM642 with 200 MHz
EMIF (dashed line), and C6416 (full line).

The benchmark number of cycles to compute this FFT is given by [83]:

$$ C = (13n/8 + 24)\lceil \log_4(n) - 1 \rceil + 1.5(n + 8) + 27. \tag{5.28} $$

For n = 256, the length of the FFT, C = 1743 cycles. This corresponds to 1.743 μs on
the 1 GHz C6416. So based on the benchmark equation, to forward transform the 256
rows of a 256 x 256 image takes ≈ 446 μs. Our measured FFT time is 516 μs, nearly meeting
the benchmark. Also note that the significant decrease in time for “rtxeq” is due to the use
of intrinsics, loop index and equation simplifications, and the increase in processor speed.
Finally, we also note the increase in time to multiply by the kernel. This occurs because
of the scaling operations performed at this stage of the algorithm and the required use of
inefficient long-long data types to hold intermediate values. As one final execution time
measure we also tested the algorithm without any internal measurement instrumentation
and only for 1 scale and 1 band. With these simplifications we obtained an execution time
of 8.9 ms (112.36 fps).

Figure 5.3: Retinex performance in time (bottom axis) and frames per second (top axis) to process
2 spectral bands of image data on DM642 with 133 MHz EMIF (dotted line), DM642 with 200 MHz
EMIF (dashed line), and C6416 (full line).
We also measured CPU load for the C6416. Unlike the previous Retinex timing measures
which only encompass the Retinex task, this is a global measure which includes frame
acquisition. Table 5.9 shows the values obtained under different Retinex configurations.
For the lower com putational requirement configurations (1 spectral-band,or 2 spectral bands
and 1 or 2 scales, or 3 spectral bands and 1 scale) the processor is underutilized. Only a
small percentage of these unused execution cycles are spent waiting for DMA to complete

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

63

Figure 5.4: Retinex performance in time (bottom axis) and frames per second (top axis) to process
3 spectral bands of image data on DM642 with 133 MHz EMIF (dotted line), DM642 with 200 MHz
EMIF (dashed line), and C6416 (full line).

due to the highly optimized code at this point. Only one or two DMA wait statements have
to be inserted into the algorithm to achieve correct operation, and this is only for the single
band, single scale case. The majority of the unused execution cycles are spent simply
waiting for the next frame from the input camera.
To visibly demonstrate the performance of the real-time algorithm we processed a video
of an outside scene at NASA LaRC in Hampton, Virginia. The video was taken on November
8, 2005 between 5:15 PM and 5:30 PM using a standard Sony TRV-20 video camera.
Sunset on this day was at 5:02 PM. For presentation in this dissertation, we have extracted
3 snapshots from the processed video. The first snapshot, shown in Figure 5.5, is taken 40
seconds into the video. The second snapshot, shown in Figure 5.6, is taken 6 minutes and


Retinex Execution Time Table

                                                 Bands
Scales     Processor             1                2                3
1 scale    DM642/133 MHz EMIF    17.89 (55.9)     35.54 (28.1)     52.77 (18.9)
           DM642/200 MHz EMIF    14.46 (69.1)     28.39 (35.2)     41.84 (23.9)
           C6416                  9.24 (108.2)    17.5 (57.1)      25.66 (38.9)
2 scales   DM642/133 MHz EMIF    25.54 (39.1)     50.55 (19.8)     75.04 (13.3)
           DM642/200 MHz EMIF    19.98 (50.1)     39.74 (25.2)     58.96 (16.9)
           C6416                 12.68 (78.9)     25.06 (38.9)     36.83 (27.1)
3 scales   DM642/133 MHz EMIF    33.11 (30.2)     66.32 (15.1)     98.25 (10.2)
           DM642/200 MHz EMIF    25.79 (38.8)     51.85 (19.3)     76.86 (13.0)
           C6416                 17.03 (58.7)     33.11 (30.2)     49.37 (20.3)

Table 5.7: Measured Retinex performance on DM642 and C6416 processors. The 133 and 200 refer
to the clock speed of the EMIF bus. Measurement units are in both milliseconds, and frames per
second in parentheses.

Function          C6711 Time (ms)      C6416 Time (ms)
retinex           48.33 (20.7 fps)     9.24 (108.23 fps)
fwdprocessrows    12.83                1.3
fftrows            6.98                516 µs
logorig            3.20                141 µs
processcols       20.55                6.43
multkernel         0.16                2.18
invprocessrows    14.92                1.49
rtxeq              5.59                571 µs

Table 5.8: Comparison of final SSMR performance using the C6711 and the C6416 DSPs.

28 seconds into the video. The third snapshot, shown in Figure 5.7, is taken 14 minutes
and 28 seconds into the video. The unprocessed video frames are on the left. They show
the scene as captured by the video camera. The progressive darkening of these images is
due to the sunset. The real-time Retinex enhanced frames using the C6416 EVM are on
the right. The processed frame in the first snapshot shows a moderate enhancement over
the unprocessed scene. Note the non-linear dynamic range compression performed by the
enhancement. The very dark areas are enhanced without severe blooming around the bright


Spectral Bands    Scales    C6416 CPU Load
1                 1          31.24%
1                 2          39.97%
1                 3          51.22%
2                 1          52.37%
2                 2          71.48%
2                 3         100.00%
3                 1          73.77%
3                 2         100.00%
3                 3         100.00%

Table 5.9: C6416 CPU Loading for different Retinex configurations.

car lights. In the unprocessed frame of the second snapshot, the colors are nearly completely
indeterminable and objects are becoming difficult to distinguish. The processed frame of
the second snapshot retains most of the contrast and brightness of the first processed frame.
Colors are still clearly perceptible and objects are still defined. For example, the vehicle
that is nearly unseen in the unprocessed image is clearly seen in the processed image. The
unprocessed frame of the third snapshot is almost completely dark. The processed frame
of the third snapshot is nearly reaching the noise limit of the camera, but still provides
significant information about the scene. Objects such as the wind tunnel spheres, which are
not discernible in the unprocessed frame, are clearly perceived in the processed frame.
The aim of our research was to achieve real-time multi-scale, multi-spectral Retinex
image enhancement. We started by developing, implementing and analyzing several
algorithm optimizations, and using the C6711 DSP we achieved 20.7 fps performance of the
single-scale, monochrome Retinex. Building upon this effort, we continued to optimize and
refine the algorithm and configuration of the architecture, and using the C6416 DSP we
were able to achieve 20.3 fps performance of the multi-scale, multi-spectral Retinex.


Figure 5.5: First snapshot taken 40 seconds into the video recorded at NASA LaRC. The frame as
captured by the camera is on the left and the real-time Retinex processed frame is on the right.

Figure 5.6: Second snapshot taken 6 minutes and 28 seconds into the video. Colors are nearly
completely indeterminable and objects are difficult to distinguish in the unprocessed image. Colors
and objects are still clear in the processed frame.

Figure 5.7: Third snapshot taken 14 minutes 28 seconds into the video. The only distinguishable
object in the unprocessed frame is the tail-lights on the vehicle. Although noisy, the real-time
Retinex processed image still clearly shows most of the major objects in the first snapshot including
spheres, tree lines, and parked vehicles.

Chapter 6

Enhanced Vision System Case Study

6.1 Background

The real-time Retinex can be used to enable a wide variety of applications. We have chosen
a NASA LaRC developed Enhanced Vision System (EVS) to demonstrate the performance
of the real-time Retinex in an actual system. The EVS is a new aviation safety technology
that is used to provide enhanced images of the flight environment to assist pilots flying in
low visibility conditions such as rain, snow, fog, or haze [98]. During August and September
of 2005, the EVS, and many other new technologies, were demonstrated during flight tests
on the NASA 757 as part of the Follow-On Radar, Enhanced and Synthetic Vision Systems
Integration Technology Evaluation (FORESITE) program.
The EVS contains a long-wave infrared (LWIR), a short-wave infrared (SWIR), and a
visible-band camera, all mounted in an enclosure that is flown beneath a NASA 757 aircraft.


Figure 6.1: The EVS LWIR, SWIR, and visible-band cameras mounted to a baseplate, and the
enclosure shell. Inaccurate bore-sighting can cause image registration problems.

Figure 6.1 shows the cameras mounted to a baseplate and the enclosure shell. Figure 6.2
shows the enclosure installed on the aircraft. Figure 6.3 shows the aircraft during a runway
approach with the simulated shaded area depicting the field of view (FOV) of the cameras.
The LWIR is a Lockheed Sanders LTC500 thermal imager and senses radiation in the
7.5-14 µm band. It can image background scenery, terrain features and obstacles at night and
in other low visibility conditions. The SWIR is a Merlin Near-Infrared (NIR) camera that
senses in the 0.9-1.68 µm region and is optimal for detecting peak radiance from runway
and taxiway lights even in poor visibility conditions. The visible-band camera is a Bowtech
BP-L3C-II CCD that detects the 0.4-0.78 µm band and is used to image runway markings,
the skyline and city lights in good visibility conditions. A frame from each of the three video
streams generated by the cameras in clear weather conditions is shown in Figure 6.4.


Figure 6.2: EVS camera enclosure mounted forward-looking underneath the NASA 757.

6.2 Image Processing Functions

The image processing architecture for the EVS is outlined in the top of Figure 6.5 [26]. The
analog National Television System Committee (NTSC) RS-170 outputs of the SWIR and
LWIR cameras are routed from the EVS camera enclosure (mounted beneath the NASA
757) to the processing board through a video distribution box. The processing board is
situated in a pallet within the NASA 757 approximately 120 feet away from the EVS
camera enclosure. Similarly, the digital RS422 outputs of the cameras are transferred to
the processing board using optical fibers. We do not use these outputs, but for future
implementations they may have a better signal-to-noise ratio than the analog outputs.
The functions performed by the processing components are shown in the bottom of
Figure 6.5. The multi-spectral data streams from the EVS cameras must be resized, enhanced,
registered, and fused into a single image stream. The images are resized into dimensions that
are a power-of-two to fit the input requirement for the FFT (see Section 5.1.1). Methods


Figure 6.3: The EVS acquires data during the entire flight but take-off and landing phases are
critical. The simulated shaded area depicts the field of view (FOV) of the cameras.

for resizing are discussed in Sections 6.3 and 6.4. Enhancement is performed to improve the
information content of the images, particularly in poor visibility conditions. For
enhancement we use the real-time Retinex. The Retinex provides an ideal solution for enhancing
EVS imagery because of its superb performance in improving low-contrast, dimly-lit images.
Registration is used to remove field of view (FOV) and spatial resolution differences
between the cameras, and to correct bore-sighting inaccuracies [23]. Table 6.1 gives
characteristics of the sensors that are relevant to registration. Registration is performed by first
manually selecting a set of control points based on corresponding features in a LWIR and
SWIR frame acquired at the same time. The control points are analyzed using multiple
linear regression to approximate the coefficients of an affine transform which is applied to
the LWIR image. The transformed image is then resampled using bilinear interpolation to
align the registered LWIR image data to the same grid as the reference SWIR image. The
same transform can then be used on all other LWIR frames since the optical parameters

[Panels, left to right: LWIR, SWIR, VIS]

Figure 6.4: Examples of the imagery generated by each camera in good weather conditions. The
images from cameras must be registered, enhanced, fused and displayed to the pilot in real-time.

and the camera alignment are assumed to remain constant during flight. Appendix A gives
a more detailed discussion of the registration procedure.
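
As an illustration only, the registration steps described above can be sketched in Python. The sketch fits the six affine coefficients to matched control points by least squares (the normal equations of a multiple linear regression) and then resamples with bilinear interpolation. The function names, the choice to fit the mapping from reference (SWIR) coordinates to source (LWIR) coordinates, and the zero fill outside the source image are assumptions made for this example; the procedure actually used is the one given in Appendix A.

```python
def solve3(a, b):
    """Solve a 3x3 linear system a x = b by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_affine(ref_pts, src_pts):
    """Least-squares fit of src = A * [x_ref, y_ref, 1] from matched control points."""
    rows = [(x, y, 1.0) for x, y in ref_pts]
    # Normal equations (X^T X) c = X^T t, solved once per output coordinate.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    coeffs = []
    for k in range(2):
        xty = [sum(r[i] * p[k] for r, p in zip(rows, src_pts)) for i in range(3)]
        coeffs.append(solve3(xtx, xty))
    return coeffs  # [[a, b, c], [d, e, f]]


def bilinear(img, x, y):
    """Bilinearly interpolate img (a list of rows) at real-valued (x, y); 0 outside."""
    h, w = len(img), len(img[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bot * fy


def register(src_img, coeffs, out_w, out_h):
    """Resample src_img onto the reference grid using the fitted affine transform."""
    (a, b, c), (d, e, f) = coeffs
    return [[bilinear(src_img, a * x + b * y + c, d * x + e * y + f)
             for x in range(out_w)] for y in range(out_h)]
```

Because the coefficients are fitted once and reused, only `register` would need to run per frame, which matches the assumption that the optics and alignment stay fixed during flight.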

                               SWIR             LWIR             CCD
Image Dimensions (pixels)      320H x 240V      320H x 240V      542H x 497V
Optics FOV                     U° H x 25° V     39° H x 29° V    U° H x 25° V
Detector Readout Frame Rate    60 Hz            60 Hz            30 Hz (interlaced)

Table 6.1: Sensor Specifications

The two enhanced and registered video streams from the SWIR and LWIR cameras
are then fused into a composite video stream that contains more information than either
input spectral band. This also provides the additional benefit of producing a single output
to observe instead of multiple images from multiple video sources. The Retinex could be
used as a fusion engine for this application since the algorithm performs nearly symmetric
processing on multi-spectral data. Multiple camera inputs could be distributed onto these
multi-spectral processing chains and fused using the weighting and summation properties
of the Retinex [57]. However, for EVS processing the image streams are fused by effectively

[Block diagram. Architecture labels: EVS Cameras (SWIR, LWIR), Analog NTSC (RS-170), RS422
LyteLink, Video Dist. Box, DSP Board, Display. Function labels: Input Frame from SWIR Camera,
Input Frame from LWIR Camera, Resize, Enhance SWIR, Enhance LWIR, Register LWIR to SWIR,
Fuse, Output Frame.]

Figure 6.5: Image processing architecture and functions of the EVS. Analog NTSC camera outputs
are currently processed. The SWIR data is used as the baseline for registration since it has the
smallest field of view.

performing a weighted sum of the two processed outputs since a different Retinex is applied
to each channel. Pixels are summed on an inter-frame basis. Other methods such as
interleaving frames or fields cause severe flicker. The fused data stream is output as a
standard composite NTSC signal into a display.
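
A minimal sketch of such a pixel-wise weighted sum is given below, assuming 8-bit frames of equal size stored as lists of rows. The function name `fuse`, the default weight of 0.5, and the clamping to the 0-255 range are assumptions for illustration; the weights used in the flight system are not stated here.

```python
def fuse(frame_a, frame_b, weight_a=0.5):
    """Pixel-wise weighted sum of two equal-sized 8-bit frames (lists of rows)."""
    if len(frame_a) != len(frame_b) or len(frame_a[0]) != len(frame_b[0]):
        raise ValueError("frames must have identical dimensions")
    weight_b = 1.0 - weight_a  # weights are assumed to sum to one
    fused = []
    for row_a, row_b in zip(frame_a, frame_b):
        fused.append([
            # Round to the nearest integer and clamp to the 8-bit range.
            min(255, max(0, int(round(weight_a * pa + weight_b * pb))))
            for pa, pb in zip(row_a, row_b)
        ])
    return fused
```

Both inputs must already be registered to the same grid; otherwise the sum blends unrelated scene content.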

6.3 Additional Requirements

Several other EVS parameters complete our baseline requirements and constraints for
real-time Retinex processing. First, our initial performance goal is to achieve a display rate of 15
fps, instead of the de facto standard of 30 fps for real-time video. We can use this reduced
rate because our final processed output will be sent to a pilot’s display and several human
factor studies have shown that an update rate of 15 fps is more than sufficient to avoid flicker,


accurately portray motion [100], and not cause pilot induced oscillations (PIOs) [34, 40] of
the aircraft. A PIO can occur when a pilot views and reacts to an instrument or display
that is updated too slowly (< 12 fps) [2, 44].
Second, the cockpit displays are low-resolution (320 x 240). This significantly reduces
the amount of image data that must be processed. Fitting the closest power-of-two input
requirement for the FFT to this frame size dictates that we process a 256 x 256 portion
of each frame. Only 20 percent of the horizontal component of the image is lost and the
vertical component is zero-padded to fill 256 pixels.
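
The cropping and padding arithmetic can be illustrated with a short Python sketch. Cropping 320 columns to 256 discards (320 - 256) / 320 = 20 percent of each row, and 16 rows of zeros extend 240 rows to 256. Which columns are kept and where the padding rows are placed is not specified above, so the centered crop and bottom padding below, like the function name, are assumptions.

```python
def fit_to_fft_size(frame, size=256):
    """Crop columns and zero-pad rows so the result is size x size."""
    width = len(frame[0])
    x_off = max(0, (width - size) // 2)                      # centered crop (assumed)
    rows = [row[x_off:x_off + size] for row in frame[:size]]
    rows = [row + [0] * (size - len(row)) for row in rows]   # pad any narrow rows
    rows += [[0] * size for _ in range(size - len(rows))]    # pad rows at the bottom
    return rows
```

For a 320 x 240 input, `fit_to_fft_size` keeps columns 32 through 287 and appends 16 zero rows.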
Third, only the SWIR and LWIR cameras are targeted for processing by the current
EVS sponsors. The visible band camera is only used to provide context. Use of the visible
band data in conjunction with the infrared cameras to improve the information provided
to a pilot is an open research topic. For now, processing only the monochrome SWIR and
LWIR cameras reduces the number of bands that have to be processed from 5 (1 each for
the LWIR and SWIR, and 3 for the visible-band camera) to 2. Only processing the SWIR
and LWIR cameras also enables the use of the SSMR version of the Retinex since it provides
good enhancement of single-band infrared imagery with the additional benefit of minimizing
computational requirements.
Several environmental parameters are defined for the EVS. The space allocated is
approximately 17 inches wide by 8 inches deep by 3 inches high. This is enough space to hold a
standard PCI board, thus allowing a board-level (vs. chip-level) solution, but eliminates
multiple board or cluster solutions. The operational temperature range falls within the
standard commercial temperature range of 0 to 70 degrees C. The maximum power
allocated for image processing is approximately 5 watts with a standard input voltage of 5

Figure 6.6: DM642 EVM, signal splitter boards, and power supply in flight box.

Figure 6.7: Flight box in flight pallet on NASA 757.

volts and current limited to 1 amp. If other input voltages are required, DC/DC converters
can be used within the space allocated. Waivers for additional power can also be requested
since the NASA 757 has many power resources. However, general aviation aircraft have
significantly fewer resources and it is beneficial to limit our resource allocation for potential
use in these environments also.
Each EVM discussed in Section 4.1 easily fits within the physical constraints of the
EVS; however, only the DM642 EVM has two video inputs to accept the two infrared
camera outputs. The DM642 EVM was flight hardened¹ and the board was encased in a
rack-mountable box with interfaces and switches extended to the front and rear panels. A
power supply and signal break-out cards are also enclosed in the box. Figure 6.6 shows the
DM642 EVM and other devices in the flight box and Figure 6.7 shows the box in the flight
pallet on the NASA 757.
A new method to update parameters was developed for the DM642 EVM because a host
PC with a JTAG emulator was not available for continuous use during flight test to perform
RTDX based updates. Instead of using the JTAG port, our new method uses the Ethernet
¹ Flight-hardening means that components are secured to prevent being shaken loose during flight.


port on the DM642 for communication with an external PC, thus eliminating the need for
a JTAG emulator. A new task was written to process messages received via Ethernet using
the mailbox module in DSP/BIOS. The mailbox module provides a set of functions that
are used to pass synchronized messages from one task to another on the same processor. In
this case, the parameter update messages are passed from the Ethernet task to the main
frame processing task discussed in Section 4.6.
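
The mailbox pattern can be sketched in Python, with a bounded `queue.Queue` standing in for the DSP/BIOS mailbox. This is an analogy, not the DSP/BIOS API: the task functions, the parameter names `gain` and `offset`, the `None` sentinel, and the once-per-frame non-blocking poll are all assumptions made for illustration.

```python
import queue
import threading


def ethernet_task(mailbox, messages):
    """Stand-in for the Ethernet task: posts each parameter update to the mailbox."""
    for msg in messages:
        mailbox.put(msg)      # blocks if the mailbox is full
    mailbox.put(None)         # sentinel: no more updates (illustrative only)


def frame_task(mailbox, params, n_frames):
    """Stand-in for the frame-processing task: polls the mailbox once per frame."""
    history = []
    for _ in range(n_frames):
        try:
            msg = mailbox.get_nowait()   # never stall the frame loop
            if msg is not None:
                params.update(msg)
        except queue.Empty:
            pass
        history.append(dict(params))     # "process" the frame with current params
    return history
```

Polling without blocking keeps frame processing independent of network traffic, so a missing or late update only delays the parameter change and never a frame.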

6.4 Results

The EVS was tested during FORESITE flight demonstrations in August and September of
2005. All flights were performed in good weather and although this was not ideal for testing
the performance of the EVS, this still enabled a thorough evaluation of the functionality
of the EVS components including the real-time Retinex. As mentioned in Section 6.2 we
have to individually resize and enhance the monochrome output images of the SWIR and
LWIR cameras, register the LWIR to the SWIR, and then fuse the two channels together.
Since both cameras are flown upside-down underneath the NASA 757, the images must
be rotated 180° for normal viewing. This is usually performed using embedded routines
in the cameras but unfortunately, the camera integrators were unable to rotate and place
the corresponding gamma look-up tables in ROM for the LWIR camera. We decided to
perform the rotation of the LWIR image within our image processing routines on the DSP.
We modified our Retinex routine to read in the LWIR image data starting at the end of
the image data and proceeding to the first pixel. This causes a 180° rotation of the image.
Our sequence of tasks is as follows:


• resize the LWIR input image to 256 x 256 pixels,
• rotate and Retinex the LWIR image,
• resize the SWIR input image to 256 x 256 pixels,
• Retinex the SWIR image,
• register the enhanced LWIR image to the enhanced SWIR image,
• interpolate the LWIR image to the SWIR grid,
• fuse and output the final processed image.

Higher quality imagery is achieved by enhancing the LWIR image before performing
registration, instead of registering first, since registration may eliminate part of the original
image when it is transformed.
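
The 180° rotation of the LWIR image follows directly from reading a row-major buffer from its last pixel to its first, since that reverses both the row order and the pixel order within each row. A minimal Python sketch, with a hypothetical function name and a flat list standing in for the image buffer:

```python
def rotate_180(pixels, width, height):
    """Rotate a row-major image buffer by 180 degrees by reading it back to front."""
    if len(pixels) != width * height:
        raise ValueError("buffer size does not match the image dimensions")
    last = width * height - 1
    return [pixels[last - i] for i in range(width * height)]
```

For a 3 x 2 image with rows `[1, 2, 3]` and `[4, 5, 6]`, the result has rows `[6, 5, 4]` and `[3, 2, 1]`. No extra pass over the data is needed when the reversed read is folded into a routine that already touches every pixel.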
Our algorithm performed the above sequence of tasks on the DM642 at 33.89 fps. Sample
input frames² from the SWIR and LWIR cameras are shown in Figure 6.8 and Figure 6.9,
respectively. The LWIR input is actually received from the LWIR camera rotated 180°
(upside down), but is shown right-side up for viewing purposes. The same SWIR frame after
SSMR enhancement is shown in Figure 6.10. It is easy to see the improved contrast and
brightness in the image. Similarly, a frame of the enhanced and registered LWIR channel is
shown in Figure 6.11. Registration can be seen by noting the large vertical shift downward
at the top of the image. Both of the SWIR and LWIR enhanced frames shown are captured
as intermediate results for demonstration purposes and are not the final output product of our
² We wrote a small utility to send image data from the DSP to the host to capture frames at various
stages of processing.

Figure 6.8: A frame from the EVS SWIR camera before processing. The faint vertical lines were
part of the input image and probably caused by subsampling in the video distribution system.

processing. The final fused output is shown in Figure 6.12. This image has significantly
better contrast, brightness, and sharpness than any of the original inputs, and provides a
single enhanced output for the pilot to view. Enhancement and registration parameters
were determined empirically.
Our fused output image is actually a 512 x 512 image, but we are only processing 256
x 256 pixels per image. The CCD arrays for both imagers are approximately 320 x 240
pixels, but the NTSC composite inputs received are upsampled to 640 x 480 through pixel
replication (horizontally) and line duplication (vertically). We used this information and
modified our core Retinex routine to generate a 512 x 512 image by 2:1 subsampling the
horizontal and vertical components of our input images. This process retains the majority
of the original resolution of the cameras.


Figure 6.9: A frame from the EVS LWIR camera before processing. The LWIR camera output is
actually rotated 180° from what is shown.

An additional interesting addendum to this process was the requirement to store the
algorithm in non-volatile flash memory so that the algorithm would automatically execute
at system power-up. As discussed earlier, an Ethernet client was added to the code to
facilitate communication with a host to update Retinex parameters. This expanded the
size of the executable beyond the flash page boundary so we developed a new multi-page
bootloader algorithm to implement this feature. Development of this algorithm is discussed
in Appendix C. This information will be used in a new TI application report on bootloaders
for their C6X processors.


Figure 6.10: SWIR frame after enhancement.

Figure 6.11: LWIR frame after enhancement and registration to the SWIR image.

Figure 6.12: Enhanced, registered and fused output image.

Chapter 7

Future Research
As with most research topics, there is always the question of “how can we make it better?”
For our real-time Retinex, improvements can be categorized as (1) increasing the
performance of the algorithm on DSPs to provide 30 fps MSR performance, (2) processing
larger format images, and (3) migrating to a multi-processor environment. It would also
be beneficial to integrate additions that augment the Retinex, such as color restoration or
white balance techniques, into our real-time version of the algorithm, but the three primary
areas listed above should be solved first.

7.1 Luma-only Retinex

Before addressing these issues we briefly digress to discuss a method that can immediately
provide a near Retinex quality enhancement at full 30 fps performance for certain
applications. This alternative version of the Retinex is called the luma-only Retinex (LOR). In
the LOR algorithm, only the luma, Y', component of an image is processed. The chroma
components are left unchanged and passed directly from input to output. The enhancement

quality of this algorithm is very good because the majority of spatial detail is contained in
the luma component of an image.
Processing only the luma component eliminates all of the Y'CbCr to RGB input and
output conversions. As discussed in Section 4, the DM642 EVM stores image data in planar
form and the C6416 EVM stores image data in interleaved form. Thus for the DM642 only a pointer
to the Y' component is required to access the input and to generate the output. The Y'
must still be extracted from the input image data and embedded in the output data on
the C6416, but this is performed very efficiently. Since only the Y' component is processed
using three scales, the performance is analogous to that shown for the DM642 and C6416
for 1 band and 3 scales: 38 fps for the DM642 at 200 MHz and 58 fps on the C6416.
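
The difference between the planar and interleaved cases can be illustrated with a small Python sketch. The function name, the `enhance` callback, and the default layout (a luma sample at every second position, starting at index 1) are assumptions chosen for illustration; the actual sample ordering on the C6416 EVM is not specified here.

```python
def luma_only(buf, enhance, offset=1, stride=2):
    """Apply `enhance` to the luma samples of a buffer, leaving chroma untouched."""
    out = list(buf)                       # chroma passes straight through
    luma = buf[offset::stride]            # gather the Y' samples
    out[offset::stride] = enhance(luma)   # scatter the enhanced Y' back in place
    return out
```

With interleaved data the gather and scatter steps are required, whereas with planar data the Y' plane is already contiguous and `luma_only(y_plane, enhance, offset=0, stride=1)` reduces to calling `enhance` on the plane itself.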

7.2 Improving Current Performance

To improve our full, real-time Retinex to meet 30 fps performance on DSPs would require
moderate speed-ups in processor performance (by ~33 percent) and either a similar
speed-up in EMIF bus rates and external memory access times or an L2 memory large enough to
remove at least some of the DMA requirements. Our final algorithm execution time is driven
by processor cycles, not I/O bandwidth. However, as we showed in migrating our code
from the C6711 to the C6713, when processor clock speed is increased, the I/O bandwidth
needs to improve also or it will become the bottleneck on performance. Having a larger
L2 memory and placing larger segments of data there implicitly improves I/O bandwidth
because it removes the requirement to transfer that particular data. We demonstrated this
by keeping the logarithm of the input image in the L2 memory on the C6416.


Extending this concept, to store the FFT of a color input image requires 1.5 MBytes
of memory with 512 KBytes per band. Retaining our current technique of storing the
image logarithm data locally implies that we would need an L2 memory size of between
2 and 3 MBytes, well within the reach of next generation DSPs. Storing the FFT in L2
memory would eliminate the transfer of the FFT of each row to external memory, and the
2-dimensional read of external memory to form the column data. Just looking at the rows
and ignoring function call overhead, theoretically to DMA a row requires 1.92 microseconds
through a 64-bit wide EMIF bus clocked at 133 MHz. To DMA 256 rows requires 4.93
ms. Measured 2-D transfer times were on the order of twice the row transfer time or ~10
ms. Placing the FFT data in L2 memory would not directly achieve a 15 ms increase in
performance because these transfers are currently performed in the background. However, it
does mean that any additional processor improvements would then be immediately effective;
thus with a commensurate increase in processor performance, 30 fps MSR would be easily
achievable. An alternative idea is to attempt to store all of the convolved data, but this
would require 512 KBytes per scale per band, equating to 4.5 MBytes for the MSR. Local
DSP memories on this order are probably years away.

7.3 Processing Larger Format Images

Processing larger format images exacerbates the issues addressed above. First, significantly
more processing cycles are required. Using the calculation of the FFT as an example,
Table 7.1 shows the number of cycles and the associated processing time for different FFT
sizes executing on the C6711, DM642, and the C6416 processors. The benchmark equation


for the out-of-place cache-optimized mixed radix FFT executing on the C6711 originally
given in Section 5.1.9 is repeated here as

\[ C = 3\,\lceil \log_4(n) - 1 \rceil\, n + 21\,\lceil \log_4(n) - 1 \rceil + 2n + 44 \tag{7.1} \]

and the benchmark equation for an extended-precision, mixed radix 16 x 32 FFT with
rounding and digit reversal executing on the DM642 and C6416 is repeated here as

\[ C = (13n/8 + 24)\,\lceil \log_4(n) - 1 \rceil + 1.5\,(n + 8) + 27. \tag{7.2} \]

The processing times for FFTs ranging in size from 256 to 2048 are shown in Table 7.1.
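
As a check on the two benchmark equations, the following Python sketch evaluates them for the FFT sizes in Table 7.1. It reads the ceiling term in both equations as ⌈log4(n) − 1⌉, which is the reading that reproduces the tabulated cycle counts; the function names are illustrative.

```python
import math


def stages(n):
    """ceil(log4(n) - 1) for a power-of-two n, using an exact integer log2."""
    log2n = n.bit_length() - 1            # exact for powers of two
    return math.ceil(log2n / 2 - 1)


def cycles_c6711(n):
    """Cycle count from Equation 7.1 (C6711)."""
    s = stages(n)
    return 3 * s * n + 21 * s + 2 * n + 44


def cycles_c64x(n):
    """Cycle count from Equation 7.2 (DM642 and C6416)."""
    s = stages(n)
    return int((13 * n / 8 + 24) * s + (n + 8) * 1.5 + 27)


def microseconds(cycles, clock_hz):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1e6
```

For n = 256 this gives 2923 cycles (19.49 µs at 150 MHz) on the C6711 and 1743 cycles (2.42 µs at 720 MHz) on the DM642, in agreement with the table.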

FFT Benchmarks

            C6711 @ 150 MHz      DM642 @ 720 MHz      C6416 @ 1 GHz
FFT Size    cycles    µs         cycles    µs         cycles    µs
256          2923     19.49       1743      2.42       1743      1.74
512          7296     48.64       4231      5.88       4231      4.23
1024        14464     96.43       8327     11.56       8327      8.33
2048        34965    233.10      19871     27.60      19871     19.88

Table 7.1: FFT Benchmarks for C6711, DM642 and C6416.

Referencing the information in Table 7.1, Table 7.2 gives benchmark FFT performance
values for various input image sizes. As can be seen from this data, to perform a 512 x
256 FFT on the C6711 takes nearly the full 33.33 ms time allotted to process a frame at 30
fps. Initially looking at the data, the DM642 and C6416 perform significantly better and
seem to be potential solutions for 512 x 512 sized images. However, this is only the forward
FFT of the input image. Subsequently three inverse FFTs (one for each scale) must be
performed for each band, each taking the same time as the forward FFT. This drives the
FFT processing time for a 512 x 512 image to 60.02 ms for the DM642 and 43.33 ms for the
C6416, exceeding the 33.33 ms boundary. Again, this is just the time to process FFTs; other
computations must be included to perform Retinex enhancement.

FFT Image Benchmarks

                   C6711 @ 150 MHz    DM642 @ 720 MHz    C6416 @ 1 GHz
x-dim    y-dim     ms                 ms                 ms
256      256         9.98               1.23               0.89
512      256        29.89               3.63               2.61
512      512        49.81               6.02               4.33
1024     512       123.64              14.85              10.7
1024     1024      197.48              23.67              17.06
2048     1024      576.13              68.36              49.24
2048     2048      954.78             113.04              81.43

Table 7.2: FFT Processing Time Benchmarks using C6711, DM642, and C6416 for various sized images.
The second issue is that the images are considerably larger, thus requiring more
memory for storage and more bandwidth for transfers. Table 7.3 shows typical FFT storage
requirements and 64-bit, 133 MHz EMIF transfer times for various sized images. As shown,
the FFT of a 512 x 512 image requires 2 MBytes for storage, eliminating any possibility of
keeping this data in current DSP L2 memory.
FFT Image Size     Memory Requirement    EMIF Transfer Time
x-dim    y-dim     MBytes                ms
256      256        0.5                    4.93
512      256        1                      9.86
512      512        2                     19.72
1024     512        4                     39.44
1024     1024       8                     78.88
2048     1024      16                    157.76
2048     2048      32                    315.52

Table 7.3: FFT storage requirements and transfer times (based on row oriented data) for various
sized images. Storage is based on complex image data stored as integers. Transfer times are based
on a 64-bit EMIF bus clocked at 133 MHz.

Incremental increases in performance could also be achieved by modifying the FFT. Since
our image signal is real-valued, we could use the imaginary part of the FFT input and exploit
the symmetry of the frequency spectrum to compute either a 2N-point sequence using an
N-point FFT or two N-point FFTs simultaneously [7, 75, 63]. This technique can
perform the FFTs approximately 30-40% faster [63] than the conventional method, but the overhead
associated with interlacing the input and unscrambling the output reduces the effectiveness
of this method. The FFT routine currently used could be rewritten to take advantage of
alternative fast bit-reversal techniques such as those introduced by Zhang [104]. Pitas and
Strintzis [52] discuss an interesting method to build up the column transform in steps while
selectively processing rows to reduce the I/O operations between hard disk and internal
memory. Although hard disk access is several orders of magnitude slower than external-to-internal
memory transfers, an adaptation of this method could be used for external-to-L2
memory transfers.
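As an illustration of the "two N-point FFTs simultaneously" idea, the following is a minimal sketch in Python, not the DSP routine used in this work. It packs two real sequences into the real and imaginary parts of one complex input (the interlacing step) and separates the two spectra afterwards using conjugate symmetry (the unscrambling step). The function names `fft` and `two_real_ffts` are hypothetical, and the small recursive radix-2 FFT is included only to keep the example self-contained.

```python
import cmath

def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2])
    odd = fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def two_real_ffts(x, y):
    """Return (X, Y), the DFTs of real sequences x and y, using one complex FFT."""
    n = len(x)
    # Interlace: x goes into the real part, y into the imaginary part.
    z = [complex(xr, yr) for xr, yr in zip(x, y)]
    Z = fft(z)
    X, Y = [], []
    for k in range(n):
        Zc = Z[(n - k) % n].conjugate()   # conj(Z[N - k])
        X.append(0.5 * (Z[k] + Zc))       # unscramble the spectrum of x
        Y.append(-0.5j * (Z[k] - Zc))     # unscramble the spectrum of y
    return X, Y
```

The unscrambling loop is the overhead mentioned above: it adds O(N) work per pair of rows, which is why the net gain is smaller than the factor of two that halving the number of FFTs might suggest.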
There is no fundamental reason why we have to use the row-column method to decompose
the 2-D DFT. We could possibly reduce the number of arithmetic operations performed
by using other algorithms such as a vector-radix fast Fourier algorithm [22], a polynomial
transform FFT [48], or a fast 2-D Hartley transform [5]. Other techniques, such as using fast
matrix transposition methods to reduce the number of I/O operations [16, 15], could also be
explored. While all of these methods are worthwhile, revolutionary increases in performance
will probably only be achieved by using alternative processing platforms.


7.4 Migrating to a Multiprocessor Environment

Another strategy is to map the MSR algorithm onto a multiprocessor system [35, 6] and
take advantage of the parallelism of the algorithm. Most multiprocessor systems, in general,
exceed our initial constraint of performing the MSR on a small, embeddable, low-power
system. However, as newer technologies emerge this may become a viable alternative. Even
today, several relatively small multiprocessor boards are available from vendors such as
MangoDSP, Sundance or Vitecmm.
A system that completely distributes the primary tasks of the MSR could resemble the
design in Figure 7.1. The first-level task splits the input image into its RGB
spectral components. The next two levels perform forward row and column transforms,
respectively. The output of this level is fed into three other tasks, each performing convolution
of the now spatial frequency domain image data with the associated kernel. The next
two levels perform inverse FFTs of the columns and rows, respectively, for each convolved
output. The next level combines the data for each scale, computes the log and subtracts
this from the log of the original image. The final task combines the processed data from
each band. Each task could be mapped to an individual processor or assigned to a pool
of processors. Similar to our EMIF bus bandwidth issues, interprocessor communication
and data sharing will need to be carefully balanced. The processors used to perform these
tasks could be DSPs, FPGAs (see Appendix B), or a mixture of both. In a heterogeneous
system, FPGAs could perform pre- and post-processing tasks, while DSPs perform the core
FFTs and convolutions. In this dissertation we have established a core set of techniques
that could readily be used to implement the Retinex in this multiprocessing environment.
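The task decomposition described above can be written down as a dependency graph. The sketch below is illustrative only: the task names, the choice of three scales per band, and the helper functions `build_msr_task_graph` and `levels` are assumptions made for this example, not part of any implementation described in this dissertation. Tasks that fall in the same level have no dependencies on one another and are therefore candidates for separate processors.

```python
from collections import defaultdict

BANDS = ("red", "grn", "blu")
SCALES = ("S1", "S2", "S3")

def build_msr_task_graph():
    """Return {task: [tasks it depends on]} for the data flow described in the text."""
    deps = {"decompose": []}
    for b in BANDS:
        deps[f"fft_rows/{b}"] = ["decompose"]
        deps[f"fft_cols/{b}"] = [f"fft_rows/{b}"]
        for s in SCALES:
            deps[f"conv/{b}/{s}"] = [f"fft_cols/{b}"]
            deps[f"ifft_cols/{b}/{s}"] = [f"conv/{b}/{s}"]
            deps[f"ifft_rows/{b}/{s}"] = [f"ifft_cols/{b}/{s}"]
        # Combine the scales, take the log, subtract from the log of the original band.
        deps[f"combine/{b}"] = [f"ifft_rows/{b}/{s}" for s in SCALES] + ["decompose"]
    deps["combine_bands"] = [f"combine/{b}" for b in BANDS]
    return deps

def levels(deps):
    """Group tasks by pipeline level (longest dependency chain back to the input)."""
    depth = {}

    def d(task):
        if task not in depth:
            depth[task] = 1 + max((d(p) for p in deps[task]), default=-1)
        return depth[task]

    by_level = defaultdict(list)
    for task in deps:
        by_level[d(task)].append(task)
    return [by_level[i] for i in range(len(by_level))]
```

Under these assumptions the graph has 38 tasks in 8 levels, with the widest levels (nine tasks each) at the convolution and inverse-FFT stages, which is where interprocessor data sharing would be heaviest.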


Figure 7.1: Data flow diagram of MSR tasks. (Diagram stages: Decompose Input to RGB; then, for each band: Log rows; FFT rows; FFT cols; Conv S1, S2, S3; IFFT cols; IFFT rows; Combine Red-Band, Grn-Band, Blu-Band; and finally Combine Bands.)

Ultimately, it would be beneficial to develop an embeddable single chip-level implementation
of the processing components of the algorithm. We would start by using the
techniques we developed to place the MSR tasks described above into one or more FPGAs.
Commercial tools are available from companies such as Celoxica, Accelchip, and Catalytic
that automatically convert C code developed for DSPs into VHDL, the current language of
choice for FPGAs, and multi-FPGA boards are available from companies such as Sundance
and Nallatech. Implementation in an FPGA would enable the full customization of our
design and a direct migration path to an ASIC.


Chapter 8

Conclusions
In the last few years the multi-spectral, multi-scale Retinex has provided outstanding image
enhancement of still imagery for numerous users. Literally thousands of images have been
processed. The first versions of the multi-scale, color Retinex, coded on a Windows NT
200 MHz Pentium Pro PC and processing a 512 x 512 image, executed in ~45 seconds,
more than three orders of magnitude slower than required for real-time performance.
Current PC implementations of the Retinex for a 512 x 512 image execute in ~3 seconds,
still two orders of magnitude too slow to be considered for real-time applications. It was
my thesis that a real-time, 15 fps multi-scale, multi-spectral Retinex could be achieved on a
single-processor embedded system through proper algorithm and architecture optimization.
The summation of this dissertation is that we have successfully achieved this goal.
Throughout this research a series of optimizations were developed, investigated, and
implemented on progressively faster DSPs, each with more capability. These techniques
were discussed in Chapter Five. We began by focusing on the single-scale monochromatic
Retinex targeting the floating-point C6711 and C6713 DSPs and achieved 20.7 fps and 28
fps performance, respectively. Although the C671x platforms did not allow us to obtain
real-time MSR performance, the core algorithm structures and techniques developed for
them, such as merging algorithm components to reduce I/O and performing effective DMA
routines, were used repeatedly in subsequent implementations. We then changed our hardware
target to the fixed-point DM642 DSP. After modifying our single-scale Retinex design into
a fixed-point implementation and adding additional optimization techniques, we obtained 69
fps performance on this platform.
Using the knowledge gained from our previous experiences, we focused our research
on the more computationally intensive multi-scale Retinex, while continuing to target the
DM642 DSP and adding the more powerful C6416. We again developed and implemented
additional optimizations in our core algorithm, focusing on constructs specific to multi-scale,
multi-band processing and taking advantage of the additional resources within the
processors. These include restructuring the mathematics of the algorithm to exploit
the pre-computation of additional parameters and modifying our buffering scheme to
keep DMA processes from driving the algorithm computation time. Our best performance
on the most computationally intensive version of the Retinex (the MSR) was 20.25 fps using
the C6416 platform. This exceeded our baseline target of 15 fps, but further
exploration is still required to reach 30 fps.
We applied our real-time algorithm in actual flight hardware during demonstrations at
NASA LaRC, enhancing, registering, and fusing two infrared video camera outputs. This
was a significant achievement; however, the accomplishments of this research extend beyond
this one application. It provides a new tool for image enhancement to a broad range of
users and will provide the basis for further academic research. Future implementations can
use the core techniques we have developed and demonstrated and will hopefully achieve
even better performance through the use of multi-processor systems, FPGAs, or ASICs.


Appendix A

Multi-Image Registration
Coupling infrared sensors with visible band sensors (for frame of reference or for additional
spectral information) and properly processing the multiple information streams has the
potential to provide valuable information in night and/or poor visibility conditions. In
Chapter 6, we discussed an EVS that is being developed to test this concept. A set of
images consisting of an image from each of the cameras of the EVS taken during one time-aligned
frame is fused into a single image that contains more information than any individual
spectral band. This process is then repeated for all the image frames making up a video
sequence. To properly perform fusion it is critical to ensure that the information from each
sensor refers to the same features in the environment [8, 43]. The different sensors of the
EVS have different acquisition lattices and optics; therefore they capture information in
data structures that are substantially different from each other. Thus, the images must
first be registered before any fusion is performed. Several authors have addressed image
registration problems with innovative, but often complex, general solutions [42, 60, 41]. In
this appendix, we describe two straightforward solutions for registering EVS images.


A.1 Background

Image registration is the task of aligning images taken at different times, from different
sensors, or from different viewpoints so that all corresponding points in the images match. A
transform must be defined that relates the points in one image to their corresponding points
in another. This transform depends upon the characteristics of the differences between the
images being registered, and is computed with respect to a reference or baseline image. An
image that is to be matched to the reference is called the sensed, or distorted, image.
More particularly, image registration is defined as a mapping between two or more images
both spatially (geometrically) and with respect to intensity. Expressed mathematically,
we have:

\[ I_2 = g(I_1(f(x_1, x_2))), \tag{A.1} \]

where $I_1$ and $I_2$ are two-dimensional images (indexed by $x_1, x_2$), $f : (x_1, x_2) \rightarrow (x_1', x_2')$
maps the indices of the distorted frame to match those of the reference frame, and $g$ is a
one-dimensional intensity or radiometric transform [9]. We assume that we do not need
to make any radiometric adjustments, so $g = I$, the identity transform. Hence we are
concerned only with the spatial transformation, $f$. In generating a spatial transform for the
EVS, our primary difficulty is the lack of fiducial markers within the images generated by the
EVS sensors. The cameras are, however, assumed to be bore-sighted so they are expected
to have a common center of alignment. The spatial transform should, then, properly align
the images, but should not affect any characteristic differences that should be exposed by
registration.


Spatial transforms may take on different forms depending upon the application. Simple,
common transforms specified by analytic expressions include rigid-body, affine, projective
or perspective, and polynomial [59, 47]. The distortions between the images of the EVS
in general seem constrained to those correctable by affine transforms. They also appear
to be characterizable by a global (versus local [21]) transform where a single transform
correctly maps all the points on the distorted image to match the corresponding points on
the reference image. An affine transform fulfills the requirements for the needed transform.
An affine transform can perform rotation, translation, scaling and shearing operations.
It offers six degrees of freedom when selecting six unknown coefficients and solving a system
of six linear equations. In general, it can perform triangle-to-triangle mappings. A general
representation of an affine transform is $[y_1, y_2, 1] = [x_1, x_2, 1]\,T$ where

\[ T = \begin{bmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ a_{31} & a_{32} & 1 \end{bmatrix}, \tag{A.2} \]

$x_1$ and $x_2$ reference the input coordinate system, $y_1$ and $y_2$ reference the output coordinate
system, and $a_{ij}$ are transform coefficients [102].
The forward mapping functions are

\[ y_1 = a_{11} x_1 + a_{21} x_2 + a_{31} \tag{A.3} \]

and

\[ y_2 = a_{12} x_1 + a_{22} x_2 + a_{32}. \tag{A.4} \]
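The forward mapping of Equations (A.3) and (A.4) can be sketched in a few lines of Python. The function name `affine_forward` and the coefficient values below are hypothetical and chosen only to make the arithmetic easy to follow; the matrix layout follows the row-vector convention of Equation (A.2).

```python
def affine_forward(T, x1, x2):
    """Map input point (x1, x2) to (y1, y2) using [y1, y2, 1] = [x1, x2, 1] T."""
    (a11, a12, _), (a21, a22, _), (a31, a32, _) = T
    y1 = a11 * x1 + a21 * x2 + a31   # Eq. (A.3)
    y2 = a12 * x1 + a22 * x2 + a32   # Eq. (A.4)
    return y1, y2

# Hypothetical coefficients: scale x1 by 2, leave x2 unchanged, translate by (10, -5).
T = [[2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [10.0, -5.0, 1.0]]
print(affine_forward(T, 3.0, 4.0))  # (16.0, -1.0)
```

Note that the translation terms sit in the third row of T because the coordinates are written as a row vector multiplying T from the left.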

Geometric, image-to-image registration can be summarized in three general steps:
1. Feature identification and matching is performed to establish a correspondence between
features in the distorted image and those in the reference image;

2. A spatial transformation is selected and the transformation coefficients are computed
based upon the feature matching criteria;

3. The distorted image is inverse-mapped using the computed transformation and resampled
to register it with the reference image.

Feature identification and matching are often performed by selecting pixel locations called
control points. Identification of control points can be accomplished in several different
ways [18, 21]. Manual identification of control points is commonly performed. The images
are displayed, normally side-by-side, and corresponding points, usually based on features
such as lines, edges, or contours, are selected from both images.
The spatial transform coefficients that represent the unknown image distortions are determined
from the control points. A minimum of three non-collinear control points are
required to determine the six unknown coefficients of an affine transformation. Wolberg
and Jensen [102, 31] describe several techniques to solve for unknown coefficients, including
pseudo-inverse solutions, least squares with ordinary and orthogonal polynomials, and
weighted least squares with orthogonal polynomials.
Image resampling is the process of transforming a sampled image from one (input pixel
grid) coordinate system to another (output pixel grid), where a sampled image is the digitization
of the spatial coordinates of an image function $f(y_1, y_2)$, a two-dimensional
intensity function [102, 13, 20]. The two coordinate systems are related to each other by
the mapping function of a spatial transformation.
To perform image resampling, the output pixels are first inverse-mapped using the
transformation function to a new grid which (usually) does not correspond to the input
grid. Thus an interpolation (image reconstruction) procedure is used to generate a continuous
surface through the samples of the input image. Then this surface is sampled
(digitized) at the inverse-mapped points to provide the discrete output pixel values of the process. Three
common methods of interpolation are nearest neighbor, bilinear, and parametric cubic
convolution [102, 50].
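The inverse-mapping procedure can be illustrated with a short Python sketch using bilinear interpolation. This is a simplified example, not the code used for the EVS: images are plain lists of rows, the helper names `bilinear` and `resample` are hypothetical, and output pixels that map outside the input image are simply set to a fill value (an assumption for this example).

```python
import math

def bilinear(img, x, y):
    """Bilinearly interpolate img (a list of rows) at real-valued column x, row y."""
    rows, cols = len(img), len(img[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bot = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bot

def resample(img, inverse_map, out_rows, out_cols, fill=0.0):
    """Build the output image by inverse-mapping each output pixel into img."""
    rows, cols = len(img), len(img[0])
    out = [[fill] * out_cols for _ in range(out_rows)]
    for r in range(out_rows):
        for c in range(out_cols):
            x, y = inverse_map(c, r)   # output (col, row) -> input coordinates
            if 0 <= x <= cols - 1 and 0 <= y <= rows - 1:
                out[r][c] = bilinear(img, x, y)
    return out
```

For example, `resample(img, lambda c, r: (c / 2, r / 2), 3, 3)` enlarges a 2 x 2 image to 3 x 3. Working from the output grid back to the input guarantees that every output pixel receives exactly one value, which is the usual reason for preferring inverse over forward mapping.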
Table A.1 shows the relevant manufacturer characteristics of the sensors [98]. The
images from the three sensors need to be registered because of the differences
in these characteristics. The solutions developed to resolve these differences are discussed
in Section A.2.

                               SWIR              LWIR              CCD
Image Dimensions (pixels)      320H x 240V       320H x 240V       542H x 497V
Optics FOV                     34°H x 25°V       39°H x 29°V       34°H x 25°V
Detector Readout Frame Rate    60Hz (typical)    60Hz              30Hz (interlaced)

Table A.1: Sensor Specifications

The characteristics of the actual images obtained for registration differ from the initial
manufacturer specifications because of data acquisition and storage to tape. First, all images
have a nominal image size of 640 x 480 pixels corresponding to the NTSC format of the
recorded images. However, the actual size of the images is quite different after the images
are cropped so that the FOVs match the "visible" part of the images (see Section A.2).
Second, ground test measurements of the cameras' FOVs differed from the manufacturer-provided
values. These updated characteristics are shown in Table A.2, and need to be
included in the computations for proper registration of the data streams.

                               SWIR                LWIR               CCD
Image Dimensions (pixels)      640H x 480V         640H x 480V        640H x 480V
Optics FOV                     31.5°H x 23.5°V     41°H x 30.75°V     33.5°H x 25°V
Detector Readout Frame Rate    60Hz (typical)      60Hz               30Hz (interlaced)

Table A.2: Updated Sensor Specifications

The algorithms operate on a set of three time-aligned images, where each image is
acquired by an individual camera of the EVS. Each of the video streams is recorded, or
post-processed, with video timecode information in each frame. The frames are time-aligned
simply by finding the frames with matching timecodes. This set of time-aligned frames
is then used to obtain the registration parameters with respect to the baseline frame. All
other frames of the video sequence can be processed with the same parameters. Each frame,
including the ones from the color CCD sensor, is converted to grayscale before registration
and further processing.
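Time alignment by matching timecodes amounts to intersecting the sets of timecodes found in each stream. The following Python sketch illustrates the idea; the data layout (a dictionary from timecode string to frame for each camera) and the function name `time_aligned_sets` are assumptions made for this example and do not describe the actual EVS recording format.

```python
def time_aligned_sets(streams):
    """streams: {camera_name: {timecode: frame}}.

    Returns {timecode: {camera_name: frame}} for every timecode that is
    present in all of the streams.
    """
    common = set.intersection(*(set(frames) for frames in streams.values()))
    return {tc: {cam: frames[tc] for cam, frames in streams.items()}
            for tc in sorted(common)}
```

Only timecodes present in every stream produce a set, so a frame dropped by any one camera removes that instant from consideration rather than producing a partially filled set.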

A.2 Registration algorithms

Our first solution for image registration is based solely on camera sensor specifications. The
cameras were assumed to be properly bore-sighted at installation; thus the only distortion
parameters to account for in registration are the differences in FOVs and resolutions. This
algorithm, called the SS (sensor specifications) algorithm, performs registration by first
equalizing the FOVs and then resampling the distorted image to match the reference resolution.
Based upon the lessons learned from the SS algorithm, a geometric image-to-image
registration algorithm was implemented. Both of these algorithms are discussed below. For
each of the algorithms, we use the SWIR image as the baseline since it has the "worst"
image parameters (the smallest FOV and poorest spatial resolution). The size of an image
can be modified through interpolation but we cannot increase the FOV.


A.2.1 SS algorithm

The first step of the SS algorithm is to equalize the instantaneous FOVs (IFOVs) of the
sensors. The FOV is the angular extent of the full image on the sensor and the IFOV is
the angular extent of an individual detector element, i.e., the solid angle through which a
detector element is sensitive to radiation.
From Figures A.1, A.2, and A.3, we observe that the visible portion of the images
is actually smaller than the full image capture window. The FOVs listed in Table A.2
are assumed to correspond to the visible portion and not the capture window. Thus, the
first stage of processing is to crop the images to the visible portions. The second stage of
processing is to ensure that the two images represent the same portion of the scene.
Since the FOVs of the SWIR and the LWIR sensors differ (the LWIR has the greater FOV and
hence captures a wider swath of the scene), the LWIR image needs to be cropped so that
it encompasses the same FOV as the SWIR image. The dimensions
of the cropped LWIR image (the number of columns and rows) are determined by
a simple scaling operation.

The horizontal and vertical IFOVs of the LWIR image are obtained using

\[ \text{IFOV-LWIR-HORIZONTAL} = \frac{\text{FOV-LWIR-HORIZONTAL}}{\text{LWIR-COLS}} \tag{A.5} \]

and

\[ \text{IFOV-LWIR-VERTICAL} = \frac{\text{FOV-LWIR-VERTICAL}}{\text{LWIR-ROWS}}, \tag{A.6} \]

respectively. The number of cropped columns and rows for the LWIR image is then determined by

\[ \text{columns} = \frac{\text{FOV-SWIR-HORIZONTAL}}{\text{IFOV-LWIR-HORIZONTAL}} \tag{A.7} \]

\[ \text{rows} = \frac{\text{FOV-SWIR-VERTICAL}}{\text{IFOV-LWIR-VERTICAL}} \tag{A.8} \]
After cropping, the SWIR and LWIR FOVs are equal, but since the dimensions of the
cropped LWIR are different from the dimensions of the SWIR, the IFOVs of the LWIR
and the SWIR images are still different. To make the IFOVs the same we must resample
the cropped LWIR image so that it is the same size as the SWIR image. This entails: (1)
computing an expansion factor that will make the dimensions of the cropped LWIR image
greater than the dimensions of the SWIR, (2) pixel replicating the cropped LWIR based
on the expansion factor, and (3) downsampling the expanded LWIR image to the SWIR
dimensions. We use the bilinear interpolation method [10]. Nearest neighbor interpolation
can also be selected if desired, but bilinear interpolation is more spatially accurate and
results in images that are slightly smoother.
A similar sequence of operations is performed between the SWIR image and the visible
image. If the FOVs are the same, as in Table A.1, then the visible image is simply downsampled
to match the SWIR resolution. The initial results from the SS algorithm clearly
indicated that the distortions present in the images were not excessive, but they also were
not limited to FOV and resolution differences.

A.2.2 MLR algorithm

Based on the results obtained from the SS algorithm, a more general, geometric image-to-image
registration algorithm was implemented. The distortions between the images seem to
be due to sensor translation, (slight) rotation, scale change, and, possibly, shear. An affine
transform is, thus, used to model the spatial transformation. Control points are manually
selected for identifying and matching corresponding features between the reference and
distorted images. Since we assume that the sensors do not change alignment over time, we
only need to register one baseline set of images; the resulting transform can subsequently be used for the rest
of the image frames.
We use point mapping without feedback [9] to approximate the global affine transformation.
The first stage of the MLR algorithm is to select a minimum of three non-collinear
control points from two input images. More points can be chosen to make the coefficients
more representative of the distortions throughout the overall image if the points are
well distributed. Global distortion representation is also improved by choosing pixels on
the perimeter if possible. The control points are then analyzed using multiple linear regression
[65, 101] to approximate the coefficients of the affine transform. Residuals are calculated
to determine the accuracy of the regression model obtained. The defined affine transform
provides a mapping between the baseline and distorted images. The distorted image is
then resampled using the transform parameters to create the registered image. Bilinear
interpolation is used for resampling.
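The regression step can be illustrated with a small least-squares fit. The Python sketch below is not the code used for the EVS. It assumes that each output coordinate is modeled as b0 + b1*x + b2*y (the pairing of b1 with x and b2 with y is an assumption), solves the normal equations directly, and uses the hypothetical helper names `solve3`, `fit_affine`, and `residuals`.

```python
def solve3(A, b):
    """Solve the 3x3 system A p = b by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            for c in range(i, 4):
                M[r][c] -= f * M[i][c]
    x = [0.0] * 3
    for i in (2, 1, 0):
        s = sum(M[i][c] * x[c] for c in range(i + 1, 3))
        x[i] = (M[i][3] - s) / M[i][i]
    return x

def fit_affine(src, dst):
    """Least-squares fit of u = b0 + b1*x + b2*y for each output coordinate u.

    src, dst: lists of (x, y) control points (at least three, non-collinear).
    Returns [[b0, b1, b2] for x', [b0, b1, b2] for y'].
    """
    rows = [(1.0, x, y) for x, y in src]
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    coeffs = []
    for k in (0, 1):   # k = 0 fits x', k = 1 fits y'
        Atb = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        coeffs.append(solve3(AtA, Atb))
    return coeffs

def residuals(src, dst, coeffs):
    """Per-point (x', y') errors of the fitted model."""
    bx, by = coeffs
    return [(u - (bx[0] + bx[1] * x + bx[2] * y),
             v - (by[0] + by[1] * x + by[2] * y))
            for (x, y), (u, v) in zip(src, dst)]
```

With exactly three control points the fit passes through every point and the residuals are zero, so they carry no information about accuracy; residuals only become a useful check when more than three well-distributed points are supplied.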

A.3 Results

To demonstrate the performance of the algorithms we processed a set of videos taken by
the EVS cameras during a flight test at Patrick Henry airport in Newport News, Virginia.
The video sequence was taken as the NASA 757 aircraft approached a runway, and was
digitized using a Canopus Video Board. Three images (one from each camera) time-aligned
at 00:26:14:18 were used for registration. As stated earlier, the SWIR image is used as the
baseline for registration since it has the poorest spatial resolution and FOV. The SWIR,
LWIR and visible images are shown in Figures A.1, A.2 and A.3, respectively. To provide
a similarity metric to validate the performance of the registration algorithms we display the
absolute difference of the reference and corrected images. This provides a visible validation
of the registration process since features such as runway edges should align if registration
is performed correctly.
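The absolute-difference display is a pixel-wise operation, sketched below in Python for images stored as lists of rows. The function names are hypothetical, and the mean absolute difference is an optional scalar summary added for this example; the evaluation in this appendix is visual only.

```python
def abs_difference(ref, reg):
    """Pixel-wise |ref - reg| for two equally sized grayscale images (lists of rows)."""
    return [[abs(a - b) for a, b in zip(ra, rb)] for ra, rb in zip(ref, reg)]

def mean_abs_difference(ref, reg):
    """Average of the absolute-difference image (a summary not used in the text)."""
    diff = abs_difference(ref, reg)
    return sum(map(sum, diff)) / (len(diff) * len(diff[0]))
```

Because the sensors respond to different spectral bands, a well-registered pair will still produce a non-zero difference image, so such a scalar is best used to compare candidate registrations of the same pair rather than as an absolute measure.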

Figure A.1: Original SWIR

Figure A.2: Original LWIR

Figure A.3: Original Visible

A.3.1 SS algorithm

Applying the SS algorithm with the SWIR image as the baseline, and the LWIR and visible
images as distorted images, yields the "registered" SWIR, LWIR and visible images shown
in Figures A.4, A.5 and A.6, respectively. The FOV of the LWIR image has been made
smaller to match the FOV of the SWIR image. This change in FOV can clearly be seen in
the horizontal direction of Figure A.5, by observing that the blurred artifact (which is an
antenna in the FOV of the camera) in the upper left corner of the original LWIR is now
almost completely removed in the registered image. In the vertical direction, the decrease
in FOV is noted by the missing timecode at the top and the missing ground features at the
bottom of Figure A.5 that are in the original image. The IFOVs have also been matched
through resampling. The general effects of resampling can be seen by observing the expansion
of image features from Figure A.2 to Figure A.5. The FOV of the original visible image
in Figure A.3 has been made slightly smaller, again to match the FOV of the SWIR, in
Figure A.6. Since the FOVs nearly match and the image dimensions are the same, there
is only a small expansion to match IFOVs; hence the registered image features are only
slightly enlarged from the original.
Figure A.7 is the differenced SWIR and SS registered LWIR, and Figure A.8 is the
differenced SS registered LWIR and SS registered visible image. The misalignment between
the images after registration can clearly be seen in Figure A.7 by observing the difference
in the outline of the runway from the LWIR component of the image, and the runway lights
from the SWIR image. There is at least a large translation and a small rotation difference
between the SWIR and registered LWIR. Similarly, the misalignment between the registered
LWIR and visible images differenced in Figure A.8 can also be seen by noting the difference
in the outline of the runway from the LWIR image, and the runway lights from the visible
image. Again, there is an obvious translation between the images. Figures A.7 and A.8
clearly display the misalignment between the images, thus indicating that differences in
sensor design characteristics are not the only cause of distortion between the images.

Figure A.4: Cropped SWIR

Figure A.5: SS Reg. LWIR

Figure A.6: SS Reg. visible

Figure A.7: SWIR and SS Registered LWIR

Figure A.8: SS Registered LWIR and visible

A.3.2 MLR algorithm

First we applied the MLR algorithm to the original (uncropped) SWIR and visible images,
again using the SWIR as the baseline. Due to the lack of features around the perimeter of
the SWIR image we used the runway lights as control points. Note that we are only using
three control points for demonstration purposes. Figure A.9 repeats the original SWIR
image for reference. Figure A.10 shows the registered visible image and Figure A.11 is
the differenced SWIR and registered visible image. The coefficients obtained are given in
Table A.3.

        b0            b1           b2
x'      -0.546156     1.021212     -0.004578
y'      -20.440557    -0.007477    0.972837

Table A.3: Visible to SWIR MLR Coefficients

A close look at the runway and the runway lights in the two images shows that they are
now registered. In particular, in the lower right corner of the SWIR image there are four
runway lights lined up horizontally. In the visible image there are three runway lights in
the same position, except the second light from the left is not visible. Figure A.11 shows
the four lights differenced in a horizontal line with the missing visible light filled in from
the SWIR image. The warp performed during registration can clearly be seen by observing
the timecode size and location differences in the differenced images. The timecodes are the
same size and at the same location in the original images. Figure A.9 and Figure A.10 could
now be equally cropped to remove disjoint pixels around the perimeter to obtain the final
images to be fused.

Figure A.9: Repeated SWIR

Figure A.10: MLR Reg. visible

Figure A.11: SWIR and MLR Reg. visible

Next we applied the MLR algorithm to the registered visible and LWIR images using the
visible image as the baseline. Since the runway lights are not visible in the LWIR image, we
use the intersecting lines at the bottom and top of the runway, and a stripe at the beginning
of the runway towards the right in the LWIR image as control points. Figure A.12 repeats
the MLR registered visible image for reference. Figure A.13 is the registered LWIR image
and Figure A.14 is the differenced registered visible and the just registered LWIR images.
The coefficients obtained are given in Table A.4.

        b0           b1           b2
x'      8.347350     0.850628     0.037684
y'      9.637629     -0.012015    0.779082

Table A.4: LWIR to visible SWIR MLR Coefficients

As is evident in Figure A.14, the runway portion of the LWIR image is aligned with the
runway portion of the registered visible image. Also, the runway lights from the visible
image border the perimeter of the LWIR runway. The one runway stripe selected as a
control point is aligned. The taxiways on the right side of the image and the horizon
across the image are also aligned. Again, any disjoint pixels around the perimeter could be
removed by cropping. At this point all three original images are registered.

Figure A.12: Repeated MLR Registered visible

Figure A.13: MLR Reg. LWIR

Figure A.14: MLR Reg. visible and LWIR

As a final test of the MLR algorithm we applied the same control point coefficients
to a later frame in the video sequence. Figures A.15, A.16 and A.17 are the SWIR,
LWIR and visible images at time 00:26:14:28, 10 seconds later in the sequence. Figure A.18
shows the MLR registered visible image. Figure A.19 is the differenced SWIR and MLR
registered visible images. Figure A.20 is the MLR registered LWIR image. Figure A.21 is
the differenced MLR registered visible and MLR registered LWIR image. As in the previous
set of images, the registration can be observed by noting the alignment of the runway and
runway lights in Figures A.19 and A.21.

A.3.3 Discussion

The images shown visually demonstrate the performance of the two algorithms on typical
image data from the EVS. The registration inaccuracies of the SS algorithm are obvious.

Figure A.15: Orig. SWIR at Time 26:14:28

Figure A.16: Orig. LWIR at Time 26:14:28

Figure A.17: Orig. visible at Time 26:14:28

Figure A.18: MLR Registered visible at Time 26:14:18

Figure A.19: SWIR and MLR Registered visible

The differing FOV and resolution specifications given do not take into account the other
distortions within the images. With this much discrepancy there seems to be either a
fundamental problem in the bore-sighting or alignment of the cameras, or the alignment
is changing during flight. If the sensors were actually bore-sighted and aligned, the SS
algorithm should be able to match the performance of the MLR algorithm and, in addition,
not require any manual intervention such as selection of control points. True FOV values could
be obtained from a thorough ground calibration, and non-interpolated pixels of the actual
image dimensions could be obtained from raw digital data streams from the cameras.
The MLR algorithm provides b etter registration of the images than the SS algorithm
configured with the current set of specifications and with the current EVS alignment. In


Figure A.20: MLR Registered LWIR at Time 26:14:18

Figure A.21: MLR Registered visible and LWIR

our examples the runway and runway lights are clearly aligned. The coefficients obtained
with only three points indicate that there are rotation, translation, scale and possibly shear
distortion components between the images. These distortions can be seen by viewing
the timecode warps at the top of the differenced images. The application of the same MLR
control points to a set of time-aligned images later in the same video sequence produced
the same level of registration. This indicates that we could successfully use the registration
coefficients obtained from one set of time-aligned images to register at least a group of
frames from the video sequence. If the alignment is not changing substantially during flight,
then all frames could be processed with the same transform.

A.4 Summary

Image registration is an essential prerequisite to subsequent image fusion. We have produced
two algorithms to perform multi-image registration for the EVS. The SS algorithm uses
EVS camera specifications and performs registration based solely on these parameters. The
performance of this algorithm indicates that there is a severe inaccuracy in the boresighting


or alignment of the cameras. Correction of these issues should improve the performance
of the algorithm and allow it to be used to automatically register all images across the
cameras, or as validation of the MLR algorithm.
The MLR algorithm uses control point selection and linear regression to compute the
coefficients of an affine spatial transformation. This transformation is then used to register
the LWIR and visible images to the SWIR image. In addition, the MLR registration
algorithm provides a means to generate a base set of coefficients for post-processing of the
full video stream across all cameras. We have subsequently used a set of baseline coefficients
to process an entire 20 second video clip from each of the three cameras.
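
The regression step can be illustrated with a minimal sketch. It is written in Python for brevity and is not the code used for the EVS; the function names (solve3, fit_affine, apply_affine) and the sample control points are assumptions made only for illustration. It fits the six coefficients of u = a0 + a1*x + a2*y and v = b0 + b1*x + b2*y to pairs of control points by ordinary least squares.

```python
def solve3(a, b):
    """Solve the 3x3 system a*x = b by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("control points are collinear or too few")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_affine(src, dst):
    """Least-squares affine fit mapping src control points onto dst points.

    Returns ((a0, a1, a2), (b0, b1, b2)).
    """
    rows = [(1.0, x, y) for x, y in src]
    # Normal equations (A^T A) c = A^T t; A^T A is shared by both coordinates.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atu = [sum(r[i] * u for r, (u, _) in zip(rows, dst)) for i in range(3)]
    atv = [sum(r[i] * v for r, (_, v) in zip(rows, dst)) for i in range(3)]
    return tuple(solve3(ata, atu)), tuple(solve3(ata, atv))


def apply_affine(coeffs, point):
    """Map one (x, y) point through the fitted transform."""
    (a0, a1, a2), (b0, b1, b2) = coeffs
    x, y = point
    return (a0 + a1 * x + a2 * y, b0 + b1 * x + b2 * y)


# Hypothetical control points: four corners and their matched locations.
src_pts = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
dst_pts = [(5 + 1.1 * x - 0.1 * y, -3 + 0.1 * x + 1.1 * y) for x, y in src_pts]
coeffs = fit_affine(src_pts, dst_pts)
```

With exactly three non-collinear points the fit is exact; additional points over-determine the system and the regression averages out selection error. Once computed, the same coefficients can be reused for every frame, as long as the camera alignment does not change.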
In addition, the coefficients obtained could also be used to back out the actual distortion
values (translation amount, rotation angle, etc.) for feedback to the EVS designers.
Improvements could also be made in the computation of the coefficients by using point
selection with feedback or other more robust feature selection mechanisms. Manual control
point selection can be improved by MSR enhancement of the images to emphasize
and sharpen features prior to registration. This was done for another EVS data set and
greatly improved the ability to select corresponding points. Most importantly, the actual
boresighting and alignment can be checked against the values obtained from MLR and SS
registration, and adjusted appropriately. This procedure could be performed both before
and after EVS flight opportunities and used to verify and validate system alignment.
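
One way to back out distortion values from a set of affine coefficients is sketched below. The decomposition convention (rotation, followed by per-axis scale, followed by a shear along x) is an assumption chosen for illustration, since other factorizations are equally valid; the function name decompose_affine is hypothetical.

```python
import math


def decompose_affine(a, b):
    """Split u = a0 + a1*x + a2*y, v = b0 + b1*x + b2*y into named distortions.

    Assumed model: [[a1, a2], [b1, b2]] = R(theta) * diag(sx, sy) * [[1, k], [0, 1]].
    """
    a0, a1, a2 = a
    b0, b1, b2 = b
    sx = math.hypot(a1, b1)
    if sx == 0.0:
        raise ValueError("degenerate transform: first column is zero")
    theta = math.atan2(b1, a1)
    h = (a1 * a2 + b1 * b2) / sx   # off-diagonal term of the upper-triangular factor
    sy = (a1 * b2 - a2 * b1) / sx  # determinant divided by sx
    return {
        "translation": (a0, b0),
        "rotation_deg": math.degrees(theta),
        "scale_x": sx,
        "scale_y": sy,
        "shear": h / sx,
    }


# Hypothetical coefficients: a 5 degree rotation, slight anisotropic scale and shear.
_t = math.radians(5.0)
_c, _s = math.cos(_t), math.sin(_t)
example = decompose_affine(
    (12.0, _c * 1.1, _c * 1.1 * 0.05 - _s * 0.9),
    (-7.0, _s * 1.1, _s * 1.1 * 0.05 + _c * 0.9),
)
```

Values reported this way (for example, a rotation angle of a few degrees between two cameras) are easier to relate to mechanical boresighting adjustments than the raw coefficients.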


Appendix B

Field Programmable Gate Arrays
In the future we can capitalize on the lessons learned from mapping the real-time Retinex
algorithm into a DSP architecture, and possibly use an alternative technology that will allow
full customization of our design. One of these technologies is field programmable gate arrays
(FPGAs). Architecture optimization usually means improving a system by properly
allocating resources, such as memory or DMA channels, to increase execution speed or
bandwidth. FPGAs redefine this term to apply at a much lower level
of abstraction. Specifically, FPGAs are composed of a large matrix of logic cells, routing
resources, and I/O blocks that must be selected, configured and interconnected. Figure B.1
is a block diagram of a typical FPGA architecture. A logic cell can be as simple as a
transistor pair or 2-input NAND gate, or as complex as a full microprocessor core. Logic cells
are typically based on multiplexers and basic logic gates, or SRAM-based look-up tables
(LUTs), and are generally used to implement combinatorial or sequential logic functions.
The routing resources implement the “field-programmable” portion of the FPGA definition.
They are the interconnect fabric (wires) and electrical switches that are programmed


Figure B.1: High-level block diagram of a typical FPGA architecture (logic cells, I/O blocks, and interconnect)

and can (usually) be reprogrammed in-situ, i.e. after the device is manufactured or even during
active operation. This concept leads to the idea of chip-level reconfigurable computing.
Three primary programming technologies are used to implement the switches [58]: pass
transistors controlled by the status of an SRAM bit, electrically programmable read only
memory (EPROM) floating-gate transistors, or small antifuse switches electrically formed
once by creating a low resistance path to ground. FPGAs that use write-once antifuses
are tedious to use because the design must be complete and verified before programming,
but they provide the benefits of low resistance and parasitic capacitance, high reliability
and density, and can be relatively easily fabricated in a radiation-hardened foundry. Xilinx
is a large manufacturer of SRAM-based FPGAs, the Altera Max products are CPLDs, and


Actel is a major vendor of antifuse-based FPGAs. The I/O blocks are special-purpose logic
cells, generally spread around the periphery of the device, that are used to buffer input and
output signals. They can usually be configured to transfer input, output, or bi-directional
signals.
The logical functionality of an FPGA is programmed into the device through a number
of design stages [51]. First, a high-level design is entered as a structural design, normally
through a schematic, or as a behavioral design using hardware description languages, such as
VHDL or Verilog. Computer-aided electronic design automation tools exist for both, and
often offer alternative entry methods, such as state-machine or waveform editors. Logic
synthesis is performed next, where the high-level design is compiled into a netlist and
translated into the available cells and technologies provided on the FPGA. Several issues
are addressed during this stage, such as design size checks and redundancy elimination.
After this, place and route is performed, where cell placement is determined and the routing
interconnect is defined. Finally, a configuration bit file is generated and downloaded into the
device for programming. Because of the complexity of most FPGA architectures, functional
and timing simulations are often performed concurrently and iteratively with the design
stages. This allows the designer to correct errors before programming the device and is
critical to ensure the successful implementation of antifuse-based devices. Test vectors that
provide stimulus to both the simulator and the actual device are also often generated to aid in
debugging and to verify and validate behavior.
FPGA capacities in the late 1980's were on the order of thousands of usable gates [51].
They were often used as “glue logic”, absorbing the functionality of a variety of
miscellaneous logic, and performed functions like providing interfaces to external memories or


peripherals. Over the last few years the density and capabilities of FPGAs have increased
tremendously. As an example, the XC2VP100 is the largest FPGA in the Xilinx Virtex-II Pro
family of devices, first introduced in January 2002. It contains 99,216 logic cells,
7,992 kbits of BRAM, 444 dedicated 18 x 18 multipliers, 12 digital clock managers (DCMs),
1164 user I/O pins, 2 PowerPC RISC processors, and twenty 3.125 Gbps Rocket I/O serial
transceivers. Each logic cell contains a 4-input LUT, a flip-flop and carry logic. BRAM is
block RAM comprised of distributed and global dual-port SRAM. As FPGA densities have
increased, so has the number of potential uses. They are now often used as co-processors,
hardware accelerators, or custom, reconfigurable computing architectures. Several authors
have suggested and implemented individual image processing functions [45, 46, 11, 12] as
well as full platform and system solutions [1, 17, 38, 14, 68]. Xilinx and other vendors
offer several DSP cores, such as 2-D 1024-point FFTs and YCrCb-to-RGB converters, that
perform complex processing functions and can be easily inserted into a design.
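
To make the role of a LUT-based logic cell concrete, the following sketch models a 4-input LUT in software. It is a behavioral illustration only, not vendor code; the function names make_lut4 and lut4_eval are assumptions. The 16 stored bits play the part of the SRAM configuration, and the four inputs simply form an address into that table.

```python
def make_lut4(func):
    """Build the 16-entry truth table a 4-input LUT would store for func."""
    return [
        int(bool(func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)))
        for i in range(16)
    ]


def lut4_eval(table, a, b, c, d):
    """Evaluate the LUT: the inputs select one stored configuration bit."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


# A full adder mapped onto two LUTs (the fourth input is unused here).
SUM_LUT = make_lut4(lambda a, b, cin, _unused: a ^ b ^ cin)
CARRY_LUT = make_lut4(lambda a, b, cin, _unused: (a & b) | (cin & (a ^ b)))


def full_adder(a, b, cin):
    """Return (sum, carry_out) using only table look-ups."""
    return lut4_eval(SUM_LUT, a, b, cin, 0), lut4_eval(CARRY_LUT, a, b, cin, 0)
```

Reprogramming the device amounts to loading different table contents, which is why the same physical cell can implement any Boolean function of its four inputs.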
We could design and map a new version of the algorithm into this technology, taking
advantage of its capabilities. We may be able to utilize a single high-density FPGA
to process three spectral bands of image data in parallel. Otherwise, we could use a
multi-FPGA platform that would allow pipelining the major components of MSCR processing.
Bandwidth issues could be reduced since we could create internal buses with widths
optimized for our data transfer requirements. High-level code could be written using VHDL and
synthesized using Synplicity FPGA development tools. We also have access to other
high-level tools, such as Matlab, that could be used for design, simulation, and test. An FPGA
Retinex processing core could eventually be developed for widespread use in other Xilinx
platforms.


Appendix C

DM642 EVM Flash Programming Guidelines
Many embedded applications must execute automatically at system power-up, after reset,
without outside intervention. This is often accomplished by storing application
code in a non-volatile memory such as a read only memory (ROM) or flash memory. At
power-up (or boot) the stored code is automatically copied into a runtime memory location
in random access memory (RAM), and execution then begins with a branch to the starting
address of the program. We require this automatic start-up capability for our DM642 EVM
based implementation used in the EVS. The EVS is required to work as an
embedded, autonomous system. Power-up and power-down cycles are performed frequently
during pre-flight check-outs and when the plane has stopped at other airport facilities. No
operator is continually available to monitor the system and repeatedly reload code from a
host; therefore loading and executing code autonomously is required.
The DM642, like all the other TI DSPs, has a set of facilities to support bootloading.

Three boot configuration modes are supported: no boot, ROM boot, and host boot. In no
boot mode no action is performed at boot, and in host boot mode an external host controls the
boot process. In ROM boot mode, after reset is released, the CPU is stalled until 1-KByte
of memory is copied from the beginning of an external ROM to RAM address 0 using the
EDMA controller. After this transfer is complete the CPU is released and starts to execute
code at address 0.
Many applications will not fit in 1-KByte of memory. In this case, the code that is
copied is usually a second-level bootloader that, in turn, copies the rest of the application
into RAM. The DM642 EVM has 4-MBytes of 8-bit wide flash. The flash is mapped into
the 0x90000000 to 0x9007FFFF (lower CE1 space) address range of the DM642 using 19
address bits (A0-A18). This window is smaller than the memory space available in the flash, so
an FPGA on the EVM is used to create 3 additional address lines, extending the address
range to 4-MBytes. These 3 lines effectively act as page bits, dividing the address space
into eight 0.5-MByte pages. Unfortunately, they default to 000 at power-on reset because the
SRAM-based FPGA becomes unconfigured at reset and tri-states the output of all I/O. We
discuss the ramifications of this next.
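
The paging arithmetic can be sketched as follows. The constants come from the description above (a 19-bit window based at 0x90000000 and 3 page bits); the function name split_flash_offset is an assumption used only for illustration, and Python is used for brevity rather than DSP code.

```python
FLASH_BASE = 0x90000000      # start of the lower CE1 window on the DM642
ADDR_BITS = 19               # A0-A18 driven directly by the DSP
PAGE_BITS = 3                # extra address lines supplied by the FPGA
PAGE_SIZE = 1 << ADDR_BITS   # 0x80000 bytes = 0.5 MByte
FLASH_SIZE = PAGE_SIZE << PAGE_BITS  # 4 MBytes in total


def split_flash_offset(offset):
    """Map a flat flash offset to (page number, CPU address inside the window)."""
    if not 0 <= offset < FLASH_SIZE:
        raise ValueError("offset outside the 4-MByte flash")
    page = offset >> ADDR_BITS
    cpu_address = FLASH_BASE + (offset & (PAGE_SIZE - 1))
    return page, cpu_address
```

For example, flat offset 0x80000 is the first byte of page 1 and appears to the CPU at 0x90000000 once the page bits are set to 1.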
Most of the first executable files generated for our implementations were
about 500-KBytes in size and would fit on the first page of flash. After adding Ethernet service
components (and the large libraries required by them) to allow a user to perform parameter
updates from any available laptop, our executable code size grew to ≈ 677-KBytes.
Placing these executables in flash memory requires the Flashburn utility provided by TI.
This utility requires the file to be burned to be in one of several specific formats. We chose
the hex format and used the TI-supplied hex6x utility to perform the conversion.


TI supplies a sample second-level boot loader assembly file (boot.asm) that is used to
load a user's application. When the boot.asm file is included with the application, the hex6x
conversion routine properly allocates the boot code (at address 0x00) and the application
code (defaulting to address 0x400, immediately after the 1-KByte boot code). When the file is
burned into flash, the Flashburn utility physically places the code in memory according to the
values in the hex file and any offsets selected at burn time by the user.
The EVM board manufacturer, Spectrum Digital, supplies a default flash program that
contains the configuration bits (of size 0x393D0 or ~234-KBytes) for the FPGA, and a
program (fpga_loader) that loads the FPGA with these configuration bits. So the default
setup would have boot.asm at address 0, the fpga_loader code at address 0x400 and the
configuration bits at address 0x40000. The boot code would be loaded into RAM at reset.
After reset, it would copy the fpga_loader code into RAM and branch to the entry point of
the fpga_loader, which subsequently loads the configuration bits into the FPGA.
The main issue is that the FPGA controls the addressing used for the flash (the page
bits) and if it is not properly configured, the upper pages of flash cannot be accessed. So
attempting to use flash above page 0 (above 512-KBytes) becomes a non-trivial issue. Our
first attempts failed because our standard routine had been to erase flash, burn the FPGA
configuration bits and our application (with boot.asm embedded in it), and then restart
the system. This worked as long as our application fit on one page of flash. Now that our
code was greater than one page, the burn failed because the page address lines are all at zero and
addresses above 512-KBytes are mapped into lower memory. In addition, after getting data
burned on more than one page, we needed a method to copy the upper pages of information
into RAM and restart execution.
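
The failure mode can be illustrated with a small model of the address wrap. This is a sketch under a simplifying assumption that is not taken from the actual burn procedure: a single image is written contiguously starting at flash offset 0. The function names are hypothetical.

```python
PAGE_SIZE = 0x80000  # 512 KBytes reachable through A0-A18


def physical_offset(flat_offset, page_bits_driven):
    """Return the flash offset actually written for a requested flat offset."""
    if page_bits_driven:
        return flat_offset
    # With the FPGA unconfigured the page lines stay at 000, so only the
    # low 19 address bits reach the flash and the access wraps into page 0.
    return flat_offset % PAGE_SIZE


def wrapped_bytes(image_size, page_bits_driven):
    """Count bytes of an image burned from offset 0 that alias onto page 0."""
    if page_bits_driven:
        return 0
    return max(0, image_size - PAGE_SIZE)
```

Under this assumption a 677-KByte image would wrap its last 168,960 bytes back onto the start of page 0, overwriting whatever was burned there first, whereas a 500-KByte image is unaffected.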


Our solution was to first burn the fpga_loader with the default boot code and the FPGA
configuration bits into the first page of flash and reboot, thus configuring the FPGA. This
gives us full access to all of flash memory. We then modified the default fpga_loader program
so that at the end of its execution it now (1) changes the FPGA page bits from 0 to
1 and (2) branches to a third-level bootloader at address 0x90000000 to load our application.
To change the register that controls the page bits to 1 we use a function supplied in the
dm642 board support library: evmdm642_rset(evmdm642_flashpage,1). To branch, we use
three simple assembly language instructions in C code: asm(" MVKL 0x90000000,A15");
asm(" MVKH 0x90000000,A15"); and asm(" BNOP A15,0x5");.
Next we burned into memory the FPGA configuration bits at 0x90040000, the modified
fpga_loader with the default second-level boot code, and our application code embedded with
the third-level boot code at address 0x90080000. The address change is performed using
an offset of 0x80000 in the Flashburn utility. When the modified fpga_loader branches to
the address 0x90000000 with the page bits set to page 1, we are actually addressing address
0x80000 of the flash. The third-level bootloader simply loads our application into RAM from
flash address 0xC0000, branches to the start of the application code and begins execution.
Finally, here are a few miscellaneous notes on the discussion above. When burning the
configuration bits, the default hex file already has the 0x40000 offset built in, so nothing
has to be done in the Flashburn utility to place the data there. Similarly, the default and
modified fpga_loader hex files instruct Flashburn to place the boot code at 0x00 and the
application code at 0x400. The application hex file is also built under the assumption that
the boot code is placed at 0x00 and the application code is placed at 0x400. We force the
Flashburn tool to provide the offsets of 0x80000 and 0x80400 respectively.
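
A quick consistency check of the resulting flash layout can be sketched as follows. The offsets and the configuration-bit size are taken from the text; the application size is the approximate 677-KByte figure, the fpga_loader code at 0x400 is omitted because its size is not stated, and the helper names are assumptions.

```python
PAGE_SIZE = 0x80000  # 0.5-MByte flash page

# name: (flat flash offset, size in bytes)
REGIONS = {
    "second-level boot code": (0x00000, 0x400),
    "FPGA configuration bits": (0x40000, 0x393D0),
    "third-level boot code": (0x80000, 0x400),
    "application (approx.)": (0x80400, 677 * 1024),
}


def find_overlaps(regions):
    """Return name pairs whose byte ranges intersect.

    After sorting by start offset, any overlap shows up between neighbours.
    """
    spans = sorted((start, start + size, name) for name, (start, size) in regions.items())
    return [(a[2], b[2]) for a, b in zip(spans, spans[1:]) if b[0] < a[1]]


def pages_spanned(start, size):
    """List the flash pages touched by a region."""
    return list(range(start // PAGE_SIZE, (start + size - 1) // PAGE_SIZE + 1))
```

With these numbers the configuration bits end at 0x793D0, just inside page 0, and the application occupies pages 1 and 2, which is why the page bits must be working before the application can be burned or loaded.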


Our current method executes the third-level bootloader out of slow flash memory.
Although this only requires a few seconds, we could speed up the loading process by copying
the third-level bootloader (the first 1-KByte of memory at 0x80000) into RAM in the same way
that the first bootloader is copied at power-up. Executing the third-level bootloader out of RAM
would then provide a faster loading time. The information we developed for this guide will
be used in a new TI application report on bootloaders for their C6x processors.


Bibliography
[1] A. Lynn Abbott, Peter M. Athanas, and Adit Tarmaster. Accelerating
image filters using a custom-computing machine. In Field Programmable Gate Arrays
(FPGAs) for Fast Board Development and Reconfigurable Computing, John Schewel,
editor, volume 2607, pages 62-70. Proceedings of SPIE, October 1995.
[2] Irving Ashkenas, Henry R. Jex, and Duane T. McRuer. Pilot-induced oscillations:
their cause and analysis. Technical Report Northrop Corp. 64-143, NASA
Langley Research Center, June 1964.
[3] ATEME SA. IEKC64x users manual. Technical report, ATEME, Bievres, France,
2003.
[4] Andrew Bateman and Iain Paterson-Stephens. The DSP Handbook: Algorithms,
Applications, and Design Techniques. Prentice Hall, 2002.
[5] R. Bracewell, O. Buneman, H. Hao, and J. Villasenor. Fast two-dimensional
Hartley transform. In Proceedings of the IEEE, volume 74, No. 9, pages 1282-1283,
September 1986.
[6] Thomas Braunl. Parallel Image Processing. Springer, 2000.
[7] E. Oran Brigham. The Fast Fourier Transform. Prentice-Hall, 1975.
[8] Richard R. Brooks and S. S. Iyengar. Multi-Sensor Fusion: Fundamentals and
Applications. Prentice Hall, 1998.
[9] Lisa Gottesfeld Brown. A survey of image registration techniques. In ACM
Computing Surveys, volume 24, No. 4, December 1992.
[10] Howard Burdick. Digital Imaging. McGraw Hill, 1997.
[11] Chris Dick. Computing multi-dimensional DFTs using Xilinx FPGAs. In The 8th
International Conference on Signal Processing Applications and Technology,
September 13-16 1998.
[12] Chris Dick. Minimum multiplicative complexity implementation of the 2-D DCT
using Xilinx FPGAs. In Configurable Computing: Technology and Applications.
Proceedings of SPIE's Photonics East, November 1-6 1998.


[13] Neil Anthony Dodgson. Image resampling. Technical Report 261, University of
Cambridge, United Kingdom, August 1992.
[14] Bruce A. Draper, J. Ross Beveridge, A. P. Willem Boehm, Charles Ross,
and Monica Chawathe. Accelerated image processing on FPGAs. IEEE Transactions
on Image Processing, 12(12), December 2003.
[15] Pierre Duhamel. A connection between bit reversal and matrix transposition:
Hardware and software consequences. IEEE Transactions on Acoustics, Speech, and
Signal Processing, 38(11):1893-1896, November 1990.
[16] J. Eklundh. A fast computer method for matrix transposing. IEEE Transactions
on Computers, C-21:801-803, 1972.
[17] Lee Ferguson. Image processing using reconfigurable FPGAs. In High-Speed
Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic, John
Schewel, Peter M. Athanas, V. Michael Bove, Jr., and John Watson, editors,
volume 2914, pages 110-121. Proceedings of SPIE, November 1996.
[18] Leila M. G. Fonseca and B. S. Manjunath. Registration techniques for multisensor
remotely sensed imagery. In Photogrammetric Engineering and Remote Sensing,
volume 62, No. 9, pages 1046-1056, September 1996.
[19] Jack G. Ganssle. The Art of Programming Embedded Systems. Academic Press,
1992.
[20] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing.
Addison-Wesley, 1993.
[21] Ardeshir Goshtasby. Image registration by local approximation methods. In Image
Vision Computing, volume 6, pages 255-261, November 1988.
[22] David B. Harris, James H. McClellan, David S. K. Chan, and Hans W.
Schuessler. Vector radix fast fourier transform. In IEEE International Conference
on Acoustics, Speech, and Signal Processing, pages 548-551, 1977.
[23] Glenn D. Hines, Zia-ur Rahman, Daniel J. Jobson, and Glenn A. Woodell.
Multi-sensor image registration for an enhanced vision system. In Visual Information
Processing XII. Proceedings of SPIE 5108, Zia-ur Rahman, Robert A. Schowengerdt,
and Stephen E. Reichenbach, editors, April 2003.
[24] Glenn D. Hines, Zia-ur Rahman, Daniel J. Jobson, and Glenn A. Woodell.
DSP implementation of the retinex image enhancement algorithm. In Visual
Information Processing XIII. Proceedings of SPIE 5488, Zia-ur Rahman, Robert A.
Schowengerdt, and Stephen E. Reichenbach, editors, April 2004.
[25] Glenn D. Hines, Zia-ur Rahman, Daniel J. Jobson, and Glenn A. Woodell.
Single-scale retinex using digital signal processors. In Global Signal Processing
Conference, September 2004.


[26] Glenn D. Hines, Zia-ur Rahman, Daniel J. Jobson, Glenn A. Woodell,
and Steven D. Harrah. Real-time enhanced vision system. In Enhanced and
Synthetic Vision. Proceedings of SPIE 5802, Jacques G. Verly, editor, March 2005.
[27] Friedrich O. Huck, Carl L. Fales, and Zia-ur Rahman. Visual Communication:
An Information Theory Approach. Kluwer Academic, 1997.
[28] IEEE. IEEE Std 1149.1 standard test access port and boundary-scan architecture.
Technical Report SSYA002C, IEEE, New York, New York, 1993.
[29] Keith Jack. Video Demystified. Brooktree, 1993.
[30] Anil K. Jain. Fundamentals of Digital Image Processing. Prentice-Hall, 1989.
[31] John R. Jensen. Introductory Digital Image Processing. Prentice Hall, 1996.
[32] Daniel J. Jobson, Zia-ur Rahman, and Glenn A. Woodell. A multi-scale
Retinex for bridging the gap between color images and the human observation of
scenes. IEEE Transactions on Image Processing: Special Issue on Color Processing,
6(7):965-976, July 1997.
[33] Daniel J. Jobson, Zia-ur Rahman, and Glenn A. Woodell. Properties
and performance of a center/surround retinex. IEEE Trans. on Image Processing,
6(3):451-462, March 1997.
[34] Gary V. Kellog and Charles A. Wagner. Effects of update and refresh rates on
flight simulation visual displays. Technical Report 100415, NASA Langley Research
Center, February 1988.
[35] Josef Kittler and Michael J. B. Duff. Image Processing System Architectures.
Research Studies Press, 1985.
[36] Sun Yuan Kung. VLSI Array Processors. Prentice-Hall, 1988.
[37] Edward Land. An alternative technique for the computation of the designator in
the retinex theory of color vision. In Proceedings of the National Academy of Science,
volume 83, pages 3078-3080, 1986.
[38] Phillip Laplante and William Gilreath. Single instruction set architectures for
image processing. In Reconfigurable Technology: FPGAs and Reconfigurable Processors
for Computing and Communication IV, John Schewel, Philip B. James-Roxby,
Herman Schmit, and John T. McHenry, editors, volume 4867, pages 20-29.
Proceedings of SPIE, July 2002.
[39] Phillip A. Laplante and Alexander D. Stoyenko. Real-Time Imaging: Theory,
Techniques, and Applications. IEEE Press, 1996.
[40] Jon C. Leachtenauer. Electronic Image Display. SPIE Press, 2004.


[41] Hui Li, B. S. Manjunath, and Sanjit K. Mitra. A contour-based approach to
multisensor image registration. IEEE Transactions on Image Processing, 4, No. 3,
March 1995.
[42] Hui Henry Li and Yi-Tong Zhou. Automatic visual/IR image registration. In
Optical Engineering, volume 35(2), pages 391-400, February 1996.
[43] Ren C. Luo and Michael G. Kay. Multisensor Integration and Fusion for
Intelligent Machines and Systems. Ablex, Norwood, NJ, 1995.
[44] Duane T. McRuer. Pilot-induced oscillations and human dynamic behavior.
Technical Report NASA-CR-4683, NASA Langley Research Center, July 1995.
[45] Les Mintzer. The FPGA as FFT processor. In 6th International Conference on
Signal Processing Applications and Technology, pages 1378-1382, October 1995.
[46] Les Mintzer. Large FFT's in a single FPGA. In 7th International Conference on
Signal Processing Applications and Technology, volume 1, pages 895-899, October
7-10 1996.
[47] Kurt Novak. Rectification of digital imagery. In Photogrammetric Engineering and
Remote Sensing, volume 58, No. 9, pages 339-344, March 1992.
[48] Henri J. Nussbaumer. Fast Fourier Transform and Convolution Algorithms.
Springer Verlag, 1981.
[49] Alan V. Oppenheim and Ronald F. Shafer. Digital Signal Processing.
Prentice-Hall, 1975.
[50] Stephen K. Park and Robert A. Schowengerdt. Image reconstruction by
parametric cubic convolution. In Computer Vision, Graphics, and Image Processing,
volume 23, pages 258-272, 1983.
[51] David Pellerin and Michael Holley. Practical Design Using Programmable
Logic. Prentice Hall, 1991.
[52] Ioannis Pitas and Michael G. Strintzis. Algorithms for the reduction of the
I-O operations in the calculation of the 2-D DFT. In Signal Processing, volume 12,
pages 277-289, 1987.
[53] Charles A. Poynton. Digital Video and HDTV Algorithms and Interfaces. John
Wiley & Sons, 2003.
[54] William K. Pratt. Digital Image Processing. John Wiley and Sons, 1991.
[55] Rahman, see http://dragon.larc.nasa.gov for examples.
[56] Zia-ur Rahman, Daniel J. Jobson, and Glenn A. Woodell. Retinex processing
for automatic image enhancement. Journal of Electronic Imaging, 13, No. 1:100-110,
January 2004.


[57] Zia-ur Rahman, Daniel J. Jobson, Glenn A. Woodell, and Glenn D. Hines.
Multi-sensor fusion and enhancement using the retinex image enhancement algorithm.
In Visual Information Processing XI, Proceedings of SPIE 4736, Zia-ur Rahman,
Robert A. Schowengerdt, and Stephen E. Reichenbach, editors, April 2002.
[58] Zoran Salcic and Asim Smailagic. Digital Systems Design and Prototyping Using
Field Programmable Logic and Hardware Description Languages. Kluwer Academic,
2000.
[59] Robert Schowengerdt. Remote Sensing: Models and Methods for Image Processing.
Academic Press, 1997.
[60] Ravi K. Sharma and Misha Pavel. Multisensor image registration. In SID Digest.
Society for Information Display, volume XXVIII, pages 951-954, May 1997.
[61] Julius O. Smith. Mathematics of the Discrete Fourier Transform (DFT). W3K,
2003.
[62] Michael John Sebastian Smith. Application-Specific Integrated Circuits.
Addison-Wesley, 1997.
[63] Steven W. Smith. The Scientist and Engineer's Guide to Digital Signal Processing.
California Technical, 1997.
[64] Winthrop W. Smith and Joanne M. Smith. Handbook of Real-Time Fast Fourier
Transforms. IEEE Press, 1995.
[65] George W. Snedecor. Statistical Methods. Iowa State University, 8 edition, 1989.
[66] Spectrum Digital. TMS320C6713 DSK technical reference. Technical Report
506735-0001, Spectrum Digital, Stafford, Texas, May 2003.
[67] Spectrum Digital. TMS320DM642 evaluation module technical reference. Technical
Report 506845-0001, Spectrum Digital, Stafford, Texas, August 2003.
[68] Olaf Storaasli. Computing faster without CPUs: Scientific applications on a
reconfigurable FPGA-based hypercomputer. In 6th Military and Aerospace Programmable
Logic Devices Conference, September 2003.
[69] Sabine Susstrunk, Robert Buckley, and Steve Swen. Standard RGB color
spaces. In The Seventh Color Imaging Conference: Color Science, Systems and
Applications. IS&T - The Society for Imaging Science and Technology, 1999.
[70] Texas Instruments. TVP3026 data manual video interface palette. Technical
Report SLAS098B, Texas Instruments, Dallas, Texas, July 1996.
[71] Texas Instruments. TMS320C6000 technical brief. Technical Report SPRU197D,
Texas Instruments, Dallas, Texas, February 1999.

[72]

T e x a s I n s t r u m e n t s . TMS320C6000 imaging developer’s kit (IDK) video device
driver user’s guide. Technical Report SPRU499, Texas Instrum ents, Dallas, Texas,
December 2000.

[73] Texas Instruments. TMS320C6711 floating-point digital signal processor data
manual. Technical Report SPRS073D, Texas Instruments, Dallas, Texas, September
2000.

[74] Texas Instruments. TVP5022 data manual NTSC/PAL video decoder. Technical
Report SLAS274, Texas Instruments, Dallas, Texas, July 2000.

[75] Texas Instruments. Implementing fast Fourier transform algorithms of real-valued
sequences with the TMS320 DSP platform. Technical Report SPRA291, Texas
Instruments, Dallas, Texas, August 2001.

[76] Texas Instruments. TMS320C6000 imaging developer's kit (IDK) programmer's
guide. Technical Report SPRU495A, Texas Instruments, Dallas, Texas, September
2001.

[77] Texas Instruments. TMS320C6000 imaging developer's kit (IDK) user's guide.
Technical Report SPRU494a, Texas Instruments, Dallas, Texas, September 2001.

[78] Texas Instruments. TMS320C6000 peripherals reference guide. Technical Report
SPRU190D, Texas Instruments, Dallas, Texas, February 2001.

[79] Texas Instruments. TMS320C64x technical overview. Technical Report
SPRU395B, Texas Instruments, Dallas, Texas, January 2001.

[80] Texas Instruments. TMS320 DSP/BIOS user's guide. Technical Report
SPRU423B, Texas Instruments, Dallas, Texas, November 2002.

[81] Texas Instruments. TMS320C6000 chip support library API user's guide.
Technical Report SPRU401E, Texas Instruments, Dallas, Texas, December 2002.

[82] Texas Instruments. TMS320C6000 programmer's guide. Technical Report
SPRU198G, Texas Instruments, Dallas, Texas, August 2002.

[83] Texas Instruments. TMS320C64x DSP library programmer's reference. Technical
Report SPRU565A, Texas Instruments, Dallas, Texas, April 2002.

[84] Texas Instruments. TMS320C6713 floating-point digital signal processor data
manual. Technical Report SPRS186B, Texas Instruments, Dallas, Texas, November
2002.

[85] Texas Instruments. The TMS320C67x FastRTS library programmer's reference.
Technical Report SPRU100A, Texas Instruments, Dallas, Texas, October 2002.

[86] Texas Instruments. TMS320DM642 Video/Imaging fixed-point digital signal
processor data manual. Technical Report SPRS200B, Texas Instruments, Dallas, Texas,
July 2002.


[87] Texas Instruments. TMS320C6000 DSP cache user's guide. Technical Report
SPRU656A, Texas Instruments, Dallas, Texas, May 2003.

[88] Texas Instruments. TMS320C621x/C671x DSP two-level internal memory
reference guide. Technical Report SPRU609A, Texas Instruments, Dallas, Texas,
November 2003.

[89] Texas Instruments. TMS320C64x DSP video port/VCXO interpolated control
(VIC) port reference guide. Technical Report SPRU629, Texas Instruments, Dallas,
Texas, April 2003.

[90] Texas Instruments. TMS320C67x DSP library programmer's reference guide.
Technical Report SPRU657, Texas Instruments, Dallas, Texas, February 2003.

[91] Texas Instruments. TVP5150A data manual. Technical Report SLES087, Texas
Instruments, Dallas, Texas, September 2003.

[92] Texas Instruments. Migrating from TMS320C6211B/C6711/C6711B and C6713
to TMS320C6713B. Technical Report SPRA851G, Texas Instruments, Dallas, Texas,
March 2004.

[93] Texas Instruments. TMS320C6000 optimizing compiler user's guide. Technical
Report SPRU187L, Texas Instruments, Dallas, Texas, May 2004.
[94] Texas Instruments. The TMS320C6000 PLL controller reference guide. Technical
Report SPRU233, Texas Instruments, Dallas, Texas, March 2004.

[95] Texas Instruments. TMS320C64x DSP two-level internal memory reference guide.
Technical Report SPRU610A, Texas Instruments, Dallas, Texas, June 2004.

[96] Texas Instruments. TVP5146 data manual. Technical Report SLES084A, Texas
Instruments, Dallas, Texas, November 2004.

[97] Texas Instruments. TMS320C6416T fixed-point digital signal processor data
manual. Technical Report SPRS226H, Texas Instruments, Dallas, Texas, August 2005.
[98] Carlo L. M. Tiana, J. Richard Kerr, and Steven D. Harrah. Multispectral
uncooled infrared enhanced vision system for flight test. In Proceedings of SPIE,
volume 4363, April 2000.
[99] TruView. See http://www.truview.com.

[100] John Watkinson. The Art of Digital Video. Focal Press, 1990.

[101] Dick R. Wittink. The Application of Regression Analysis. Allyn and Bacon, 1988.

[102] George Wolberg. Digital Image Warping. IEEE Computer Society Press, 1990.


[103] Glenn A. Woodell, Daniel J. Jobson, Zia-ur Rahman, and Glenn D. Hines.
Enhanced images for checked and carry-on baggage and cargo screening. In Sensors,
and Command, Control, Communications and Intelligence (C3I) Technologies
for Homeland Security and Homeland Defense III, Proceedings of SPIE 5403, April
2004.

[104] Zhao Zhang and Xiaodong Zhang. Fast bit-reversals on uniprocessors and
shared-memory multiprocessors. SIAM Journal on Scientific Computing, 22(6):2113-2134,
2001.


VITA

Glenn Derrick Hines

Glenn Derrick Hines was born in Portsmouth, Virginia on August 25, 1964. He
graduated from I. C. Norcom High School, Portsmouth, Virginia, in 1982. He received his
Bachelor of Science degree in Electrical Engineering from Old Dominion University in 1987,
his Master of Science degree in Electrical Engineering from Old Dominion University in
1991, and his Master of Science degree in Computer Science from The College of William
and Mary in 2002. Glenn defended his dissertation in January 2006 and will graduate with
a Doctor of Philosophy degree in Computer Science from The College of William and Mary
in May 2006. Glenn is employed as a senior electronics engineer and computer scientist at
NASA Langley Research Center in Hampton, Virginia. He performs research in the area
of image processing and is responsible for the development of aviation, spaceflight, and
atmospheric research instruments. He is married to the former Sunita Etwaroo and has three
children.


