Implementation of an error-robust bucket-method algorithm for elaboration of white light interferometry data on GPGPUs by Schneider, Max & Fey, Dietmar
Proceedings of the 55. IWK, Internationales Wissenschaftliches Kolloquium (International Scientific Colloquium)
13 - 17 September 2010
Crossing Borders within the ABC: Automation, Biomedical Engineering and Computer Science
Faculty of Computer Science and Automation, www.tu-ilmenau.de
Home / Index: http://www.db-thueringen.de/servlets/DocumentServlet?id=16739
ISBN: 978-3-938843-53-6 (USB-Flash Version) 
 
 
 
 
© Ilmenau University of Technology (Thür.) 2010 
 
IMPLEMENTATION OF AN ERROR-ROBUST BUCKET-METHOD ALGORITHM
FOR ELABORATION OF WHITE LIGHT INTERFEROMETRY DATA ON GPGPUS
Max Schneider, Dietmar Fey
Chair of Computer Science 3 (Computer Architecture)
Friedrich-Alexander-University Erlangen-Nuremberg
Martensstr. 3, 91058 Erlangen
ABSTRACT
3D surface analysis of objects is one of the most challenging tasks in optical metrology with respect to the required computing power. White light interferometry in particular demands a sophisticated evaluation because of the high data volume produced during the measurement. In principle, the scanning process can be accelerated by current cameras. We show that, if the scanning process is integrated into manufacturing processes, even current multi-core technology cannot process the incoming data satisfactorily. However, by assigning computationally heavy tasks to co-processor devices such as GPGPUs (General-Purpose Graphics Processing Units), a rapid evaluation can be realized. We demonstrate that, for the so-called five-bucket preprocessing method, about 2 GB of measured data can be processed in the millisecond range using NVIDIA's CUDA (Compute Unified Device Architecture) technology. Owing to the efficient utilization of CUDA-capable GPU devices, a large speed-up over a parallel implementation with OpenMP and SSE is achieved.
Index Terms: White light interferometry, Five-Bucket-Algorithm, GPGPUs, CUDA, OpenMP, SSE
1. INTRODUCTION
The scanning device used in white light interferometry is usually an adapted Michelson interferometer equipped with a broadband light source and a video camera or digital imaging system (CCD array) for capturing interferometry images (see figure 1) [1]. The white light beam is split into two paths. One path is directed onto the reference mirror, the other onto a translation arm that carries the object to be measured and moves automatically in the z-direction. If the optical path length difference of the two beams is less than the coherence length, interference arises [2]. For a complete surface inspection, the measurement process must cover the whole interference range. Sampling frames are taken at discrete z-positions, so that for each pixel the autocorrelation function of the original beam is obtained [3]. The resulting signal, shown in figure 2, is the so-called interferogram or correlogram.
Fig. 1. Michelson Interferometer.
Fig. 2. A synthetic interferogram signal of a pixel.
A discrete height value is assigned to each frame captured in the z-direction. Knowing the scale from the transmission ratio of the translation arm, the height of each pixel can be determined [1]. For that, the center of interference in each pixel's interferogram has to be computed. Since interference can be observed only within a certain number of frames around the interference center, a data reduction can be applied in a preprocessing step. This is necessary because a complete measurement process can span thousands of pictures with XGA (1024 × 768) resolution and above, in 8-bit or 16-bit color depth. This results in a data volume of 750 MB (XGA/8-bit/1000 pictures) or 1.5 GB (XGA/16-bit/1000 pictures) of mostly irrelevant data. An up-to-date computer system can hold this amount of data, but if the measurement process needs several thousand images, such systems would also run out of memory. Furthermore, the calculation of the height map is a computationally intensive process, so the processing of irrelevant data should be avoided.
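The data volumes quoted above follow directly from the frame geometry. The following is a minimal sketch of that arithmetic; the helper name `raw_data_volume_bytes` is ours and not part of the original work, and sizes are assumed to be counted in binary megabytes (1 MB = 1024 × 1024 bytes), which is the convention that reproduces the 750 MB figure.

```python
MIB = 1024 ** 2  # binary megabyte, assumed unit for the figures in the text


def raw_data_volume_bytes(width, height, bits_per_pixel, frames):
    """Size in bytes of an uncompressed stack of `frames` images."""
    bytes_per_pixel = bits_per_pixel // 8
    return width * height * bytes_per_pixel * frames


# XGA stack of 1000 frames, 8-bit and 16-bit color depth
xga_8bit_mb = raw_data_volume_bytes(1024, 768, 8, 1000) / MIB
xga_16bit_mb = raw_data_volume_bytes(1024, 768, 16, 1000) / MIB
print(xga_8bit_mb, xga_16bit_mb)
```

With several thousand frames the same function shows how quickly the raw stack grows beyond the main memory of a typical workstation, which is the motivation for the data reduction described above.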
2. GENERAL-PURPOSE GRAPHICS
PROCESSING UNITS AND CUDA
Since their introduction, GPUs have been consistently optimized for fast calculations, e.g. on 3D image data for computer games or video processing. GPUs have since evolved into very high performance, massively parallel architectures that are capable of executing many hundreds of threads in parallel to perform SIMD-like (Single Instruction Multiple Data) processing of incoming data. Furthermore, the processing pipelines of modern GPUs are programmable, allowing their use for general-purpose computations in applications beyond their intended purpose (e.g. scientific algorithms, economic simulations).
In 2007 NVIDIA introduced a new parallel computing architecture called CUDA [4]. It provides simple programming access to device resources through C/C++ libraries and hides the complex graphics-related interface. CUDA-capable devices possess a number of SIMT (Single Instruction Multiple Thread) multiprocessors. SIMT is a concept analogous to SIMD for realizing data-parallel computations, but, in contrast to conventional vector processors, CUDA-device multiprocessors consist of a set of scalar processors. When a data-parallel program is executed, a number of threads, e.g. one thread per data element, is instantiated and each thread is assigned to a scalar processor, executing the same program on different data in a SIMD fashion. Since thread assignment is nondeterministic, non-adjacent data can be assigned to scalar processors of the same multiprocessor. Because GPUs are implemented as an aggregation of several multiprocessors, each consisting of multiple scalar processing elements, a corresponding memory hierarchy is used. The so-called global or device memory is accessible by all processors of the device. In addition, each multiprocessor has its own shared memory and read-only caches for constants and textures, which are accessible from all scalar processors within that multiprocessor [5].
In contrast to thread scheduling at operating system level, thread scheduling on a GPU requires a much shorter switching time. Allocated threads are grouped into thread blocks and mapped onto SIMT multiprocessors. Thread blocks are further divided into pieces called warps, which are set active and executed piecewise on the scalar processors of the corresponding multiprocessor. If a thread of an active warp waits for data during its execution, it is suspended while the memory management unit (MMU) fetches the data from global memory, and another thread is scheduled onto that processor. In this way, data transfer latencies can be completely hidden [5]. Threads of the same block communicate with each other through the fast shared memory. Communication between threads of different blocks takes place through global memory, which is much slower to access than the shared memory inside a multiprocessor.
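To make the thread hierarchy concrete, the sketch below computes how many thread blocks and warps result when one thread is created per data element, as described above. It is only an illustration: the function name `launch_geometry` and the block size of 256 threads are our assumptions, not values taken from the original implementation; the warp size of 32 threads is the value used by CUDA devices.

```python
WARP_SIZE = 32  # threads per warp on CUDA devices


def launch_geometry(width, height, threads_per_block):
    """Return (blocks, warps_per_block) for one thread per pixel."""
    pixels = width * height
    # Round up so that a partially filled last block still covers all pixels.
    blocks = (pixels + threads_per_block - 1) // threads_per_block
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return blocks, warps_per_block


# Hypothetical configuration: XGA frame, 256 threads per block
blocks, warps_per_block = launch_geometry(1024, 768, 256)
print(blocks, warps_per_block)
```

Having several warps per block is what gives the scheduler something to switch to while one warp waits for global memory.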
Additional performance is attainable through coalescing of accesses to global memory. If neighboring threads of a multiprocessor request data located consecutively in global memory, those requests are packed and executed together, resulting in up to 16 times higher transfer performance. The base address must, however, satisfy the alignment requirements for coalesced transfers (it must be a multiple of 16 times the size of the requested value type) [6].
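The alignment rule stated above can be checked with a few lines of arithmetic. This is a simplified sketch of the rule exactly as given in the text (base address a multiple of 16 times the element size); the names `is_aligned_for_coalescing` and `padded_row_bytes` are hypothetical helpers, and real devices apply further conditions that are not modeled here.

```python
THREADS_PER_REQUEST = 16  # group of neighboring threads, as stated in the text


def is_aligned_for_coalescing(base_address, element_size):
    """True if base_address is a multiple of 16 * element_size (bytes)."""
    return base_address % (THREADS_PER_REQUEST * element_size) == 0


def padded_row_bytes(width, element_size):
    """Row length in bytes, rounded up so every row starts aligned."""
    segment = THREADS_PER_REQUEST * element_size
    return ((width * element_size + segment - 1) // segment) * segment


# 4-byte values: addresses must be multiples of 64 bytes
print(is_aligned_for_coalescing(128, 4), is_aligned_for_coalescing(68, 4))
```

Padding each image row to a multiple of the segment size is one common way to keep every row start aligned when the image width is not already a suitable multiple.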
3. ERROR-ROBUST FIVE-BUCKET-METHOD
To cover the whole interference range needed for a complete surface inspection, hundreds or thousands of images have to be captured, each requiring a vertical movement step of the translation arm. To meet strict timing constraints, the amount of data should be reduced where possible. This can be achieved by identifying the fragment of each pixel's correlogram that is relevant for the height map calculation (the primary height map) [7]. This primary height map corresponds to the correlogram interval in which the optical path length difference of the two reflected beams is shorter than the coherence length of the light used. It is obtained by calculating the center of the interference pattern in each pixel's correlogram and extracting the intensities in a specified range around that center. Furthermore, incoming data has to be processed concurrently with the scanning of the object in order to obtain results as soon as possible.
The calculation of height maps for an object consists of a preprocessing step, the extraction of the relevant correlogram part, and a postprocessing step, in which the actual height is obtained by denoising and correcting the determined correlogram segments. Depending on the surface characteristics and the signal-to-noise ratio of the generated signals, different approaches for calculating primary height maps have to be considered [1, 7]. Since the scanning of rough surfaces is naturally error-prone, an error-robust algorithm is required to determine the correct fragments of the resulting correlogram signals [8, 9].
An elementary way to determine the center of a correlogram is a simple maximum search [10], which requires only a pixel-wise comparison. Unfortunately, this simple approach can lead to misclassified correlogram intervals, which even the subsequent postprocessing step cannot compensate for. Therefore, error-robust procedures such as the five-bucket method are required [7, 9]. This algorithm computes so-called bucket values by applying equations (1) to (4) to the intensities of each pixel's correlogram. When a new bucket is calculated, it is compared to the current maximum bucket. If the new value is greater, the maximum bucket and the value of the center of interference (set to frame index z) are updated.
$$DF_{02}(z) = I(z + 0) - I(z + 2) \tag{1}$$
$$DF_{13}(z) = I(z + 1) - I(z + 3) \tag{2}$$
$$DF_{24}(z) = I(z + 2) - I(z + 4) \tag{3}$$
$$B(z) = \frac{1}{2} \sqrt{DF_{13}(z)^2 - DF_{02}(z) \cdot DF_{24}(z)} \tag{4}$$
where $z$ is the number of the current translation step, $B(z)$ the bucket value of the current frame, and $I(z+k)$ with $k \in \{0 \ldots 4\}$ the pixel's intensity in the corresponding frame. Based on the calculation method in equation (4), $B(z_i)$ can be computed by reusing the contrast values $DF_{13}(z_{i-1})$ and $DF_{24}(z_{i-1})$ of $B(z_{i-1})$ as the values for $DF_{02}(z_i)$ and $DF_{13}(z_i)$, respectively, leading to equations (5) to (8).
$$DF_{02}(z_i) = DF_{13}(z_{i-1}) \tag{5}$$
$$DF_{13}(z_i) = DF_{24}(z_{i-1}) \tag{6}$$
$$DF_{24}(z_i) = I(z_i + 2) - I(z_i + 4) \tag{7}$$
$$B(z_i) = \frac{1}{2} \sqrt{DF_{24}(z_{i-1})^2 - DF_{13}(z_{i-1}) \cdot DF_{24}(z_i)} \tag{8}$$
with $i \in \{1 \ldots \mathit{levels} - 1\}$. Therefore, in each subsequent step only the contrast value $DF_{24}(z_i)$ has to be calculated, which reduces the complexity of the maxima calculation. Since the difference operations have a high-pass characteristic, the five-bucket approach shows increased error-robustness [1]. As the equations are written, the images of all translation steps are captured first and each pixel's correlogram is then processed one after another.
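The recurrence in equations (5) to (8) can be illustrated for a single pixel as follows. This is a minimal sketch, not the authors' implementation: the function names are ours, and clamping a negative radicand to zero is our assumption for noisy data, since the text does not say how that case is handled. `bucket_direct` evaluates equations (1) to (4) from scratch, while `five_bucket_center` reuses the two previous contrast values as in equations (5) and (6).

```python
import math


def bucket_direct(intensities, z):
    """Bucket value B(z) computed directly from equations (1) to (4)."""
    df02 = intensities[z] - intensities[z + 2]
    df13 = intensities[z + 1] - intensities[z + 3]
    df24 = intensities[z + 2] - intensities[z + 4]
    # Assumption: clamp negative radicands (possible with noisy data) to zero.
    return 0.5 * math.sqrt(max(df13 * df13 - df02 * df24, 0.0))


def five_bucket_center(intensities):
    """Return (z, B(z)) of the largest bucket using the recurrence (5)-(8)."""
    if len(intensities) < 5:
        raise ValueError("at least five frames are required")
    df02 = intensities[0] - intensities[2]
    df13 = intensities[1] - intensities[3]
    best_z, best_b = 0, -1.0
    for z in range(len(intensities) - 4):
        # Only DF24 is new in each step, equation (7).
        df24 = intensities[z + 2] - intensities[z + 4]
        b = 0.5 * math.sqrt(max(df13 * df13 - df02 * df24, 0.0))
        if b > best_b:
            best_z, best_b = z, b
        # Shift the contrast values for the next step, equations (5) and (6).
        df02, df13 = df13, df24
    return best_z, best_b


# Synthetic correlogram: Gaussian envelope centered at frame 50 (assumed data)
signal = [100 + 50 * math.exp(-((z - 50) / 10) ** 2)
          * math.cos(math.pi / 2 * (z - 50)) for z in range(100)]
center_z, max_bucket = five_bucket_center(signal)
```

Because the window starting at frame z spans frames z to z + 4, the reported index refers to the first frame of the window; the envelope maximum lies roughly two frames later.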
The only dependency in the above equations exists between corresponding pixels of successive images. Thus, calculating each pixel's relevant correlogram interval in parallel is a key option for gaining performance and drastically reducing run times of the preprocessing phase. In the following, two parallel implementations of the five-bucket algorithm are presented. The first one combines OpenMP and SSE (Intel's SIMD solution) and was tested on a Nehalem quad-core i7 (920) 2.66 GHz system with 12 GB DDR3 RAM. The second version was built upon NVIDIA's CUDA framework and tested on a Tesla C1060 with 4 GB GDDR3 RAM and a GeForce GTX 480 (Fermi) with 1.5 GB GDDR5 RAM. Although parallelization with OpenMP and SSE gives a speed-up of several times over the sequential approach, a much higher performance boost can be achieved by exploiting the features of GPGPUs (separate memory, higher memory transfer rates, several hundred scalar processors).
3.1. Using OpenMP and SSE Techniques for Optimization on Multi-Core Systems
OpenMP is an Application Program Interface (API) for multi-platform shared-memory parallel programming in C/C++ and Fortran on all conventional architectures. By applying OpenMP pragmas to specific code sections, e.g. loops, sequential programs can be parallelized automatically for the underlying multi-core architecture without a large programming effort [11].
To gain any performance from OpenMP, frame-wise rather than pixel-wise processing is preferable, since the measurement data of all translation stages is stored frame by frame in main memory, i.e. the intensities belonging to one pixel are not stored contiguously. Pixel-wise processing therefore accesses non-contiguous main memory addresses and causes a high cache miss count. In frame-wise processing, by contrast, each image is processed completely before the intensities of the next image are touched. First, five translation steps are executed and the corresponding intensity data is stored in system memory. These values are used in an initialization step, in which the first bucket is calculated and the center of interference value is initialized for each pixel. While these values are computed, the next frame is captured. After the initialization procedure has finished for every pixel, the stored data of levels two to six is used to calculate the second bucket and to update the maximum bucket and height values. During this step, the newly captured intensity image is stored in the memory space of level one, whose data is no longer needed. This procedure is repeated until all images are processed and the primary height maps are generated. By reusing the allocated memory, space for only six successive levels is needed, in contrast to pixel-wise processing, which requires all frames to be loaded completely.
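The frame-wise scheme described above can be sketched with a rolling window over the incoming frames. This is an illustration under simplifying assumptions, not the original code: frames are flat Python lists, the function name `framewise_centers` is ours, a `deque` of five frames stands in for the reused memory slots (the sixth slot in the text holds the frame being captured concurrently), the contrast values are recomputed rather than carried over, and negative radicands are clamped to zero.

```python
import math
from collections import deque


def framewise_centers(frames):
    """Process frames in capture order; return (center, best) per pixel.

    center[p] is the index of the first frame of the window with the
    largest bucket value for pixel p, best[p] is that bucket value.
    """
    window = deque(maxlen=5)  # the oldest frame is dropped automatically
    best, center = [], []
    z = 0  # index of the first frame in the current window
    for frame in frames:
        window.append(frame)
        if len(window) < 5:
            continue  # initialization needs five captured frames
        if not best:
            best = [-1.0] * len(frame)
            center = [0] * len(frame)
        f0, f1, f2, f3, f4 = window
        # Independent per-pixel loop: the part a parallel-for would split.
        for p in range(len(frame)):
            df02 = f0[p] - f2[p]
            df13 = f1[p] - f3[p]
            df24 = f2[p] - f4[p]
            b = 0.5 * math.sqrt(max(df13 * df13 - df02 * df24, 0.0))
            if b > best[p]:
                best[p], center[p] = b, z
        z += 1
    return center, best
```

Each frame is read front to back exactly once per window position, so memory accesses stay contiguous, and at no point are more than five processed frames held in memory.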
Figure 3a shows the run times achieved on the Nehalem quad-core system for the OpenMP implementation of the center of interference computation. Because each pixel's interferogram is processed independently, performance improves as more threads are used. The speed-ups shown in figure 3b confirm this: up to four threads, a nearly linear speed-up is gained. With HyperThreading enabled, however, similar but not better results were achieved, which can be explained by the overhead of switching between threads on the same processor core.
Cameras used for white light interferometry usually operate in 8-bit or 16-bit color depth. During the maximum analysis procedure, floating-point intermediate results in single or even double precision are required. Hence, intensity values must be converted. For the 64-bit Core i7 architecture this has no significant effect on the performance of the OpenMP implementation, because all scalar operations are executed internally on 64-bit wide registers. Thus, similar run times and speed-ups were achieved regardless of the color depth and the precision of the computations. With SSE support, however, this conversion has an enormous impact on both processing time and programming effort.
(a) Runtimes of the OpenMP implementation. [Plot: time (s), axis ticks doubling from 0.125 to 64, over 1 to 8 threads, for the resolutions 320x240, 640x480, 800x600, 1024x768 and 1024x1024.]
(b) Average speed-ups of the OpenMP implementation. [Plot: speed-up over 2 to 8 threads; bar labels in axis order: 1.99x, 2.96x, 3.84x, 2.56x, 3.00x, 3.53x, 3.91x.]
Fig. 3. Runtimes and corresponding speed-ups of the OpenMP implementation in comparison to sequential implementation (1000 levels / single precision).
Since the introduction of the Multi Media Extension (MMX) by Intel in 1997 [12] and of its successor, the Streaming SIMD Extensions (SSE), in 1999 [13], almost all CPU vendors have included SIMD instruction extensions in their processor architectures to accelerate data-parallel processing [14]. Intel's extension of the x86 architecture with the SSE instruction set added eight directly addressable 128-bit SIMD general-purpose registers and new instructions that work on packed floating-point and integer data. The current implementation (SSE4) provides 16-way, 8-way, 4-way and 2-way integer and 4-way or 2-way floating-point vector instructions, so that multiple data elements are processed in a single cycle, increasing the performance of data-parallel applications several times over that of their sequential counterparts [15].
By exploiting SSE techniques, the five-bucket method can gain an additional performance boost. However, the data type conversion needed during primary height map calculation has negative effects. For vector elements to be processed properly, SIMD instructions require all vector values used in the same operation to contain the same number of scalar units [15]. If the complete analysis procedure could be accomplished without any floating-point values, 16 or 8 pixels would be processed in parallel in each run of the inner loop, yielding a huge increase in achievable performance. This is not possible, however, because even without the square root, storing the intermediate results requires at least 32-bit integer values. Thus, the 16 or 8 elements of a vector (depending on the color depth used) have to be unpacked into new vector values, each holding only four or two elements (depending on the required precision), which are then processed one after another.
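The cost of this unpacking can be estimated from the register width alone. The sketch below only counts lanes in a 128-bit register; the helper names `lanes` and `unpacked_vectors` are ours, and the count ignores the individual instructions needed for each widening step.

```python
REGISTER_BITS = 128  # width of an SSE register


def lanes(element_bits):
    """Number of elements of the given width that fit into one register."""
    return REGISTER_BITS // element_bits


def unpacked_vectors(intensity_bits, compute_bits):
    """Vectors of compute-width elements needed to hold one loaded register."""
    return lanes(intensity_bits) // lanes(compute_bits)


for intensity_bits in (8, 16):
    for compute_bits in (32, 64):
        print(intensity_bits, compute_bits,
              unpacked_vectors(intensity_bits, compute_bits))
```

One register of 8-bit intensities thus expands into four single precision vectors but eight double precision vectors, so in double precision each loaded register costs twice as many unpacked vectors, each carrying only two pixels.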
(a) Runtimes of the OpenMP implementation with enabled SSE-processing. [Plot: time (s), axis from 0 to 4.5, over 1 to 8 threads, for the resolutions 320x240, 640x480, 800x600, 1024x768 and 1024x1024.]
(b) Average speed-ups of the OpenMP implementation with enabled SSE-processing. [Plot: speed-up over 1 to 8 threads; bar labels in axis order: 2.46x, 4.41x, 6.25x, 6.20x, 5.77x, 6.02x, 6.01x, 5.95x.]
Fig. 4. Runtimes and corresponding speed-ups of the OpenMP implementation with SSE-support in comparison to sequential implementation (1000 levels / single precision).
The measured run times in figure 4a show that, despite the additional effort of unpacking vector elements into new vectors, a further performance increase over the pure OpenMP solution without SSE is feasible. In single precision mode, as few as three OpenMP threads with SSE enabled process data faster than the memory bus can provide it (see figure 4b). Hence, adding further threads does not give any significant speed-up. Vectorized computations in double precision, on the other hand, are not worthwhile, because the unpacking process (e.g. with 8-bit intensity data, eight extraction steps have to be executed) and the subsequent two-way processing of pixel data need more time than a sequential processing of the same amount of data. Thus, if double precision is not strictly needed, the combination of OpenMP with single precision SIMD operations should be preferred; otherwise only the OpenMP pragmas should be applied.
3.2. Accelerating by GPGPU utilization
The scanning process can span thousands of translation steps with images in XGA resolution and above. Given the minimal run times achieved by the multi-threaded and vectorized implementation (1.93 seconds for 1000 images with megapixel resolution and 1.33 seconds for images with XGA resolution), the computation of the interference center alone would require several seconds. The postprocessing of the primary height maps is even more complex, so the whole surface analysis process is not feasible in real time. Furthermore, although the center of interference of each pixel's correlogram is determined by a simple calculation, a huge data volume has to be processed and many intermediate results have to be stored. Because GPUs possess their own memory, with even higher bandwidth than conventional memory, and provide higher computing power than conventional CPUs, it is reasonable to assign the task of computing each pixel's interference center to GPUs. This relocation frees host system resources and thereby improves the execution performance of other tasks.
To obtain results as soon as possible, incoming picture data has to be processed in parallel with the actual scanning process. Because functions executed on GPUs (kernels) are called from the CPU program but run asynchronously to other function calls, routine execution on GPUs and CPUs can overlap. Thus, the required processing of incoming data in parallel with the scanning process is achievable. Before the data can be processed, however, it must first be transferred from system memory to GPU memory, because GPUs have no direct access to main memory. Since transfers of small data packages do not utilize the PCIe bus efficiently, the data of a predefined number of images is first stored linearly in host memory. While the next translation step is performed, this data is transferred to GPU memory in one piece. As soon as the data transfer is complete, the bucket algorithm is applied to that data. A further obvious advantage of separate memory spaces is that all intermediate results (which are no longer needed after the preprocessing step) are also stored in GPU memory. Therefore, the host system performance is not affected.
Compared with the OpenMP implementation discussed above, the processing scheme on GPGPUs has to be reconsidered in light of the architecture of NVIDIA's Tesla C1060. The Tesla devices provide read-only constant and texture caches that allow fast access to measurement data. However, because intermediate values are indispensable, write access is also required. Without adjustments, program execution would result in high memory traffic caused by data transfers between the registers of the scalar processors and global memory. Because each pixel's interferogram is processed independently of those of other pixels, the simplest and, to the best of our knowledge, also the most efficient way is to instantiate a separate thread for each pixel.
Threads grouped in the same thread block have access to a register set of 16 K registers (32 bits each) [5]. By choosing the thread block size according to this constraint and holding the needed values in these registers during computation, all intermediate data can be kept in registers without being spilled. Thus, the number of accesses to global memory is reduced. Furthermore, because picture data is stored linearly, assigning horizontally adjacent pixels to threads with successive IDs yields coalesced data access.
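The mapping of threads to pixels described above can be sketched as plain index arithmetic. The function names and the frame-after-frame, row-major layout are our assumptions for illustration; the text only states that picture data is stored linearly and that horizontally adjacent pixels go to threads with successive IDs.

```python
def global_thread_id(block_idx, thread_idx, block_dim):
    """Linear thread ID across all blocks (one-dimensional launch assumed)."""
    return block_idx * block_dim + thread_idx


def pixel_of_thread(thread_id, width):
    """Pixel (x, y) handled by a thread: successive IDs walk along a row."""
    return thread_id % width, thread_id // width


def element_index(x, y, z, width, height):
    """Index of pixel (x, y) of frame z in a linearly stored image stack."""
    return (z * height + y) * width + x


# Element indices read by 16 neighboring threads for frame z = 5
width, height, block_dim = 1024, 768, 256
indices = [
    element_index(*pixel_of_thread(global_thread_id(3, t, block_dim), width),
                  5, width, height)
    for t in range(16)
]
print(indices)
```

Neighboring threads read consecutive elements of the same frame, which is the access pattern that can be coalesced; a thread's own accesses to successive frames are one full frame apart, which is why each thread keeps its running contrast values in registers.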
By using GPGPUs as co-processor units and offloading the described preprocessing method from conventional CPUs, an enormous speed-up over the sequential and even the multi-threaded and vectorized algorithms was achieved (see figure 5a).
(a) Speed-up comparison between single precision OpenMP implementation with enabled SSE-processing to CUDA version. [Plot: speed-up over OpenMP threads / GPGPU device; bar labels in axis order for 1 to 8 threads: 2.46x, 4.41x, 6.25x, 6.20x, 5.77x, 6.02x, 6.01x, 5.95x; Tesla: 169.18x; Fermi: 254.00x.]
(b) Speed-up comparison between double precision OpenMP implementation to CUDA version. [Plot: speed-up over OpenMP threads / GPGPU device; bar labels in axis order for 2 to 8 threads: 1.98x, 2.95x, 3.82x, 2.56x, 3.00x, 3.55x, 3.96x; Tesla: 18.51x; Fermi: 43.81x.]
Fig. 5. Achieved average speed-ups through utilization of GPGPUs (Tesla C1060/GeForce GTX 480) compared to those of OpenMP implementations (1000 levels).
Although Tesla C1060 devices are specialized for single precision operation and offer a peak double precision performance of less than one tenth of the peak single precision performance [4], using them for double precision procedures is also worthwhile because of the marginal programming effort required by the CUDA framework. As shown in figure 5b, using the Tesla C1060 yielded an almost five times higher speed-up than the best performing double precision OpenMP implementation.
With the new graphics chip generation "Fermi" (e.g. GeForce GTX 480), NVIDIA introduced cached global memory accesses, which allow better use of communication paradigms. Fermi devices also offer additional scalar processors and higher performance for both single and double precision computations [4]. As shown in figure 5, the overall performance of the five-bucket algorithm increased considerably on the GeForce GTX 480. For example, processing 1000 images, each with a resolution of 1024 × 1024 and 16-bit color depth (ca. 2 GB of measured data), takes about 0.038 seconds in single precision mode (0.065 seconds on the Tesla C1060). The CUDA implementation benefits from the new features offered by Fermi GPUs even though no further program adjustments were made and no recompilation was performed.
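The reported run times can be put into perspective with a short calculation. The sketch below derives an effective data throughput from the figures quoted above and relates the GPU speed-ups to the best OpenMP/SSE speed-up of 6.25x reported in figure 5a. The function names are ours, and the throughput is only an estimate: the text does not state whether the run times include host-to-device transfers.

```python
def effective_throughput_gb_s(width, height, bits_per_pixel, frames, seconds):
    """Measured data volume divided by run time, in GB/s (1 GB = 1e9 bytes)."""
    volume_bytes = width * height * (bits_per_pixel // 8) * frames
    return volume_bytes / seconds / 1e9


def gain_over_best_cpu(gpu_speedup, best_cpu_speedup=6.25):
    """Ratio of a GPU speed-up to the best OpenMP/SSE speed-up (figure 5a)."""
    return gpu_speedup / best_cpu_speedup


# 1000 frames of 1024 x 1024 pixels with 16-bit color depth
gtx480_gb_s = effective_throughput_gb_s(1024, 1024, 16, 1000, 0.038)
c1060_gb_s = effective_throughput_gb_s(1024, 1024, 16, 1000, 0.065)
print(round(gtx480_gb_s, 1), round(c1060_gb_s, 1))
print(round(gain_over_best_cpu(254.00), 2), round(gain_over_best_cpu(169.18), 2))
```

Under these assumptions, the single precision CUDA version is roughly 27 times (Tesla C1060) and 41 times (GeForce GTX 480) faster than the best multi-threaded, vectorized CPU configuration.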
4. CONCLUSION AND FUTURE WORK
In this work, we presented two approaches that enable efficient execution of the data-parallel five-bucket preprocessing algorithm of the white light interferometry process. On the one hand, it was shown that by using multi-core systems the analysis phase can achieve a speed-up of several times over a conventional implementation. On the other hand, system resources needed for other tasks (in particular main memory) are tied up, reducing the overall system performance. In the second implementation, based on NVIDIA's CUDA framework, the features provided by graphics devices were exploited to relax the system demands of the five-bucket method. The performance attained by the CUDA implementation reveals the high potential of offloading computational tasks in industrial image processing to co-processor devices such as GPGPUs.
In the future, a multi-GPU and a heterogeneous solution (cooperation of GPUs with conventional CPUs or other multi-core architectures) of the presented algorithm will be examined, providing further details needed for the efficient utilization of parallel architectures. The algorithms of the postprocessing step will also be implemented using NVIDIA's CUDA framework, allowing the complete height map calculation process to be encapsulated within GPUs.
5. REFERENCES
[1] M. Hißmann, Bayesian Estimation for White Light Interferometry, Ph.D. thesis, Combined Faculties for the Natural Sciences and for Mathematics of the Ruperto-Carola University of Heidelberg, Germany, 2005.
[2] E. Hering and R. Martin, Photonik - Grundlagen, Technologie und Anwendung, Springer Berlin Heidelberg, 2006.
[3] D. Kapusi and T. Machleidt, “White Light Interferometry in Combination with a Nanopositioning- and Nanomeasuring Machine (NPMM),” Proceedings of the International Society for Optical Engineering, June 2007.
[4] NVIDIA, “CUDA ZONE,” online.
[5] NVIDIA, “NVIDIA CUDA C Programming Guide 3.1.1,” Tech. Rep., NVIDIA, 2010.
[6] A. Cevahir, A. Nukada, and S. Matsuoka, “Fast Conjugate Gradients with Multiple GPUs,” in Computational Science – ICCS 2009, Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2009.
[7] W. Sang, “Entwicklung und Implementierung eines Verfahrens zur Auswertung von Weißlichtinterferogrammen zur Bestimmung der dreidimensionalen Oberflächentopographie von Mikro- und Nanostrukturen als Anwendung für eine Nanopositionier- und Nanomessmaschine,” M.S. thesis, Technische Universität Ilmenau, Fakultät für Elektrotechnik und Informationstechnik, 2006.
[8] D. W. Robinson, Interferogram Analysis: Digital Fringe Pattern Measurement Techniques, Institute of Physics Publishing, 1993.
[9] K. G. Larkin, Topics in Multi-dimensional Signal Demodulation, Ph.D. thesis, The Faculty of Science in the University of Sydney, 2000.
[10] Z. Sarac, R. Groß, C. Richter, B. Wiesner, and G. Häusler, “Optimization of White Light Interferometry on rough Surfaces based on Error Analysis,” Optik - International Journal for Light and Electron Optics, vol. 121, pp. 351–357, 2010.
[11] Barbara Chapman, Gabriele Jost, and Ruud van der Pas, Using OpenMP - Portable Shared Memory Parallel Programming, The MIT Press, 2007.
[12] A. O. Ramirez, “An Overview of Intel’s MMX Technology,” LINUX JOURNAL, vol. 61, 1999.
[13] M. Srivastav, “Vectorization: Writing C/C++ Code in VECTOR Format,” Tech. Rep., Intel Software Network, 2009.
[14] J. Stewart, “An Investigation of SIMD Instruction Sets,” Research Project on the University of Ballarat School of Information Technology and Mathematical Sciences, November 2005.
[15] Intel Corporation, Intel 64 and IA-32 Architectures Optimization Reference Manual, 2009.
