High-Performance Deformable Image Registration for Medical Applications by Benton, R. Andrew II
High-Performance Deformable Image Registration for Medical
Applications
A Thesis
Submitted to the Faculty
of
Drexel University
by
R. Andrew Benton II
in partial fulfillment of the
requirements for the degree
of
Master of Science in Computer Engineering
June 2015
c© Copyright 2015
R. Andrew Benton II. All Rights Reserved.
TABLE OF CONTENTS
List of Figures iv
Abstract v
1 Introduction 1
2 Related Work 4
3 Thin Plate Spline Algorithm 6
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2 Coefficient Matrix Construction . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Thin-Plate Spline Transform . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4 Landmark and Control Point Registration 10
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.2 Marching Cubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.3 Triangle Decimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.4 Selective Control Point Activation . . . . . . . . . . . . . . . . . . . . . . . 13
5 Implementation 14
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.2 CPU Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
TABLE OF CONTENTS iii
5.3 MIC Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5.4 Fast Math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.5 Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.6 Cilk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6 Results 21
7 Conclusions 27
8 Future Work 29
List of References 31
Vita 32
LIST OF FIGURES
3.1 2D Radial Basis Function (RBF) for comparison. Coefficients in the matrix
and offset weights in the thin-plate spline warp are calculated using the 3D
radial basis function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
6.1 Showing the runtime for volumes of different volume sizes and architectures
and the corresponding VCPT. . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.2 Different views of the right lung from the body cavity. Control points are
marked in yellow as the current points, and red as the target points. The
majority of the movement comes from the diaphram. . . . . . . . . . . . . . 23
6.3 Display of original, warped and target lung slices. Note that in the warped
slice (b), the diaphragm has been adjusted to match the target (c). . . . . . 24
6.4 Difference in an image slice of the body cavity before and after the thin-plate
spline warp. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.5 Effect of vectorization and OpenMP on thin-plate spline warp. . . . . . . . 25
6.6 Comparison between time and error when switching between math.h’s logf
function, and the fastlog function. . . . . . . . . . . . . . . . . . . . . . . 26
ABSTRACT
High-Performance Deformable Image Registration for Medical Applications
R. Andrew Benton II
James Anthony Shackleford, Ph.D
Spline based deformation methods are common in medical imaging due to their robust-
ness and accuracy. These methods require a significant amount of computation time in
order to obtain a sufficiently accurate spatial deformation between two or more image vol-
umes. This work focuses on comparing the thin-plate spline registration technique against
the bspline on a variety of hardware architectures. The architectures in question include the
common i5/i7 architecture with single and multiple threads, nVidia’s CUDA architecture
on the GTX760, and Intel’s Xeon Phi architecture.

CHAPTER 1: INTRODUCTION
Image registration, in general, focuses on aligning multiple images to one another based on a
vector field that defines a spatial deformation. In medical imaging, this can provide insight
into the patient’s state for physicians and technicians, in addition to providing valuable
medical information without the need of as much interference from human agents.
Image registration can aid before and during the treatment process. During the treat-
ment process, it can be used to map between previously obtained images and images that
are currently being obtained. Prior to treatment, this can also be used to map landmarks
and annotations from one image volume to another. This annotation is usually performed
by a trained technician in order to highlight points of interest including major organs and
tumors. Automating this process can aid in increasing clinical throughput, as each volume
will usually take a human two to three hours to complete.
Ideally, image registration would achieve a perfect match between the two images modal-
ities, providing a vector deformation map without errors. In reality, this will never happen,
as the material being imaged will experience a change in content between when the images
are captured. Instead, most registration techniques propose to create a solution and then
iteratively improve the proposed transform to minimize the degree of error. Overtime such
an iterative process can converge to the global minimum of error by making alterations in
the proposed control point parameters such that the regions in the area of each control
1. INTRODUCTION 2
point reaches a smaller error value.
A bspline deformation works by imposing a grid of control points over the volume to be
deformed, and then adjusting the coefficients at these grid points until the resultant spatial
deformation reaches the desirable range. The size of the grid may change, and increasing
the size of the grid causes the spatial transform to be more localized, and allows for more
rapid shifts. While the bspline transform is differentiable it is only twice differentiable. The
value for the spatial transform is determined by passing the coefficients into the bspline
function, often cubic, to determine the spatial shift at each voxel. The disadvantage to
using the bspline then is that the locality of the deformation is limited by the degree to
which the bspline can handle higher noise deformations.
An alternative method of spatial deformation is the thin plate spline. The thin plate
spline works by selecting pairs of points on a global level. These pairs of points are used on a
global level to calculate a deformation. Each point affects every point in the volume, albeit
to different degrees based on the distance from the point. At least d+1 points are required
when working with the thin plate spline, where d is the number of dimensions in the data.
The benefit of using the thin plate spline is that the resultant deformations are smooth and
infinitely differentiable. The downside to the thin plate spline is that the runtime of the
algorithm on the volume is linear with respect to the number of control points used and the
determination of the warping weights is calculated with the LU decomposition. As such,
control points must be chosen sparingly.
Error is typically measured using either mean-squared-error (MSE) or mutual-information
(MI). There is a trade off between these two error measuring techniques. The MSE method
is very fast and works very well with parallelization, however it only works well when com-
1. INTRODUCTION 3
paring image volumes that use the same modality, or use very similar modalities. The MI
technique is very well suited to matching image volumes across different modalities, but it
is harder to parallelize due to the MI technique requiring that a probability distribution
function (PDF) is generated which relates the probabilities of voxel values between the
image volumes.
Past works in the area of medical image registration have focused on the speed and
quality of the produced transforms. These transformations are computationally intensive,
and cannot be run in realtime except on the most powerful hardware. To this end, it
is desirable to find solutions that may be able to compute these transformations using
accelerator hardware to increase the rate at which the transformation can be obtained.
In Chapter 2, topics relating to this work are discussed, including the bspline, thin-plate
splines on the CUDA architecture, and the MIC architecture. Chapter 3 introduces theory
of the thin-plate spline algorithm. Chapter 4 discusses the use of landmarks and control
points in the thin-plate spline algorithm, including the way that the size of the problem
may be constrained by selectively activating and deactivating control points based on their
redundancy. In Chapter 5 the implementation of the thin-plate spline on the CPU and
MIC architecture is discussed, including the way in which the thin-plate benefits from the
heavy vectorization potential of the Xeon Phi. Chapter 6 presents and examines the results
of the thin-plate spline for accuracy given a set of points, as well as the run time of the
warp. Chapter 7 analyzes the results of Chapter 6 and presents suggestions and conclusions.
Finally, Chapter 8 suggests the direction of future work on the thin-plate spline algorithm,
proposing a global registration technique.
CHAPTER 2: RELATED WORK
Previous work with medical image registration has examined a wide range of transformations
and similarity metrics [1, 2]. The type of transform is generally dependent on the expected
perturbation of the image volumes. Global and local transforms both have their place in
the registration process, and have distinct advantages over the other. Similarity metrics are
also considered in these studies, as the error metric has a great deal of influence over the
improvement of the transform [3].
Spline based algorithms are very popular transform methods for medical image regis-
tration. They provide smooth, continuous, differentiable functions that map between image
volumes [2]. The bspline algorithm is a local deformable transform technique that provides
different levels of fitting based on the degree of the basis spline used. Bspline algorithms
have been shown to be well suited to high degrees of change, and can be adjusted by humans
to work at a global level by adjusting the coefficient grid [3].
Thin plate spline algorithms are another technique for applying spline functions to
achieve a transform effect using a series of control points [4]. These thin plate splines are
always global in nature, and are modelled after the physical properties of a bending plate
of metal. The weakness of these thing plate splines is that the transformation weights must
be calculated using a matrix inverse function, and even afterward, the thin plate spline has
a linear runtime with respect to the number of control points. To this effect, methods have
2. RELATED WORK 5
been developed to reduce the computational weight of the algorithm by approximating the
thin plate spline [5].
Error metrics for image volumes may include both the Mean Squared Error (MSE) and
the Mutual Information (MI) technique. While the MSE method is less computationally
intensive than the MI technique, the MI technique is able to provide more useful information
across multiple image modalities [3]. Error metrics using the MSE technique are much less
robust, and provide less feedback about the quality of the transform, as the metric is
insensitive to the location of errors and the distribution of errors.
Given the computational size of volumetric image registration, it is desirable not only
to find a good algorithm, but also to find a fast and efficient hardware platform. While
each of the techniques is able to run on a single core processor, this will not produce
competitive performance [6]. Toward this end, various hardware architectures have been
examined including distributed systems and Nvidia’s CUDA architecture [3, 7, 8]. An
additional hardware platform that has not been explored (that I have been able to find)
is Intel’s MIC (Many Integrated Core) architecture. This architecture uses a many core
heterogeneous architecture, taking a middle road between Nvidia’s CUDA and a server level
Xeon processor [9, 10]. The architecture of the Intel Xeon Phi allows for each core to act
independently of all other cores, unlike with CUDA where stream processors are expected
to have a similar control flow. Additionally, the Xeon Phi is able to take advantage of
vector processing of 512 bit vectors to attain a high throughput should the transform be
susceptible for vectorization.
CHAPTER 3: THIN PLATE SPLINE ALGORITHM
3.1 Overview
The TPS algorithm used here is based off of a 2D version of the TPS algorithm. In 2D
versions of the TPS algorithm, three or more points are used to calculate a warping plane
that defines a spatial transform on a 2D surface. Here, this is altered such that a warping
solid is created, which requires four or more points. These points may be designated by a
user, or created algorithmically. After the TPS algorithm runs, a set of weights in the x,
y, and z direction are created. These weights are then used to compute the deformation of
the volume. The number of weights is based off of the number of points given to the TPS
algorithm. As such, adding frivolous points will hurt the performance of the algorithm,
as the TPS matrix is solved with LU decomposition. A method to decimate these points
is discussed later. The TPS algorithm itself is unable to do more than create a spatial
deformation when given a set of control points. In order to create the control points, there
needs to be an algorithm that attempts to perform a rough mapping between the two
volumes.
There are a variety of simplifications that can be done to make the thin plate spline
faster. It is possible to sacrifice quality of the transform in order to gain a substantial
increase in the performance of the algorithm by calculating the magnitude of the weights
3. THIN PLATE SPLINE ALGORITHM 7
returned by the TPS algorithm, and then removing some of the least significant points.
Other methods exist that revolve around removing redundant control points in order to
reduce the runtime of the TPS transform. If control points are not reduced such that the
least significant points are suspended, then the runtime of the transform will massively
dominate the runtime of the whole workflow. The runtime of the TPS transform on a
volume is O(mn), where m is the number of control points, and n is the number of voxels
in the transformed image volume. Considering that at least 8 points are needed in order
to hold the volume in place, this number will quickly rise, necessitating a simplification
algorithm.
3.2 Coefficient Matrix Construction
The weights for the thin-plate spline algorithm are derived from a matrix solution. This
matrix solution models the distance between each point using a radial basis function. In
equation 3.1, this models the K submatrix. The submatrix element values at [i, j] can be
determined to be the RBF of the distance between control point i and control point j. As
such, the diagonal of K will be zeros. The values in P are the x, y, z, and a zero value
for each control point. Submatrix O is a 4x4 matrix filled with zeros. w is the resultant
weights that will be used for solving the transform. Vector v is the initial displacement of
the vectors from their original position. Finally, vector o is a 4x1 vector filled with zeros.
Given this linear system of equations, the weights for the thin-plate spline transform can
be determined by using LU-decomposition.
[
K P T
P O
] [
w
]
=
[
v
o
]
(3.1)
3. THIN PLATE SPLINE ALGORITHM 8
Figure 3.1: 2D Radial Basis Function (RBF) for comparison. Coefficients in the matrix and
offset weights in the thin-plate spline warp are calculated using the 3D radial basis function.
3.3 Thin-Plate Spline Transform
Given the weights that are extracted from 3.1, the resultant volume can be determined
by performing a voxel-wise operation. For each voxel, the offset from the current voxel is
determined as the static offset and linear offset from the original point. The offset if further
modified by scaling each weight by the value of the distance between the current voxel and
the landmark analogous to that TPS weight. Using this offset, which is a floating point
value in the x, y, and z direction, and adding it to the current position, the new voxel value
at that position is calculated as the interpolation of the target point. This interpolation
may either be a rounding, or any other type of interpolation. In the implementation here,
it is a trilinear interpolation. A simplified version of the volumetric thin-plate spline warp
can be seen in figure 3.1.
3. THIN PLATE SPLINE ALGORITHM 9
Listing 3.1: Single-threaded volumetric thin-plate spline warp
1 void warp_volume(float *input , float *output , float *weight_x , float *
weight_y , float *weight_z , long int control_points , long int *dim ,
float *coeffs , float span , float *spacing , float *origin)
2 {
3 float pxyz [3] = {0.0f, 0.0f, 0.0f};
4 for(int x = 0; x < dim [0]; x++)
5 {
6 pxyz [0] = (origin [0] + (spacing [0] * x)) / span;
7 for(int y = 0; y < dim [1]; y++)
8 {
9 pxyz [1] = (origin [1] + (spacing [1] * x)) / span;
10 for(int z = 0; z < dim [2]; z++)
11 {
12 pxyz [2] = (origin [2] + (spacing [2] * x)) / span;
13 float sum [3] = {weight_x[control_points +0],
14 weight_y[control_points +0],
15 weight_z[control_points +0]}
16 sum [0] += weight_x[control_points +1] * pxyz [0];
17 sum [1] += weight_y[control_points +1] * pxyz [1];
18 sum [2] += weight_z[control_points +1] * pxyz [2];
19 sum [0] += weight_x[control_points +2] * pxyz [0];
20 sum [1] += weight_y[control_points +2] * pxyz [1];
21 sum [2] += weight_z[control_points +2] * pxyz [2];
22 sum [0] += weight_x[control_points +3] * pxyz [0];
23 sum [1] += weight_y[control_points +3] * pxyz [1];
24 sum [2] += weight_z[control_points +3] * pxyz [2];
25
26 for(int i = 0; i < control_points; i++)
27 {
28 float scale = logf (0.00001 +
29 ((( coeffs [(i*6)+0] - pxyz [0])*( coeffs [(i*6)] +
0] - pxyz [0])) +
30 (( coeffs [(i*6)+1] - pxyz [1])*( coeffs [(i*6)] +
1] - pxyz [1])) +
31 (( coeffs [(i*6)+2] - pxyz [2])*( coeffs [(i*6)] +
2] - pxyz [2]))));
32 sum [0] += weight_x[i] * scale;
33 sum [1] += weight_y[i] * scale;
34 sum [2] += weight_z[i] * scale;
35 }
36
37 float sxyz [3] = {
38 ((( pxyz [0] + sum [0]) * span) - origin [0]) / spacing
[0]
39 ((( pxyz [1] + sum [1]) * span) - origin [1]) / spacing
[1]
40 ((( pxyz [2] + sum [2]) * span) - origin [2]) / spacing
[2]
41 }
42
43 trilinear_interpolate(input , output , dim , sxyz);
44 }
45 }
46 }
47 }
CHAPTER 4: LANDMARK AND CONTROL POINT
REGISTRATION
4.1 Overview
When using a bspline registration method, the control points and coefficients are integrated
into the structure of the transform in a regular grid, and each need to be tweaked to achieve
the desired transform. All points in the transform are then able to be calculated based off
of this predictable number of points that dictate the transform. The TPS algorithm does
not have this regular grid of control points, nor would it help if it did. The TPS algorithm
needs to find critical points between the deformed and the target volumes and then match
them. It is possible to do this with pixels only, however attempting to create a transform in
this method will result in a slow, and likely inaccurate transform. Given that certain critical
points on the grid need to be found for proper registration to take place, two methods are
discussed here which can be used.
The first such method creates a rough mapping between the two volumes by using
pattern recognition techniques to identify similar regions of pixels, and then maps those
regions of pixels to one another with a single control points. This method works very well
with rough, and even very large deformations, however it doesn’t work as well with smaller
or sharp deformations. The matching may be based on any similarity metric, however it is
beneficial to use a metric that weights distance against other features of the region to try
4. LANDMARK AND CONTROL POINT REGISTRATION 11
to find an optimal region mapping.
An alternative method that may produce a closer mapping between the two objects with
a degree of structure involves using mesh matching to reduce the matching optimization
to a 3D triangle mesh. This triangle mesh is first obtained by using the marching cubes
algorithm on an identified isosurface in the volume. The created triangle mesh is then
decimated, in order to remove unnecessary points from the image. When the number of
points has reached a sufficiently low number, the mesh matching algorithm is run, which
will then create a match of the mesh structure.
4.2 Marching Cubes
The marching cubes algorithm creates a triangle mesh based upon a given 3D isosurface.
The isosurface in this application is obtained either through human set bounds, or by passing
a canny filter over the image, and then bounding the high noise areas. The selected surface
is then analyzed. Multiple triangles can be created in each cube, depending on the inclusion
and exclusion of different vertices from the mesh. In order to obtain the most accurate mesh,
the mesh creation grid should have a high number of points, up to a maximum of one per
slice in each dimension. Further resolution provides no benefit. This process is linear with
respect to the number of resultant grid dimensions.
More accurate meshes from the marching cubes algorithm are highly desirable for later
steps, even when considering the decimation algorithm, and perhaps especially considering
the decimation algorithm. By having a high number of original created triangles in the
mesh, the decimation algorithm will be able to accurately represent the created mesh. If
low numbers, relative to the dimensions of the volume, are chosen instead, then it is possible
4. LANDMARK AND CONTROL POINT REGISTRATION 12
that some critical information may be lost, which would then defeat the purpose of using a
mesh matching scheme.
4.3 Triangle Decimation
Triangle decimation algorithms are frequently used to simplify the representation of 3D,
and especially so to increase the speed of rendering. In this case, triangle decimation is
instead used to remove extraneous information from the mesh until the mesh finally reaches
a desired number of points, or all remaining points are too critical to remove. Past results
[source] have shown that decimation algorithms are able to remove as many as 90% of the
primitives present in an original mesh while retaining visual fidelity, and can remove further
points and still retain a high degree of critical mesh integrity.
Triangle decimation algorithms proceed by examining all vertices and edges in a mesh
and then prioritize their removal. This removal takes into account the vertex removal,
edge collapse, and half edge collapse operations. The vertex removal operation removes
the vertex from the matrix and inserts triangles back into the mesh, with a resultant loss
of two triangles. The edge collapse operation merges two vertices into a single vertex by
creating a new average vertex, redirecting all previous edges to the new vertex such that
they are unique, and then deleting the two older vertices. The last operation is the half edge
collapse, which is identical to the edge collapse, except that one of the vertices is anchored
in it’s position, and all edges from the second vertex are redirected to the anchored vertex
uniquely. By prioritizing and incrementally performing these operations on the given data
set, the number of vertices in the volume can be greatly reduced.
Given the potential for medical imaging data sets to have upwards of 400K triangles
4. LANDMARK AND CONTROL POINT REGISTRATION 13
[source], it is highly desirable to decimate the created mesh. Failing to decimate the mesh
could result in an untenable problem, with far too many mesh points to perform mesh
matching and deformation in a reasonable amount of time.
4.4 Selective Control Point Activation
Given an ideal mapping between the mesh points in the meshes derived from the target
and original volumes, we can then run the TPS algorithm. After running the algorithm
each control point will then have a weight associated with it. The weights can then be
used to deactivate the control points and rerun the algorithm. By determining which
control points have sufficiently small weights, certain control points can then be deactivated.
By deactivating certain control points, the TPS algorithm can run faster, and obtain a
more generalized match to the target volume. In order to more quickly assess the cost of
deactivating certain control points, the control points TPS algorithm can be run against
only the landmarks in the original volume to see how well they match with the target
volume. This is an optimization problem, as is the bspline non-linear optimisation, however
it is a smaller problem, as the number of control points when used in non-random data
should be able to be greatly reduced, as local control points should enforce each other.
CHAPTER 5: IMPLEMENTATION
5.1 Overview
The program with which the thin-plate spline was tested was implemented in the Vala
[11] and C programming languages. As performance extensions, it also uses Intel’s Cilk
Plus [12], OpenMP 4, and AVX2/512 vectorization. The compiler used for all testing of
the ITK and native implementations was Intel’s C Compiler (icc), which shows significant
performance gains over the GNU Compiler Collection (gcc), which are gained through
auto-vectorization and intel-specific optimization. All performance critical code was written
in C for optimal runtime. Performance was profiled using a timing functions wrapped around
relevant sections. The program was configured to read and write volumetric images in the
MHA format. For both the CPU and MIC implementations, the code is parallelized with
OpenMP and vector instructions, however Xeon Phi has many more cores, and scales much
better to applications where there may be hundreds of threads present, as each core is able
to handle four threads at a time. As there can be a difference in performance, it should be
noted that the interpolation method used at the destination points of the resultant volume
is the trilinear interpolation.
5. IMPLEMENTATION 15
5.2 CPU Variation
The CPU implementation of the thin-plate spline algorithm has no significant changes over
the standard implementation for the CPU. The implementation is native, as to get the best
performance out of the CPU. The CPU implementation made use of all available options,
aside for the Xeon Phi and fast math instructions. Two models of CPUs were used. The
CPU present on the system with the Xeon Phi is the Intel Xeon E5-2609 at 2.50GHz.
The other CPU is the Intel Core i5-4670K, which was chosen for compatibility with AVX2
instructions. The main advantage of the CPU over the Phi is that it has a lower cold startup
time, as memory can be directly accessed from main memory, where as the Phi will need to
load that memory over the PCIe system bus before acting on it. Given that the operation
isn’t especially memory intensive, with a 512x512x128 volume in memory consuming only
128MiB, the bottleneck in the volume warp is necessarily the calculation of the per-voxel
warp distance.
5.3 MIC Variation
Given the unique architecture of the Xeon Phi architecture, it is necessary to use a many-
thread, highly vectorized implementation. Toward this end, code compiled for the Xeon Phi
must take advantage of this vectorization in order to surpass the performance of the CPU.
The Xeon Phi model that is running in the test system is the Xeon Phi 3120A, which has
57 cores, each with a 512-bit vector unit. Each core on the Xeon Phi may run at 1.1GHz
with up to four in-order threads. When using the Xeon Phi, there are many similarities
to using CUDA, in that data must be oﬄoaded to the device and then calls are made to
5. IMPLEMENTATION 16
Listing 5.1: OpenMP Parallelization
1 void warp_volume(float *input , float *output , float *weight_x , float *
weight_y , float *weight_z , long int control_points , long int *dim ,
float *coeffs , float span , float *spacing , float *origin)
2 {
3 .. .
4 ...
5 #pragma omp parallel for \
6 collapse (3) \
7 default(none) \
8 private (...) \
9 shared (...)
10 for(int x = 0; x < dim [0]; x++)
11 {
12 ...
13 for(int y = 0; y < dim [1]; y++)
14 {
15 ...
16 for(int z = 0; z < dim [2]; z++)
17 {
18 // initial points
19 ...
20 for(int i = 0; i < control_points; i++)
21 {
22 ...
23 }
24
25 trilinear_interpolate(input , output , dim , sxyz);
26 }
27 }
28 }
29 }
5. IMPLEMENTATION 17
Listing 5.2: Vectorization of Control Point loop
1 void warp_volume(float *input , cloat *output , float *weight_x , float *
weight_y , float *weight_z , long int control_points , long int *dim ,
float *coeffs , float span , float *spacing , float *origin)
2 {
3 ...
4 float s0 = 0;
5 float s1 = 0;
6 float s2 = 0;
7 ...
8 for(int x = 0; x < dim [0]; x++)
9 {
10 ...
11 for(int y = 0; y < dim [1]; y++)
12 {
13 ...
14 for(int z = 0; z < dim [2]; z++)
15 {
16 ...
17 #pragma omp simd reduction (+:s0 ,s1 ,s2)
18 for(int i = 0; i < n; i++)
19 {
20 scale = ...
21 s0 += scale * weight_x[i];
22 s1 += scale * weight_y[i];
23 s2 += scale * weight_z[i];
24 }
25
26 trilinear_interpolate(input , output , dim , sxyz);
27 }
28 }
29 }
30 }
5. IMPLEMENTATION 18
Listing 5.3: Xeon Phi Oﬄoading
1 void warp_volume(float *input , cloat *output , float *weight_x , float *
weight_y , float *weight_z , long int control_points , long int *dim ,
float *coeffs , float span , float *spacing , float *origin)
2 {
3 ...
4 #pragma offload target(mic) \
5 in(input : length (0) REUSE) \
6 in(weight_x : length(n+4) \
7 in(weight_y : length(n+4) \
8 in(weight_z : length(n+4) \
9 in(coeffs : length(n*6)) \
10 out(output : length(lim))
11 {
12 for(int x = 0; x < dim [0]; x++)
13 {
14 ...
15 for(int y = 0; y < dim [1]; y++)
16 {
17 ...
18 for(int z = 0; z < dim [2]; z++)
19 {
20 ...
21 for(int i = 0; i < n; i++)
22 {
23 scale = ...
24 s0 += scale * weight_x[i];
25 s1 += scale * weight_y[i];
26 s2 += scale * weight_z[i];
27 }
28
29 trilinear_interpolate(input , output , dim , sxyz);
30 }
31 }
32 }
33 }
34 }
the device. One key difference between the CUDA architecture and the Xeon Phi is that
all Xeon Phi code may be written with inline C code and oﬄoad calls are handled with
#pragma offload ... calls.
5. IMPLEMENTATION 19
5.4 Fast Math
After profiling the running application on the CPU, the piece of the application that con-
sumed the greatest portion of time was the log function from the math library. Knowing
this, the log function provided with the math library was replaced with the fastlog func-
tion [13]. This function introduces little additional error, but cut runtime by two thirds.
This method, as authored by Paul Minerio, worked for simpler SIMD instructions sets, but
broke with more advanced sets like AVX. It is possible that these fastmath extensions could
be reworked in order to obtain the same speedup with vector processors.
5.5 Vectorization
Vectorization can be gained on a variety of systems through a few sets of specialized in-
struction sets. When working on the CPU, the degree of vectorization is more limited,
however with newer processor implementations, this could change. The processor used in
the test system was capable of SSE, SSE2, SSE3, SSE4.1, SSE4.2, and AVX. The Xeon
Phi’s vector processing units are capable of 512-bit vector processing, thereby handling up
to 16 single-precision calculations at a time. The vectorization is most useful for assessing
the warp distance for each control point, as this operation is independent for each control
point and each voxel, and requires no conditional blocks outside of the loop control.
5.6 Cilk
While Cilk Plus is not yet well supported across compilers, using a recent compiler (icc
15.0.3), the Cilk Plus platform allows for a vectorization-independent description of code
flow. Based on the hardware capability that the CPU has, the Cilk Plus platform is able to
5. IMPLEMENTATION 20
provide customized vectorization. As such, a single AVX512 instruction could be decom-
posed into two AVX/2 instructions, or four SSE instructions. This helps to promote code
reuse, and allows the compiler to handle minor optimization details.
CHAPTER 6: RESULTS
Results from from running the program show that the warp was successful, obtained a
substantial speed increase over the single threaded implementation, and an even more sub-
stantial speed increase on the Xeon Phi. Performance and accuracy were quantitatively
assessed by using 4DCT scans with human placed landmarks. Additionally, a volume of
the same size as was used in [7] is provided to be used as a rough timing comparison to an
industrial CUDA card.
For the purpose of comparing the difference between different architectures performance
for the thin-plate spline, a Voxel Control Point Throughput (VCPT) metric is used. This
is determined by dividing the product of the voxel count and control point count with the
runtime of the algorithm. This is a viable metric for measurement, as the voxel count and
control point count adjustment will have a similar effect on all platforms and can be used
as a good predictor of how an implementation will scale when given greater numbers of
control points or voxels.
Accuracy was not noticeably effected by the differences between the Xeon Phi and the
CPU implementation as can be seen in the images below, and in 6.1. The major points
of difference between the Xeon Phi, Xeon multi-threaded, and Xeon single-threaded was
mostly in the use of the fastmath package. The difference, in this instance, caused by using
the fastmath package was nearly less than %1 of the total error value between the target
6. RESULTS 22
Variation Time (s) MSE Size (x, y, z, points) VCPT
Xeon Phi 0.800 0.000297 512x512x128x139 5.830e9
Xeon w/ AVX Multi-Threaded 6.160 0.000297 512x512x128x139 7.571e8
Xeon w/ AVX Single-Threaded 24.931 0.000299 512x512x128x139 1.870e8
Core i5 w/ AVX2 Single-Threaded 7.817 0.000301 512x512x128x139 5.965e8
Core i5 w/ AVX2 Multi-Threaded 1.804 0.000301 512x512x128x139 2.584e9
Xeon Phi 0.132 - 256x256x94x72 3.360e9
Xeon Multi-Threaded 0.866 - 256x256x94x72 5.121e8
Xeon Single-Threaded 3.548 - 256x256x94x72 1.250e8
(Xeon) ITK Single-Threaded 14.35 - 512x512x128x42 9.820e7
(Xeon) Native Single-Threaded 7.66 - 512x512x128x42 1.839e8
CUDA (Tesla C2050 [7]) 0.252 - 256x256x94x75 1.833e9
Figure 6.1: Showing the runtime for volumes of different volume sizes and architectures and
the corresponding VCPT.
and warped image.
As far as performance of the transform is concerned, the Xeon Phi, at both volume
sizes, showed a significant improvement in performance over the CPU implementation.
Given the nature of the MIC architecture, it is unsurprising that the larger volume showed
a much greater improvement in speed, as the overhead of organizing the cores is much more
significant for smaller data sets than for large ones.
The implementation on the Xeon Phi seems to have much improved performance over
the CUDA implementation seen in [7]. Unfortunately, due to hardware unavailability it is
impossible to test both on the same system, and the best that can be done is an estimate.
Both are supercomputer-grade hardware, with both being rated at greater than 1Tflops per
device.
6. RESULTS 23
(a) (b)
(c) (d)
Figure 6.2: Different views of the right lung from the body cavity. Control points are marked
in yellow as the current points, and red as the target points. The majority of the movement
comes from the diaphram.
6. RESULTS 24
(a) Original (b) Warped (c) Target
Figure 6.3: Display of original, warped and target lung slices. Note that in the warped slice
(b), the diaphragm has been adjusted to match the target (c).
(a) Un-warped (b) Warped
Figure 6.4: Difference in an image slice of the body cavity before and after the thin-plate
spline warp.
6. RESULTS 25
Vectorization Width ST MT OpenMP Speedup Speedup over Scalar
None 1 60.474 15.577 3.882 1.000
SSE4.2 4 20.788 5.180 4.013 3.007
AVX 8 9.609 2.602 3.692 5.986
AVX2 8 7.180 1.804 4.333 8.634
Figure 6.5: Effect of vectorization and OpenMP on thin-plate spline warp.
An important note about the accuracy of warps is that while the thin-plate spline is
able to easily resolve large global errors as seen between figures 6.3 (a) and (b), it naturally
induces smaller local errors all over the volume. Unfortunately this is the nature of the
thin-plate spline. The control point pinning scheme that is present in this implementation
does help to prevent part of this movement by pinning areas of the two un-warped volumes
together prior to the warp. Unfortunately this pinning scheme restricts some of the global
movement by potentially prematurely pinning points.
Given the localized noise created by the deformation across the whole volume, the thin-
plate spline would have a much better output if it was instead run as a primary registration
technique to obtain a fast global correction to many features. Following this, a bspline
algorithm could be run in order to correct the induced local errors. Figure 6.4 shows the
diff between the target image and the transformed image, as well as the difference between
the target image and the untransformed image. This clearly demonstrates the way in which
the thin-plate spline is able to correct large errors with in the volume with a continuous
gradient, however the area that is corrected is left with noise. This is the noise that could
be fixed easily with an iterative bspline, but would require hundreds more thin plate spline
control points.
6. RESULTS 26
Time Error
ST MT ST MT
Norm 70.859 19.013 Norm 0.000291 0.000291
Fast 33.761 8.459 Fast 0.000294 0.000294
Figure 6.6: Comparison between time and error when switching between math.h’s logf
function, and the fastlog function.
On the CPU, the effect of vectorization on the thin-plate spline volume warp is much
more pronounced. Using a 512x512x128 volume with 138 control points, the speedup of
vectorization is determined by testing the thin-plate spline warp with no vectorization,
SSE4.2, AVX, and AVX2, as seen in table 6.5. The increased level of vectorization shows
a linear trend against run time, which suggests that the efficiency of vectorization on the
Xeon Phi will be similar. One way in which this will change on the Core i5 against the
Xeon Phi, is that the Xeon Phi handles instructions in-order, uses a 512-bit vector unit,
and has four threads per core. Regardless, this shows that the thin-plate spline algorithm,
at least on the CPU, scales ideally with the size of the vector unit.
Finally, the fastlog function was able to provide significant performance improve-
ments, with little accuracy loss. Table 6.6 shows that by switching to using the fastlog
function, the performance of the thin-plate spline doubled, both for the single-threaded and
multi-threaded case. Additionally, error values were not significantly impacted with this
change. Visibly examining the volumes, there is no discernible difference to the human
eye. The fastmath package [13] does have SSE extensions available, however there are no
extensions available for AVX or AVX2, and so these were tested without vectorization of
the fastlog function.
CHAPTER 7: CONCLUSIONS
The ideas and techniques presented in this thesis show that a thin-plate spline algorithm can
greatly benefit from the use of heavily vectorized hardware. The Xeon Phi was shown here
to be able to match and surpass other Teraflop-grade hardware in similar form factors. Fur-
thermore, the thin-plate spline algorithm can benefit form this acceleration using standard
vectorization and threading practices to accelerate the CPU versions of thin-plate spline.
Measurements on architectures with varying capabilities for vectorization show that heavier
vectorization has little diminishing effect on the performance of the thin-plate spline, and
can even help newer and lighter architectures to outperform architectures that are just a
few years older, yet sport many times more compute cores. In table 6.1, the VPCT metric
showed that the Core i5-4670K from 2014 was able to outputform the NVidia Tesla C2050
from 2010 by a margin of 40%.
The accuracy of the large scale correction of volume images also shows that the thin-
plate spline is capable of functioning as its own primary registration technique. While the
thin-plate spline is able to handle massive changes relatively easily and with few supporting
points, the thin-plate spline also introduces noise into the deformed image. This noise could
then be corrected with a more localized registration technique.
Another introduced feature to improve performance of the transform was the use of “fast
math” functions. These fast math functions allow an approximation of the math library’s
7. CONCLUSIONS 28
log() function at a third of the cost. Given that the log() function has an approximate
cost of 60 float point addition operations or 15 float point multiplication operations. Using
this estimation, the throughput of the thin-plate spline on the CPU is increased by 3x with
little cost to accuracy.
CHAPTER 8: FUTURE WORK
Future work will center around thin plate spline control point selection. While the Xeon Phi
has shown to have significantly greater performance than a CUDA implementation, neither
implementation looks at control point selection, instead preferring to use landmarks placed
by humans, as the alternative leads to a global optimization and search problem. Often
times, the thin plate spline is used as an attempt to clean up runs of the bspline algorithm.
In the bspline algorithm, localized changes are taken care of very quickly and very well,
however it can largely miss more global changes. The thin-plate spline is then used to clean
up these global changes.
Give the potential for global changes, the thin-plate spline could be very useful as an
initial, automated, pass, as opposed to the bspline algorithm. To this end, certain key
information could be extracted from the volumes in order to propose landmarks to the
thin-plate spline algorithm. By using the thin plate spline first with this key information
mapped to each other, the thin-plate spline could be used to obtain a rough global mapping
between the two volumes. The bspline algorithm could then be run to clean up the localized
deformation.
To obtain key information about the volume, a canny filter or some thresholding filters
corresponding to special ranges in the volume could be run. This information would then
be a valid metric for MSE or MI heuristics to solve a lesser optimization problem. Unless
8. FUTURE WORK 30
there is no correspondence between the filtered ranges in each volume, the filter would
select similar ranges between each image. Alternatively, MI heuristics could be used to
select ranges of similarity between the two volumes for guided filtering.
With the filtered image, a mesh could be obtained by using the marching cubes algorithm
[14]. The marching cubes algorithm would then reduce the image data into a triangle mesh.
This is useful because while the volume data as voxels cannot be simplified into regions of
similarity, mesh data can be simplified to a desired level using a progressive mesh. This
progressive mesh is able to choose between different levels of detail based of filtering criteria.
This level of detail selection is important because the thin-plate spline has a linear relation
between runtime and the number of control points. Using a scoring metric like MSE or
MI for each of the points in the mesh, the level of detail could then be adjusted until the
necessary number of control points is determined.
To complete the global thin-plate spline matching, a mesh matching can be used on
the obtained volumetric meshes. The mesh matching technique, like the image registration
itself, is an optimization problem, however by using a mesh, the size of the problem is greatly
reduced, which could make the problem much more feasible, particularly if the progressive
mesh is used to gradually increase the level of detail in the mesh.
Another brief matter for future work will include extending the fastmath package to
enable AVX and AVX2 extensions. This would be interesting, because given the metrics in
table 6.5 and 6.6, the addition of a fast log function with AVX/2 instructions would allow
a Core i5 processor to nearly match the Xeon Phi Coprocessor, assuming that the vector
version is able to match the performance gains of the scalar version.
LIST OF REFERENCES
[1] D. L. G. Hill, P. G. Batchelor, M. Holden, and D. J. Hawkes, “Medical image registra-
tion,” Physics in Medicine and Biology, vol. 45, pp. R1 – R45, 2000. 4
[2] J. B. A. Maintz and M. A. Viergerver, “A survey of medical image registration,”
Medical Image Analysis, vol. 2, no. 1, pp. 1–36, 1988. 4
[3] J. Shackleford, N. Kandasamy, and G. Sharp, “On developing b-spline registration
algorithms for multi-core processors,” Physics in Medicine and Biology, vol. 55, p. 6329,
2010. 4, 5
[4] F. L. Bookstein, “Principal warps: Thin-plate splines and the decomposition of defor-
mations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11,
pp. 567–585, 1989. 4
[5] G. Donato and S. Belongie, “Approximate thin plate spline mappings,” Lecture Notes
in Computer Science, vol. 2352, pp. 21–31, 2002. 5
[6] J. Shackleford, N. Kandasamy, and G. Sharp, GPU Computing Gems Emerald Edition,
Chapter 47. Morgan Kaufmann Publishers Inc., 2011. 5
[7] W. Luo, X. Yang, X. Nan, and B. Hu, “Gpu accelerated 3d image deformation using
thin-plate splines,” 2014. 5, 21, 22
[8] NVIDIA, NVIDIA CUDA Complete Unified Device Architecture Programming Guide,
first ed., 2007. 5
[9] Intel, Intel Xeon Phi Coprocessor System Software Developers Guide, 2014. 5
[10] Intel, Intel Math Kernel Library for Linux OS, 11.2 ed., 2015. 5
[11] “Vala - compiler for the gobject type system,” 2015. https://wiki.gnome.org/Project-
s/Vala. 14
[12] “Cilkplus.” https://www.cilkplus.org. 14
[13] P. Mineiro, “Fast approximate logarithm, exponential, power, and inverse root,” Ma-
chined Learnings, 2011. 19, 26
[14] W. E. Lorensen and H. E. Cline, “Marching cubes: A high resolution 3d surface con-
struction algorithm,” Computer Graphics, vol. 21, no. 4, 1987. 30
VITA
Richard Andrew Benton was born in Wyomissing, PA in 1991. Andrew graduated from
Central Bucks West High School in 2010 and began a Bachelors of Computer Engineering at
Drexel University in the same year. In his second year at Drexel University, he began a Co-op
at Susquehanna International Group working on business web application development. In
his third year at Drexel, he worked in an internship position for Micron Technology in
Product Engineering examing test and production data for DDR3 DRAM components.
In this same year he converted his program at Drexel into an accelerated Bachelors and
Masters program. During his fourth year, he again returned to Micron Technology for an
internship in Product Engineering, this time working on the emerging technology, Hybrid
Memory Cube.

