An easily-implemented, block-based fast marching method with superior
  sequential and parallel performance by Yang, Jianming
AN EASILY-IMPLEMENTED, BLOCK-BASED FAST MARCHING
METHOD WITH SUPERIOR SEQUENTIAL AND PARALLEL
PERFORMANCE∗
JIANMING YANG†
Abstract. The fast marching method is well-known for its worst-case optimal computational
complexity in solving the Eikonal equation, and has been employed in numerous scientific and en-
gineering fields. However, it has barely benefited from fast-advancing multi-core and many-core
architectures, due to the challenges presented by its apparent sequential nature. In this paper,
we present a straightforward block-based approach for a highly scalable parallelization of the fast
marching method on shard-memory computers. Central to our new algorithm is a simplified restarted
narrow band strategy, with which the global bound for terminating the front marching is replaced
by an incremental one, increasing by a given stride in each restart. It greatly reduces load imbalance
among blocks through a synchronized exchanging step after the marching step. Furthermore, sim-
ple activation mechanisms are introduced to skip blocks not involved in either step. Thus the new
algorithm mainly consists of two loops, each for performing one step on a group of blocks. Notably,
it does not contain any data race conditions at all, ideal for a direct loop level parallelization using
multiple threads. A systematic performance study is carried out for several point source problems
with various speed functions on four grids with up to 10243 points using two different computers.
Substantial parallel speedups, e.g., up to 3–5 times the number of cores on a 16-core/32-thread
computer and up to two orders of magnitude on a 64-core/256-thread computer, are demonstrated
with all cases. As an extra bonus, our new algorithm also gives improved sequential performance
in most scenarios. Detailed pseudo-codes are provided to illustrate the modification procedure from
the single-block sequential algorithm to the multi-block one in a step-by-step manner.
Key words. Eikonal equation, Fast marching method, Narrow band approach, Domain decom-
position, Parallel algorithm, Shared-memory parallelization
AMS subject classifications. 35F30, 35L60, 49L25, 65N06, 65N22, 65K05
1. Introduction. The fast marching method [10, 11, 12] solves the stationary
boundary value problem defined by the Eikonal equation:
(1.1)
|∇ψ(x)|F (x) = 1, x ∈ Ω \Γ,
ψ(x) = 0, x ∈ Γ ⊂ Ω,
where Ω is a domain in Rn, Γ is the initial boundary, and F (x) is a positive speed
function, with which the boundary information propagates away along characteristics
in the domain. This equation has been used in a variety of applications, such as com-
putational geometry, computer vision, optimal control, computational fluid dynamics,
materials science, etc. The fast marching method is a non-iterative algorithm that
closely resembles Dijkstra’s method [4] for finding the shortest path on a network.
It was first proposed by Tsitsiklis [13] using an optimal control approach. Sethian
[10] derived the algorithm based on an upwind difference scheme and introduced the
name “fast marching”, which has since become a very popular method for solving the
Eikonal equation.
The success of the fast marching method stems from its simplicity and efficiency.
The discretization of the Eikonal equation using upwind difference schemes is straight-
forward. At a given grid point, the finite difference stencil contains only upwind
neighbors and the causality of the equation is strictly followed. In terms of data
structure, it only involves a heap priority queue, which is used to march the solution
∗Submitted to the editors September 10, 2018.
†Fidesi Solutions LLC, P.O. Box 734, Iowa City, IA 52244, USA (jmyang@fidesisolutions.com).
1
ar
X
iv
:1
81
1.
00
00
9v
1 
 [p
hy
sic
s.c
om
p-
ph
]  
31
 O
ct 
20
18
2 J. YANG
in a rigorous increasing order. In the fast marching method, the number of times that
a point is visited is minimized and no iterations are involved in the one-pass solution
updating process. For a case with N grid points, the fast marching method has a
worst-case optimal algorithm complexity of O(N logN), as the run-time complexity
of reordering of a heap of length n is O(log n).
Several studies have been reported to improve the O(N logN) run-time complex-
ity to obtain O(N) complexities for solving the Eikonal equation in a single pass. In
particular, Tsitsiklis proposed a O(N) approach using a bucket data structure along
with his O(N logN) Dijkstra-like method [13]. Since insertion and deletion of a node
in a bucket structure takes only O(1) computations instead of O(log n) in the binary
heap implementation of a priority queue, the complexity of the whole algorithm was
reduced to O(N). A shared-memory parallel implementation was also provided. In
[16], an O(N) implementation of the fast marching method was developed with the
help of untidy priority queue that does not distinguish the priorities of the points
within the same bucket. However, the quantization of the priorities introduces an
extra error of the same order of magnitude of the discretization scheme. The O(N)
methods in [13] and [16] are single-pass approaches like the original fast marching
method. In contrast, the group marching method developed in [8] obtained an O(N)
complexity without the use of sorting data structures. In this method, essentially, a
group of points in a narrow band from the vicinity of the marching front are set to be
valid at the same time through two passes of solution updating. However, these O(N)
algorithms seem to have received less attention than other fully iterative algorithms
of O(N) complexity, such as the fast sweeping methods [17] and the fast iterative
methods [7].
A pronounced feature of iterative methods is that they are highly susceptible
to different parallelization strategies, either coarse grained [18, 5, 2] or fine grained
[7, 14, 3]. On the contrary, the fast marching method has no similar straightforward
parallelism as found in these iterative methods. Actually, it was long deemed inher-
ently sequential due to the use of a heap priority queue, with which there is only
one single point in the whole domain to be accepted as valid at a time. A few at-
tempts to acheive parallel implementations using domain decomposition techniques
were reported in the literature [6, 1], but generally with limited success until recently.
In [15], a highly scalable massively parallel algorithm of the fast marching method
was established through a rigorous systematical test. In this method, essentially, a
slightly modified sequential fast marching algorithm was run on each process for one
subdomain. A novel restarted narrow band approach was introduced to coordinate
the synchronization of front propagation among subdomains, such that the the fre-
quency of inter-process communications and the number of points to be refreshed with
incoming front characteristics could be balanced to minimize the total cost. Several
other means, such as extended data structures for interface problems, and augmented
tags for point status, were also introduced to streamline the algorithm. This method
was based on an overlapping domain decomposition technique for distributed-memory
parallelization and developed for large-scale practical applications of today using bil-
lions of grid points on hundreds of thousands of processes. However, it is still hindered
by the load imbalance inherent to the solution process of the Eikonal equation in a
stationary domain decomposition setting. That is, a process might stay idle until the
advancing front engages the subdomain assigned to it because of the fixed one-to-one
process-subdomain relationship. In addition, the MPI-based parallel algorithm could
not take full advantage of the shared-memory architecture prevalent in multi-core
and/or many-core processing.
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 3
In this paper, we address these issues with a fresh view on the sequential and par-
allel fast marching algorithms mentioned above, and propose a block-based approach
with superior sequential and parallel peformance. This approach follows the original
sequential algorithm nearly exactly for a single block, obtains mostly much improved
performance immediately with multiple blocks, and almost alway realizes a parallel ef-
ficiency way above unity with multiple threads working on these blocks. The restarted
narrow band strategy [15] is still central to the whole process, as it is fully consistent
with the narrow band fast marching method. Nonetheless, a non-overlapping domain
decomposition technique is adopted. Compared with the overlapping approach used
in [15], the non-overlapping approach complies better with the sequential algorithm
and facilitates completely local computation in each subdomain. Thus the whole al-
gorithm does not involve any race conditions, which usually require the introduction
of lock synchronization, a common performance bottleneck in shared-memory paral-
lelization. Moreover, simple block activation-deactivation mechanisms are designed
to determine if a block should be included in the narrow band marching step or the
ghost cell data exchanging step. Therefore, the load imbalance issues prominent in a
stationary domain decomposition setting are substantially mitigated. Starting with
a standard sequential fast marching algorithm, we introduce small modifications on
top of it in a step-by-step manner without altering any algorithm essentials. Together
with systematic tests on different platforms, an illustrative approach is undertaken to
provide a dependable benchmark parallel algorithm for the sequential fast marching
method.
2. Numerical Methods. In this part, a color code is used in the pseudo-codes
to illustrate the incremental modifications introduced in each Subsection: 1) black,
for the original sequential fast marching method in Sec. 2.1; 2) red, for augmented
status tags and the unified heap in Sec. 2.2; 3) green, for the non-overlapping domain
decomposition technique in Sec. 2.3; 4) blue, for the restarted narrow band strategy
in Sec. 2.4; 5) cyan, for the block activation mechanism for the marching step; and
6) magenta, for the block activation mechanism for the exchanging step. The shared-
memory parallelization using OpenMP threads is not given here, because it is fairly
straightforward based on the current pseudo-codes, and particular OpenMP directives
depend on the programming language used for the implementation. Also, for brevity,
interface problems are not discussed here, but would be treated exactly the same as
presented in [15].
2.1. Sequential fast marching method. To simplify the discussion, we use a
uniform grid with a cell size of ∆h in each direction to discretize the physical domain
Ω, although the methodology is by no means restricted to such a uniform grid. As
shown in Fig. 1(a), the unknown variable, ψ, is defined at the cell center (i.e., the
grid point) and the grey-shaded zones are filled with ghost cells. For conenience, the
ghost cell zones are collectedly named Θ and Ξ = Ω + Θ as the combination of the
discretized physical domain and ghost cell zones. The following quadratic equation
for ψi,j,k can be obtained using the Godunov-type upwind finite difference scheme [9]
to approximate Eq. (1.1):
(2.1)
(
max(D−xi,j,kψ,−D+xi,j,kψ, 0)
)2
+
(
max(D−yi,j,kψ,−D+yi,j,kψ, 0)
)2
+
(
max(D−zi,j,kψ,−D+zi,j,kψ, 0)
)2 =
1
F 2i,j,k
,
4 J. YANG
where the operators D−xi,j,k and D
+x
i,j,k define the backward and forward first-order
difference approximations to the spatial derivative ∂ψ/∂x, respectively:
(2.2) D−xi,j,kψ =
ψi,j,k − ψi−1,j,k
∆h
, D+xi,j,kψ =
ψi+1,j,k − ψi,j,k
∆h
.
The y and z directions are treated similarly.
From Eq. (2.1) it is apparent that the solution of ψi,j,k only relies on the neigh-
boring points of smaller value, or upwind points. The fast marching method takes
advantage of this fact by following a systematical way of solving Eq. (2.1) in the
computational domain starting from the point with the smallest value. To establish
the order of updating, the grid points are divided into three categories: i) KNOWN,
which contains points that are part of the boundary condition or considered to have
final solutions; ii) BAND, which contains points that have KNOWN neighbor(s), thus their
solutions can be updated by solving Eq. (2.1); and iii) FAR, which contains points
that do not have any KNOWN neighbors yet, thus it is unnecessary to solve Eq. (2.1)
on them. For a new KNOWN point, the solutions of its BAND and FAR neighbors can
be updated through Eq. (2.1). Then the point that has the smallest value in the
BAND set is removed from the BAND set to become a new KNOWN point, which clearly
has the largest solution in all KNOWN points. This loop continues until all points are
KNOWN or the solution of the last KNOWN point reaches a pre-set threshold. In order to
maintain a strict order of increasing values in this process, a binary heap structure is
usually adopted to sort points in the BAND set [12]. Several standard heap operations
are involved, e.g., Insert Heap adds an element into the heap, Locate Min gives the
index of the top element on the grid, Remove Min deletes the top element from the
heap, Up Heap moves an element up in the heap, etc.
Algorithm 2.1 Heap initialization:
Initialize Heap(ib).
1: ψib ← +∞
2: Gib ← FAR
3: size(Hib)← 0
4: Aib ← FALSE
5: for all (i, j, k) ∈ Ξib do
6: if (i, j, k) is on or adjacent to Γ then
7: ψibi,j,k ← ψ0
8: Gibi,j,k ← KNOWN FIX
9: Update Neighbors(i, j, k, ib) Insert Heap(i, j, k,Hib)
10: Aib ← TRUE
11: end if
12: end for
Therefore, this first step of the narrow band fast marching algorithm is the ini-
tialization of the heap structure, as shown in Algorithm 2.1. In this work, a status
tag G is used to label the category that a point (i, j, k) belongs to. All points in Ξ
are initialized with a function value +∞ and a status tag FAR. The initial heap size
is set to be zero. Then, the grid points on or immediately adjacent to the boundary
are specified with initial function values ψ0 and marked as KNOWN. The neighbors of
a KNOWN point can be updated as shown in Algorithm 2.2. Here only the neighboring
points within Ω that are not KNOWN are checked by solving the quadratic equation us-
ing Algorithm 2.3. If the solution satisfies the causality condition and is smaller than
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 5
the current value, it can be accepted. Correspondingly this point is tagged as BAND
and inserted to the heap or moved up in the heap if it was added earlier. After the
initialization is completed, the front can be marched in a loop as shown in Algorithm
2.4. Obviously, if the heap is empty or if the top element in the heap has a value
larger than the preset narrow band bound, widthband, the marching of front should
be terminated. Otherwise, the top element is removed from the heap and tagged
as KNOWN. Then its neighbors can be updated as discussed above. The operations
within the loop repeat until one of the exit conditions is met. For more details of the
sequential narrow band fast marching method, the reader is referred to [10, 11, 12]
and the references therein.
Algorithm 2.2 Update neighbors of a point just removed from the heap:
Update Neighbors(i, j, k, ib).
1: for all (l,m, n) such that (|l − i|+ |m− j|+ |n− k|) = 1 do
2: if (l,m, n) ∈ Ωib then
3: if Gibl,m,n /∈ KNOWN FIX and ψibl,m,n > ψibi,j,k then
4: ψtemp ← Solve Quadratic(l,m, n, ib)
5: if ψtemp < ψ
ib
l,m,n then
6: ψibl,m,n ← ψtemp
7: Gibl,m,n ← BAND
8: if (l,m, n) 6∈ Hib then
9: Insert Heap(l,m, n,Hib)
10: else
11: Up Heap(l,m, n,Hib)
12: end if
13: if (l,m, n) ∈ Πib then
14: Υib ← Υib ∪ {(l,m, n)}
15: end if
16: end if
17: end if
18: end if
19: end for
2.2. Augmented status tags and unified heap. In [15], augmented status
tags were introduced to distinguish the status of a point in different stages of the par-
allel fast marching method. In particular, BAND and KNOWN points in the overlapping
zones were labelled as NEW when their values were obtained from the quadratic solver
and OLD when they were collected to be sent to neighboring subdomains, respectively.
This was beneficial for reducing the data size involved in communications as well as
repetitive computations for the distributed-memory parallelization with overlapping
domain decomposition. In this work, thanks to the shared-memory setting, such a dis-
tinguishment is found to be generally indifferent in terms of performance. Therefore,
we only separate the KNOWN category into two subsets: a) KNOWN FIX, which contains
the points initialized as boundary condition with fixed function values during the
solution process; and b) KNOWN NEW, which contains the rest KNOWN points defined in
the original sequential fast marching method. This tag augmentation is still neces-
sary, because a KNOWN NEW point in one subdomain may require recomputation due to
incoming characteristics from neighboring subdomains [15]. Likewise, the boundary
points are labeled as KNOWN FIX instead of merely KNOWN in the heap initialization step
6 J. YANG
as shown in Algorithm 2.1.
In this work, a modification of the heap initialization step given in [15] is that the
boundary points labeled as KNOWN FIX are also inserted into the heap now, showing as
replacing Update Neighbors with Insert Heap in Algorithm 2.1. Correspondingly,
it is necessary to verify that the top element in the heap is not in the KNOWN FIX set
before labeling it as KNOWN NEW in Algorithm 2.4. Therefore, regardless of their status
all points are put into the heap. Such a unified heap treatment slightly increases
the computational cost due to filling the initial heap with some KNOWN FIX elements.
As a direct result, however, Update Neighbors will be called only once and the code
structure can be much simplified when block activation mechanisms are involved later.
Moreover, with the small changes up to this point the whole algorithm still strictly
follow the original sequential algorithm; that is also why we treat them separately in
this part.
2.3. Block-based approach. Here we introduce a simple block-based approach
for the fast marching method. For simplicity, each direction is evenly divided into
mBlocks pieces, which gives a total of nBlocks = mBlocks3 subdomains. As shown
in Fig. 1(b) for a 2D case, each subdomain are patched with ghost cell zones in both
the lower and upper ends of each direction, just like the undecomposed domain in
Fig. 1(a). For an arbitrary subdomain ib, it will have its own heap, which can be
initialized and operated independently. Although ψ and G can be defined globally
in a shared-memory setting, the algorithm on each subdomain follows the original
sequential algorithm closer with locally defined variables. Moreover, a thread only
handles a grid block instead of the whole domain at any time. With a limited amount
of cache memory, the cache miss rate would be much lower for data structures of
smaller size, hence the cache performance could be improved too. Therefore, as shown
in the Algorithms discussed above, we add superscript ib to Ω, Θ, ψ, G, and the heap
H to emphasize that they are defined locally in each subdomain.
! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
! ! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! !
! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅!
! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ! ! ! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! ⋅! !
(a) (b)
Fig. 1. Computational domain: (a) single domain with ghost cell zones; (b) decomposed domain
with ghost cell zones for data exchanges between non-overlapping subdomains.
Algorithm 2.6 gives the main function for the block-based narrow band fast march-
ing method. Apparently, all function calls are embedded in the for loop in order to
be executed in each subdomain. As a simple way to run the sequential algorithm to
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 7
Algorithm 2.3 Solve the quadratic equation:
Solve Quadratic(l,m, n, ib).
1: nd← 0
2: Check the x direction for upwind neighbor:
3: d← 0
4: if Gibl−1,m,n ∈ KNOWN and ψibl,m,n > ψibl−1,m,n then
5: d← −1
6: end if
7: if Gibl+1,m,n ∈ KNOWN and ψibl,m,n > ψibl+1,m,n then
8: if d = 0 or ψibl+1,m,n < ψ
ib
l−1,m,n then
9: d← 1
10: end if
11: end if
12: if d 6= 0 then
13: nd← nd+ 1
14: ψnd ← ψibl+d,m,n
15: end if
16: Check the y direction for upwind neighbor:
17: · · · · · ·
18: Check the z direction for upwind neighbor:
19: · · · · · ·
20: Reorder ψ1, · · · , φnd such that ψ1 ≤ · · · ≤ φnd
21: id← 1
22: while id ≤ nd do
23: a← id
24: b←∑idi=1 ψi
25: c←∑idi=1 ψ2i − ∆h2F 2l,m,n
26: if (b2 − ac) ≥ 0 then
27: ψtemp ← b+
√
b2−ac
a
28: if id < nd and ψtemp > ψid+1 then
29: id← id+ 1
30: else
31: return ψtemp
32: end if
33: end if
34: end while
obtain correct results, Algorithm 2.4 is placed in an infinite loop. The major addition
in this loop not discussed above is the function that accesses the data of neighboring
subdomains and integrates them into the heap if applicable. For convenience, let Nib
represent the six neighboring blocks of block ib (left, right, front, back, bottom, and
top) and Πib represent the collection of points that are corresponding to the ghost
cell zones of its neighboring blocks. Therefore, as shown in Algorithm 2.5, Θib ∩Πnb
represents the portion of ghost cells in block ib that are within the physical domain
of block nb, If a point from nb have a smaller function value, then its value and status
can be copied to the corresponding point in block ib; this point will be added to Hib
or its position in Hib will be adjusted for a current heap element.
It is imperative that the infinite loop has a proper exit condition. Here we in-
8 J. YANG
troduce a logical variable A to indicate whether a subdomain ib is active for further
marching work (Aib = TRUE) or not (Aib = FALSE). As shown in Algorithm 2.1, Aib is
initialized as FALSE and set to be TRUE if the subdomain has an nonempty initial heap.
Then in Algorithm 2.4, same as the exit conditions for the sequential algorithm, when
the heap is empty or the top element in the heap is beyond widthband, this block
should be set inactive. On the other hand, a block has to be set active if its heap has
any elements less than the terminal bound after integrating the data from neighboring
blocks through ghost cells, as shown at the end of Algorithm 2.5.
It is also required to include the KNOWN NEW neighbors of a KNOWN point just re-
moved from the heap in the check for possible updates, i.e., only KNOWN FIX neighbors
can be excluded from the check as shown in Algorithm 2.2. In addition, the quadratic
equation is only solved at a neighbor with a larger value. Similar checks are imposed
on looking for upwind points in Algorithm 2.3. These comparisons of function val-
ues added here for bypassing unnecessary computations would be redundant in the
sequential algorithm, because points added to the KNOWN set follow a strict order of
increasing function values if only a single domain is involved. As discussed above
and in [15] with more details, the sequential algorithm has not been altered by the
introduction of the augmented status tags and these checks. However, they make the
block-based algorithm a very natural extension of the sequential one.
2.4. Restarted narrow band approach. With the above block-based ap-
proach, it is clear that the sequential fast marching algorithm will run in each block
until exit, then data will be exchanged among neighbors via ghost cells; and both the
marching and exchanging steps repeat until the global exit condition is met. This is
straightforward in concept and implementation, but the performance is unsatisfying
by any means. In [15], a novel restarted narrow band approach was proposed to ad-
dress the performance issue by replacing the terminal narrow band bound widthband
with an evolving one, i.e., boundband in Algorithm 2.4. It should be noted here that
the condition to label the block as inactive is still determined by widthband. On the
other hand, as shown in Algorithm 2.6, a free parameter stride is introduced to
advance the front by the extent of stride, or δs, in each run-through. Apparently,
with stride = widthband the previous block-based approach is fully restored.
Unlike [15], in which boundband was built on the smallest one among all top ele-
ments of the local heaps, here we choose a much simpler way to increase boundband
evenly with stride in each run-through. This simplification renders the global re-
duction operation of top heap elements unnecessary. More importantly, the second
marching step in [15] is not required here, which makes the restarted narrow band
approach become not only more compact and more efficient, but also more consistent
with the sequential algorithm.
2.5. Mechanism for activating the marching step. Due to the narrow band
nature of the fast marching method, the computational load is unevenly distributed in
space during different stages of the front marching process. With the current block-
based, restarted narrow band approach, it is conceivable that some blocks will be
free of load, either because the block has been fully updated (i.e., the local heap is
exhausted, or the terminal bound widthband is surpassed by the top element of the
local heap) or because the moving front has not reached this block yet. In this case,
as shown in Algorithm 2.4, the function call of the marching step will quit immedi-
ately without doing any real work. Therefore, a simple mechanism is introduced in
Algorithm 2.6 such that the marching step for a block is activated only if Aib = true.
Clearly this precondition won’t change the results, but the overheads of repeated
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 9
Algorithm 2.4 Front propagation within the narrow band:
March Narrow Band(ib).
1: Υib ← ∅
2: loop
3: if size(Hib) = 0 then
4: Aib ← FALSE
5: exit loop
6: end if
7: (i, j, k)← Locate Min(Hib)
8: if ψibi,j,k > widthband boundband then
9: if ψibi,j,k > widthband then
10: Aib ← FALSE
11: end if
12: exit loop
13: end if
14: if Gibi,j,k /∈ KNOWN FIX then
15: Gibi,j,k ← KNOWN NEW
16: end if
17: Remove Min(Hib)
18: Update Neighbors(i, j, k, ib)
19: end loop
20: for all block nb ∈ Nib such that Θnb ∩Υib 6= ∅ do
21: Cnb(ib)← TRUE
22: end for
function calls can be avoided. It should also be noted that the only place beyond the
heap initialization that Aib could be assigned true is in Algorithm 2.5 as discussed
earlier.
2.6. Mechanism for activating the exchanging step. Designing a mecha-
nism for activating the exchange step is slightly more involved because each block
has six neighbors. Here we introduce a six-element logical variable C for each block
to indicate possible incoming characteristics from neighboring blocks. As shown in
Algorithm 2.6, Cib is initialized as FALSE at the beginning. The exchanging step is
required for a block if any of the six elements becomes TRUE in the marching step,
i.e., Algorithm 2.4. As mentioned earlier, Πib represents the set of points in block
ib that correspond to the ghost cells in the neighbors of block ib. Initialized to be
empty, Υib is the set of points within Πib that are updated through the quadratic
solver and collected in Algorithm 2.2. If any points in Υib and the ghost cells of a
neighboring block nb coincide, the element of Cnb corresponding to block ib should be
set to TRUE. Then in Algorithm 2.5, a neighboring block nb will be used to possibly
update the ghost cells of block ib only if the corresponding element of Cib has been
set to TRUE in the marching step. Similarly, Πib can be replaced by Υib to further
limit the comparison operations. If should be noted that Cib is reset to FALSE after
all possible ghost cell updates for block ib have been checked.
2.7. Shared-memory parallelization. It is evident that Algorithm 2.6 mainly
consists of three for loops. Therefore, a shared-memory parallelization is simply to
run these three loops with multiple threads. This can be easily achieved by using the
10 J. YANG
Algorithm 2.5 Integrate data in the ghost cell region:
Integrate Ghost Cell Data(ib).
1: for all block nb such that nb ∈ Nib and Cib(nb) = TRUE do
2: for all (i, j, k) such that (i, j, k) ∈ Θib ∩Πnb Υnb do
3: if ψibi,j,k > ψ
nb
i,j,k then
4: ψibi,j,k ← ψnbi,j,k
5: Gibi,j,k ← Gnbi,j,k
6: if (l,m, n) 6∈ Hib then
7: Insert Heap(l,m, n,Hib)
8: else
9: Up Heap(l,m, n,Hib)
10: end if
11: end if
12: end for
13: end for
14: Cib ← FALSE
15: if size(Hib) 6= 0 then
16: (i, j, k)← Locate Min(Hib)
17: if ψibi,j,k ≤ widthband then
18: Aib ← TRUE
19: end if
20: end if
parallel do directive in Fortran or the parallel for directive in C/C++ to convert
them into parallel regions.
It might be surprising that there is no race condition in the current shared-memory
parallel algorithm. As mentioned earlier, ψ, G, and the heap H are defined locally in
each block. In Algorithm 2.5, the thread working on block ib will access the points
within its neighboring block nb to possibly update the ghost cells of block ib. Because
of the non-overlapping domain decomposition adopted in this work, the operations on
the ghost cells in the exchange step are totally isolated from those on physical cells
within the block in the marching step. On the other hand, the thread working on
block ib will also access Cnb of the neighboring block nb in Algorithm 2.4. Again, this
won’t make it a race condition as only the element corresponding to ib in Cnb could
be changed.
3. Results.
3.1. Test cases. Four point source problems tested in [15] were performed. For
simplicity, a unit cube [−0.5, 0.5] × [−0.5, 0.5] × [−0.5, 0.5] was used as the physical
domain and the point source was set at the domain center. Three cases were derived
from [2] with speed functions defined by
F (x, y, z) = 1,(3.1)
F (x, y, z) = 1 + 0.50 sin(20pix) sin(20piy) sin(20piz),(3.2)
F (x, y, z) = 1− 0.99 sin(2pix) sin(2piy) sin(2piz),(3.3)
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 11
Algorithm 2.6 Block-based, restarted narrow-band fast marching method:
Block Based Restarted Narrow Band Fast Marching.
1: boundband ← 0
2: C← FALSE
3: for all block ib such that 1 ≤ ib ≤ nblocks do
4: Initialize Heap(ib)
5: end for
6: loop
7: boundband ← min (boundband + stride, widthband)
8: for all block ib such that 1 ≤ ib ≤ nblocks and Aib = TRUE do
9: March Narrow Band(ib)
10: end for
11: for all block ib such that 1 ≤ ib ≤ nblocks and any(Cib = TRUE) do
12: Integrate Ghost Cell Data(ib)
13: end for
14: if all(A = FALSE) then
15: exit loop
16: end if
17: end loop
respectively. For the last case, in the domain of F = 1, there are four concentric
spherical obstacles of F = 0 defined by
(3.4)
(0.15 < R < 0.15 + w) \ ((r < 0.05) ∩ (z < 0));
(0.25 < R < 0.25 + w) \ ((r < 0.10) ∩ (z > 0));
(0.35 < R < 0.35 + w) \ ((r < 0.10) ∩ (z < 0));
(0.45 < R < 0.45 + w) \ ((r < 0.10) ∩ (z > 0)),
where \ represents the set difference operation, w = 124 , R =
√
x2 + y2 + z2, and
r =
√
x2 + y2. This case was derived from [3].
Four grids of uniform spacing ∆h in all three directions were used and the number
of points in each direction (excluding the ghost points) was dimGrid = 128, 256, 512,
and 1024, respectively. Eight block sizes were tested, in which the number of grid
points in each direction of the block was dimBlock = 8, 16, 32, 64, 128, 256, 512, and
1024, respectively. Therefore, the partition in each direction is the same for every
configuration and is simply labelled as mBlocks here. The total number of blocks
for the whole domain is nBlocks = mBlocks3. Unlike the stationary domain decom-
position for distributed-memory parallelization in [15], here the number of blocks is
not restricted by the number of threads, nThreads, used for the computation as long
as nThreads ≤ nBlocks. Hence, the block size dimBlock can be considered as a
free parameter with a range of [1, dimGrid] in this study. On the other hand, as a
restarted narrow band approach, stride introduced in [15] is still a free parameter
in this work. And the selection of the stride size δs also follows the test spectrum in
[15], i.e., δs = 0.5∆h, 1.5∆h, 2.0∆h, 2.5∆h, 3.5∆h, and ∞.
All computations were performed on two desktop computers. The first computer
is equipped with a 16-core/32-thread AMD Ryzen Threadripper 1950X 3.4 GHz pro-
cessor and 64 GBytes of DDR4 memory. The second computer is equipped with a
64-core/256-thread Intel Xeon Phi 7210 1.3 GHz processor and 96 GBytes of DDR4
memory. In this study, the algorithm was implemented in Fortran 2003. The code was
12 J. YANG
compiled in double precision using the Intel Fortran Compiler XE version 18.0.0 with
the -O3 optimization option. Compiling options -qopenmp and -qopenmp-stubs ware
used to generate the multi-threaded parallel program and the sequential program,
respectively. The sequential algorithm described in Sec. 2.1 with the modifications in
Sec. 2.2 was considered as the baseline algorithm for comparison.
In summary, four example problems were carried out on four grids. The grids
were decomposed using eight different block sizes as applicable. Three solvers were
applied to these test cases: (a) baseline (B) solver, which can treat the whole domain
only as a single block with a single process; (b) block-based sequential (S) solver,
which can handle a decomposed domain of different block sizes using a single process;
and (c) block-based parallel (P) solver, which can handle a decomposed domain of
different block sizes using multiple threads via OpenMP directives. The latter two
are also labeled as M because both of them can handle multiple blocks.
128 256 512 1024 2048
dimGrid
10-16
10-15
10-14
10-13
R
el
at
iv
e 
D
iff
er
en
ce
Case 1
Case 2
Case 3
Case 4
Line
ar
Fig. 2. The maximum relative differences between the multi-block and single-block solutions.
3.2. Accuracy. A comparative study was first performed to verify that the
block-based algorithms give the correct solutions. For each problem, a benchmark
solution (ψB) can be obtained on each grid using the baseline sequential solver. The
maximum relative difference between this solution and the solutions from the other
two solvers (ψM ) is defined as ψrd = maxi,j,k |ψBi,j,k − ψMi,j,k|/ψBi,j,k. Note that the
solutions using a single block from all three solvers are exactly the same. Therefore,
the maximum value of ψrd from all multi-block solutions on the same grid is obtained.
As shown in Fig. 2, it is evident that for all cases the differences are within the order
of machine accuracy for double precision floating point calculations. Also, as the grid
refines, the relative error grows following a linear or sub-linear rate. Both phenomena
are consistent with what observed in [15].
3.3. Overheads. In addition to the floating point differences in the solutions,
another aspect of the multi-block solvers can be checked is the algorithm overheads
extra to the baseline solver. Unlike the distributed-memory parallel algorithm in
[15], in which all functions extra to the marching step were called and executed, here
in a single-block setting Algorithm 2.5 won’t even be called in the two multi-block
solvers. Therefore, the overheads are mainly from setting up loops in Algorithm 2.6
and if checks in Algorithms 2.4 and 2.2. Of course, using OpenMP threading also
involves additional overheads such as OpenMP library startup, thread startup, loop
scheduling, etc. Especially, our current OpenMP parallelization was implemented
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 13
0.
8
1.
2
CP
U
 T
im
e 
(se
c)
Case 1
10
20
CP
U
 T
im
e 
(se
c)
10
0
20
0
CP
U
 T
im
e 
(se
c)
0.
5
1.
5
2.
0
2.
5
3.
5 ∞ 0.
5
1.
5
2.
0
2.
5
3.
5 ∞
δs (∆h)
15
00
25
00
CP
U
 T
im
e 
(se
c)
Case 2
0.
5
1.
5
2.
0
2.
5
3.
5 ∞ 0.
5
1.
5
2.
0
2.
5
3.
5 ∞
δs (∆h)
Case 3
0.
5
1.
5
2.
0
2.
5
3.
5 ∞ 0.
5
1.
5
2.
0
2.
5
3.
5 ∞
δs (∆h)
Case 4
0.
5
1.
5
2.
0
2.
5
3.
5 ∞ 0.
5
1.
5
2.
0
2.
5
3.
5 ∞
δs (∆h)
dimGrid = 128
dimGrid = 256
dimGrid = 512
dimGrid = 1024
Fig. 3. Single-block runs on AMD Ryzen Threadripper: the CPU time as a function of the
stride size δs and the grid size dimGrid. The horizontal orange lines represent TB from the baseline
sequential solver. The left part (blank background) and right part (grey-shaded background) show TS
and TP from the multi-block sequential and parallel solvers, respectively.
5
6
7
8
CP
U
 T
im
e 
(10
0  
se
c)
Case 1
5
6
7
8
CP
U
 T
im
e 
(10
1  
se
c)
6
7
8
9
CP
U
 T
im
e 
(10
2  
se
c)
0.
5
1.
5
2.
0
2.
5
3.
5 ∞ 0.
5
1.
5
2.
0
2.
5
3.
5 ∞
δs (∆h)
6
7
8
9
CP
U
 T
im
e 
(10
3  
se
c)
Case 2
0.
5
1.
5
2.
0
2.
5
3.
5 ∞ 0.
5
1.
5
2.
0
2.
5
3.
5 ∞
δs (∆h)
Case 3
0.
5
1.
5
2.
0
2.
5
3.
5 ∞ 0.
5
1.
5
2.
0
2.
5
3.
5 ∞
δs (∆h)
Case 4
0.
5
1.
5
2.
0
2.
5
3.
5 ∞ 0.
5
1.
5
2.
0
2.
5
3.
5 ∞
δs (∆h)
dimGrid = 128
dimGrid = 256
dimGrid = 512
dimGrid = 1024
Fig. 4. Single-block runs on Intel Xeon Phi: the CPU time as a function of the stride size δs
and the grid size dimGrid. The horizontal orange lines represent TB from the baseline sequential
solver. The left part (blank background) and right part (grey-shaded background) show TS and TP
from the multi-block sequential and parallel solvers, respectively.
by tackling three separate loop regions (including one initialization loop) using the
OpenMP parallel directive, so the thread startup overhead is always there in each
restart. However, as discussed earlier, there is no race condition at all in our current
parallel algorithm, hence it does not involve lock management, which usually is the
major performance penalty in thread parallelization.
Figures 3 and 4 show the CPU times from the single-block runs of three solvers
on the AMD Ryzen Threadripper and Intel Xeon Phi computers, respectively. In
general, the former is a few times faster than the latter (e.g., more than 7 times for
small dimGrid, and less than 4 times for large dimGrid), but the overall trends of
14 J. YANG
the results from the two computers are very similar. On the former, the CPU time
TS can show up to 5% overhead with the coarsest grid and less than 1% overhead
with the finest grid, whereas on the latter, TS generally gives an overhead less than
1%. With the parallel solver, TP shows roughly around 1% extra overhead above TS
due to the extra OpenMP threading overheads discussed above. On the other hand,
case 2 is the most expensive one because of its fast varying speed function, case 3 is
slightly less expensive than case 2 as its speed function is smoother than that of case
2, and case 4 requires the least CPU time since the quadratic equation was not solved
on the grid points within the barriers.
0.5 1 2 4
Stride Size (∆h)
101
102
103
104
N
um
be
r o
f R
es
ta
rts
Case 1
0.5 1 2 4
Stride Size (∆h)
Case 2
0.5 1 2 4
Stride Size (∆h)
Case 3
0.5 1 2 4
Stride Size (∆h)
dimGrid = 128
dimGrid = 256
dimGrid = 512
dimGrid = 1024
Case 4
Fig. 5. The number of restarts nr as a function of the stride size δs (from δs = 0.5∆h to
δs = 3.5∆h) and the grid size dimGrid.
3.4. Restarts. As shown in Figure 5, the number of restarts nr is clearly a linear
function of the stride size δs and the grid size dimGrid for all test cases. A major
difference between the results from the current block-based algorithm and those from
the parallel algorithm in [15] is that the number of restarts shows no dependence on
the domain decomposition configurations here. That is, for a given stride size on a
given grid, no matter what is the number of partitions mBlocks, both multi-block
solvers gave exactly the same number of restarts and there is no need to specify
mBlocks in the figure. As an explanation, the narrow band bound for each restart in
the current algorithm is preset and evenly distributed. Therefore, unlike the original
restart narrow band approach, it does not rely on the intermediate solution at all.
For the extreme case δs = ∞, Figure 6 shows the number of restarts nr as a
function of the grid size dimGrid and the number of partitions mBlocks. It is evident
that there is a very weak dependence of nr on the grid size as the data from coarser
grids almost collapse on those from the finer grids. In addition, the number of restarts
asymptotically approaches a linear relationship with mBlocks for all four problems.
Although the choice of δs =∞ is generally not recommended in practical applications
of the current algorithm, the results here do verify that, unlike iterative methods,
our new algorithm retains the non-iterative property of the original sequential fast
marching method and it does not reply on any user-specified convergence criteria.
3.5. Sequential performance. Figures 7 and 8 show the CPU time TS as a
function of the block size dimBlock, the stride size δs, and the grid size dimGrid for
the multi-block sequential solver on the AMD Ryzen Threadripper and the Intel Xeon
Phi computers, respectively. A striking finding from the results is the much improved
performance on the decomposed domain, especially, around dimBlock = 64 on finer
grids. Only Case 2 on the coarsest grid, dimGrid = 128, didn’t benefit from the
domain decomposition. The results from both computers follow a very similar trend.
In general, the improvements become more pronounced and the optimal dimBlock
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 15
1 2 4 8 16 32 64 128
mBlocks
1
4
16
64
256
1024
N
um
be
r o
f R
es
ta
rts
Case 1
1 2 4 8 16 32 64 128
mBlocks
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
Case 2
1 2 4 8 16 32 64 128
mBlocks
Case 3
1 2 4 8 16 32 64 128
mBlocks
Case 4
Fig. 6. The number of restarts nr as a function of the number of partitions mBlocks and the
grid size dimGrid for the stride size δs =∞.
converges to 64 for all four cases as the grid refines. Interestingly, the choice of
δs = ∞ gave the best performance for Case 1, which has a simple speed function
of F = 1, but failed to compete with other values of δs for other cases, except the
mBlocks = 2 configuration in Case 4 due to the special arrangement of the obstacles
in the domain. This is a clear indication of the great merit of our restarted narrow
band approach in general situations. On the other hand, the choice of δs = 0.5∆h
performed better than others on grids dimGrid = 128 and 256 for Case 2. Apparently
more frequent data exchange among neighboring blocks works better for the situation
of a fast changing speed function on a coarse grid. The results are not very sensitive
to the difference in δs around δs = 2∆h. But similarly, an increased frequency of
data exchange (i.e., a decreased stride size from δs = 3.5∆h to 1.5∆h) gives a better
performance for a smaller dimBlock.
The corresponding speedups of the multi-block sequential runs calculated using
TB/TS are shown in Figures 9 and 10. The extreme option of δs = ∞ gave a very
high speedup for Case 1, but performed poorly in other cases. As discussed above, it
should not be considered as a general choice for our restarted narrow band approach
in practical applications, hence it won’t be included in the discussions hereafter except
otherwise mentioned. On the AMD Ryzen Threadripper computer, the computation
of Case 1 was only slightly accelerated on the grid dimGrid = 128, but it gave a
speedup more than 2, 3, and 5 on the grid dimGrid = 256, 512, and 1024, respec-
tively. The other three cases have more complex speed functions, and the highest
speedups are relatively lower, e.g., around 3.5 on the finest grid. On the Intel Xeon
Phi computer, the differences in performance among these cases are not so significant,
although the speedup is above 3 for Case 1 and around 2.5 for other three cases on the
finest grid. In general, the performance of the multi-block sequential solver on two
different computers is very consistent. Computations with dimBlock = 8 were hardly
accelerated on coarser grids, e.g., dimGrid = 128, but those with dimBlock = 16 were
mostly sped up. The optimal block size is around dimBlock = 64 and a stride size
around 2∆h seems to be a feasible choice for different situations. The latter is also
consistent with the finding from [15], although non-overlapping domain decomposition
is used in the current work.
3.6. Parallel performance. Although our multi-block sequential solver con-
siderably outperforms the baseline sequential fast marching solver as shown above,
it is not the ultimate goal of current study when multi-core and many-core com-
puters become ubiquitous in today’s technical computing environment. In this part,
the results from the multi-block parallel solver on two such computers are presented
and discussed. The CPUs on both computers support Simultaneous Multi-Threading
16 J. YANG
81632641282565121024
dimBlock
10-1
100
101
102
103
104
CP
U
 T
im
e 
(se
c)
Case 1
81632641282565121024
dimBlock
Case 2
81632641282565121024
dimBlock
Case 3
81632641282565121024
dimBlock
Case 4
δs = 0.5∆h
δs = 1.5∆h
δs = 2.0∆h
δs = 2.5∆h
δs = 3.5∆h
δs = ∞
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
Fig. 7. Multi-block sequential runs on AMD Ryzen Threadripper: the CPU time TS as a
function of the block size dimBlock, the stride size δs, and the grid size dimGrid.
81632641282565121024
dimBlock
100
101
102
103
104
105
CP
U
 T
im
e 
(se
c)
Case 1
81632641282565121024
dimBlock
Case 2
81632641282565121024
dimBlock
Case 3
81632641282565121024
dimBlock
Case 4
δs = 0.5∆h
δs = 1.5∆h
δs = 2.0∆h
δs = 2.5∆h
δs = 3.5∆h
δs = ∞
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
Fig. 8. Multi-block sequential runs on Intel Xeon Phi: the CPU time TS as a function of the
block size dimBlock, the stride size δs, and the grid size dimGrid.
(SMT) technology, i.e., one physical CPU core is capable of running two or more
threads simultaneously, for improved parallelization. In particular, the Xeon Phi
many-core processor supports Intel’s Hyper-Threading technology and is capable of
running 256 threads with 64 physical CPU cores. Solving the Eikonal equation with
hundreds of threads represents a shared-memory parallel computing of a significantly
larger scale than previously demonstrated in the literature. To distinguish the results
of parallel computations employing SMT from those using one thread per physical
core, the former are marked with a grey-shaded background in the figures, i.e., the
results obtained from 64 threads on the AMD Ryzen Threadripper computer and
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 17
2
4
6
Sp
ee
du
p
Case 1
1
2
3
4
Sp
ee
du
p
1
2
3
Sp
ee
du
p
81632641282565121024
dimBlock
0.5
1.0
1.5
Sp
ee
du
p
Case 2
81632641282565121024
dimBlock
Case 3
81632641282565121024
dimBlock
Case 4
81632641282565121024
dimBlock
δs = 0.5∆h
δs = 1.5∆h
δs = 2.0∆h
δs = 2.5∆h
δs = 3.5∆h
δs = ∞
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
Fig. 9. Multi-block sequential runs on AMD Ryzen Threadripper: the sequential speedup as
functions of the block size dimBlock, the stride size δs, and the grid size dimGrid.
1
2
3
Sp
ee
du
p
Case 1
1
2
3
Sp
ee
du
p
0.5
1.0
1.5
2.0
Sp
ee
du
p
81632641282565121024
dimBlock
0.5
1.0
1.5
Sp
ee
du
p
Case 2
81632641282565121024
dimBlock
Case 3
81632641282565121024
dimBlock
Case 4
81632641282565121024
dimBlock
δs = 0.5∆h
δs = 1.5∆h
δs = 2.0∆h
δs = 2.5∆h
δs = 3.5∆h
δs = ∞
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
Fig. 10. Multi-block sequential runs on Intel Xeon Phi: the sequential speedup as functions of
the block size dimBlock, the stride size δs, and the grid size dimGrid.
those from 128 and 256 threads on the Intel Xeon Phi computer.
Figures 11 and 12 present the CPU time TP as functions of the number of threads
nthreads, the stride size δs, the block size dimBlock, and the grid size dimGrid on
these two computers, respectively. A significant observation from these results is that
all cases, regardless of the speed function, the grid size, and the choices of the two
algorithm parameters, were greatly accelerated along with the increase of nthreads
from 1 to the number of physical CPU cores on both computers. The only common
exception seems to be the runs with nBlocks = 8 for Case 4, which did not gain
much acceleration when nthreads increased from 4 to 8, apparently due to the special
18 J. YANG
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
δs = 0.5∆h
δs = 1.5∆h
δs = 2.0∆h
δs = 2.5∆h
δs = 3.5∆h
δs = ∞
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
10-2
10-1
100
101
102
103
104
CP
U
 T
im
e 
(se
c)
dimBlock = 512
Case 1
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
10-1
100
101
102
103
104
CP
U
 T
im
e 
(se
c)
dimBlock = 512
Case 2
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
10-1
100
101
102
103
104
CP
U
 T
im
e 
(se
c)
dimBlock = 512
Case 3
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
10-1
100
101
102
103
104
CP
U
 T
im
e 
(se
c)
dimBlock = 512
Case 4
Fig. 11. Multi-block parallel runs on AMD Ryzen Threadripper: the CPU time TP as functions
of the number of threads nthreads, the stride size δs, the block size dimBlock, and the grid size
dimGrid.
arrangement of the obstacles in the domain. The unfavorable effects of the obstacles
in Case 4 on the parallel performance also showed up for runs with nBlocks = 64
when nthreads increased to 32 and 64 on the Intel Xeon Phi computer, whereas
those on the AMD Ryzen Threadripper computer were not similarly affected. With
nthreads = nBlocks, most of these runs somehow resemble the stationary domain
decomposition in [15], in which the load imbalance among different subdomains was
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 19
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
δs = 0.5∆h
δs = 1.5∆h
δs = 2.0∆h
δs = 2.5∆h
δs = 3.5∆h
δs = ∞
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
10-2
10-1
100
101
102
103
104
CP
U
 T
im
e 
(se
c)
dimBlock = 512
Case 1
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
10-1
100
101
102
103
104
105
CP
U
 T
im
e 
(se
c)
dimBlock = 512
Case 2
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
10-1
100
101
102
103
104
105
CP
U
 T
im
e 
(se
c)
dimBlock = 512
Case 3
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
10-1
100
101
102
103
104
105
CP
U
 T
im
e 
(se
c)
dimBlock = 512
Case 4
Fig. 12. Multi-block parallel runs on Intel Xeon Phi: the CPU time TP as functions of the
number of threads nthreads, the stride size δs, the block size dimBlock, and the grid size dimGrid.
hardly mitigated. On the other hand, the overall parallel scalability demonstrated
here is truly remarkable, as the CPU time decreased almost linearly along with the
increase of the number of threads for most cases, even including δs =∞ and 0.5∆h.
For the parallel runs, the effects of the stride size barely show any difference from
those in the sequential runs as the number of threads increases. A value around
2∆h for δs remains the rule-of-thumb choice for achieving a superior performance.
However, there does exist a relative trend, especially for runs with dimBlock ≤ 32,
20 J. YANG
that the performance of a smaller δs declines (e.g., δs = 0.5∆h) and that of a larger
δs (e.g., δs = ∞) improves as nthreads increases. In general, it is preferred to have
the optimal values of a free parameter restricted within a stable and narrow range.
The fact that the stride size here is such a free parameter confirms the robustness and
applicability of our restarted narrow band approach.
The other free parameter in our current algorithm is the block size dimBlock.
In general, the CPU time TP showed a trend of almost linear decrease along with
the increase of nthreads from 1 to the number of physical CPU cores, except some
runs with a limited number of blocks because of mBlocks = 2 (and mBlocks = 4
on the Intel Xeon Phi computer) for Case 4. In a multi-thread setting, the optimal
block size seems to shift down from dimBlock = 64 for the sequential runs towards
a value around dimBlock = 32. For coarser grids, the choice of dimBlock = 16
gave slightly better results than dimBlock = 32 for some runs. On the other hand,
dimBlock = 64 provided the best performance on finer grids for some cases on the
AMD Ryzen Threadripper computer. Therefore, similar to the situation of the stride
size discussed above, the block size dimBlock in our block-based algorithm has a
stable narrow range for achieving the best performance. The rule-of-thumb choice is
a value around dimBlock = 32, and it can be shifted up for a very fine grid or down for
a coarse one to obtain a comparable or moderately improved performance. Another
preferred feature that can be observed in the results is the apparent independence
between the block size and the stride size. As discussed above, some results with
δs = 0.5∆h showed a weak reliance on the block size for runs with dimBlock = 8 and
16. But again, a very small stride size such as δs = 0.5∆h is not recommended to be
used in practical applications of the restarted narrow band approach anyway.
The results also demonstrated the effects of SMT on the parallel performance.
It is evident that on both computers SMT barely offered any improvements when
dimBlock ≥ 128 except a few runs on the finest grids. With smaller block sizes on the
AMD Ryzen Threadripper computer, however, the CPU time continued to decrease
when the number of threads on each physical core increased from one to two. The
SMT acceleration on finer grids was actually rather prominent, which illustrates the
unambiguous benefits of SMT on this computer with our algorithm. On the other
hand, running the multi-thread solver with SMT (or hyper-threading) activated on the
Intel Xeon Phi computer showed disappointing, mostly negative effects on the parallel
performance. Except for a few runs (e.g., dimBlock = 32 on grid dimGrid = 1024
for Case 1 and Case 2), four threads on each physical CPU core, for a total of 256
threads, generally required more time to complete the computations compared with
the other scenarios (one or two threads per core). The runs with 128 threads through
two threads per core showed moderate gains on finer grids in most cases except Case
4. Surprisingly, the runs with δs = ∞ seem to benefit from hyper-threading on this
computer, especially for dimBlock = 32 and 16. It should be noted that, apart from
differences between the two computers, the load imbalance also played a larger role
here for the less encouraging SMT results compared with those on the AMD Ryzen
Threadripper computer: e.g., for mBlocks = 8 the total number of blocks would
be 512, then with nthreads = 128 or 256 each thread would be handling only 4 or
merely 2 blocks! And this is the scenario for dimBlock = 128 on grid dimGrid = 1024,
dimBlock = 64 on grid dimGrid = 512, dimBlock = 32 on grid dimGrid = 256, or
dimBlock = 16 on grid dimGrid = 128.
Figures 13 and 14 show the corresponding parallel speedups on both computers.
Here the absolute speedup is used, which is defined as S = TP /TB with TB the CPU
time obtained from the single-block sequential solver. It is also possible to use the
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 21
relative speedup defined using the CPU time in this part with nthreads = 1 as the
reference, but in that case it could not be a single value as the CPU time of the multi-
block solver depends on the choices of dimBlock and δs. Since the multi-block solvers
can be several times faster than the single-block solver as shown in the previous part,
it is interesting to notice that most curves in these two figures start from points much
higher than one at nthreads = 1. Moreover, because the CPU time and the number
of threads are almost inversely proportional starting from nthreads = 1 in most
cases, the combined effect is that the maximum speedup can be much higher than
the number of threads. For instance, on the AMD Ryzen Threadripper computer,
the maximum speedup with 32 threads on 16 cores is more than 80 for Case 1 (up to
120 with δs =∞), more than 60 for Case 2, more than 59 for Case 3, and more than
47 for Case 4, all on the finest grid. On the Intel Xeon Phi computer, the maximum
speedup with up to 256 threads on 64 cores is more than 220 (up to 291 with δs =∞)
for Case 1, up to 190 for Case 2, up to 157 for Case 3, and up to 103 for Case 4,
also all on the finest grid. These data showing the super-linear speedup phenomena
are highly remarkable for a parallel algorithm of the fast marching method, which
was long considered inherently sequential. Furthermore, the parallel performance on
coarser grids is also very impressive, with a maximum speedup surpassing or close to
the number of cores in most tests.
The corresponding parallel efficiency, which is defined as E = S/nthreads, is
shown in Figures 15 and 16 for tests on both computers. The comprehensive super-
linear speedup phenomena of our parallel algorithm, shows as parallel efficiencies
higher than E = 1 in the figures, become even easier to recognize than those in the
speedup plots above. It is also evident in the figures that, in terms of free parameters,
dimBlock = 32 and δs = 2∆h are the overall optima for giving consistently high
parallel efficiencies for vastly different speed functions and grid sizes. In addition, we
can barely recognize any interdependence between the block size and the stride size,
if neglecting the extreme cases δs =∞ and 0.5∆h.
4. Discussions. Our algorithm generally shows a lower parallel efficiency on the
Intel Xeon Phi computer than that on the AMD Ryzen Threadripper computer. The
AMD Ryzen Threadripper 1950X processor has 8 MB L2 cache (512 KB per core) and
32 MB L3 cache, whereas the Intel Xeon Phi 7210 has 32 MB L2 cache (two cores share
1 MB L2 cache). One particular feature of the Intel Xeon Phi processor is that it has
no L3 cache, but has 16GB on-package high-bandwidth memory called MultiChannel
Dynamic Random Access Memory (MCDRAM), which can provide a total memory
bandwidth of five times of the off-package DDR (Double Data Rate) memory (up to
450 GB/sec vs. up to 90 GB/sec), but with a similar idle latency (approximately
150 ns vs. approximately 125 ns). Different programming modes are available for
the MCDRAM and DDR memory. Here we chose the default modes, i.e., the Cache
Mode for the former (100% MCDRAM as cache) and the quadrant cluster mode for
the latter. It is well-known that a binary heap is not a cache friendly data structure
due to the difficulties of maintaining the locality of references in heap operations. In
this work, a simple 3D array was used to store the function value, status tag, and
heap position of each grid point in each subdomain, thus no particular optimizations
were implemented to improve heap performance. Apparently the cache miss penalties
of using MCDRAM (although in Cache Model) were much higher than those of a
real L3 cache due to the much higher latency of the former. This also explains why
using hyper-threading with two or four threads per core hardly boosted the parallel
performance as shown earlier. Nonetheless, the superior performance of our algorithm
22 J. YANG
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
δs = 0.5∆h
δs = 1.5∆h
δs = 2.0∆h
δs = 2.5∆h
δs = 3.5∆h
δs = ∞
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
0.5
1
2
4
8
16
32
64
128
Pa
ra
lle
l S
pe
ed
up
dimBlock = 512
Case 1
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
.25
.5
1
2
4
8
16
32
64
Pa
ra
lle
l S
pe
ed
up
dimBlock = 512
Case 2
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
.5
1
2
4
8
16
32
64
Pa
ra
lle
l S
pe
ed
up
dimBlock = 512
Case 3
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
.25
.5
1
2
4
8
16
32
64
Pa
ra
lle
l S
pe
ed
up
dimBlock = 512
Case 4
Fig. 13. Multi-block parallel runs on AMD Ryzen Threadripper: the parallel speedup as func-
tions of the number of threads nthreads, the stride size δs, the block size dimBlock, and the grid
size dimGrid.
was still perfectly demonstrated on this 64-core computer by the extraordinary parallel
speedup of two orders of magnitude for widely varying speed functions.
A very attractive characteristic of the fast marching method is its robustness
in dealing with complex speed functions, and this feature is retained in our current
algorithm. Of course, the CPU time still increases as the range of variations in the
speed function increases. But the variations of CPU time generally are limited within
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 23
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
δs = 0.5∆h
δs = 1.5∆h
δs = 2.0∆h
δs = 2.5∆h
δs = 3.5∆h
δs = ∞
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
1
2
4
8
16
32
64
128
256
512
Pa
ra
lle
l S
pe
ed
up
dimBlock = 512
Case 1
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
.25
.5
1
2
4
8
16
32
64
128
256
Pa
ra
lle
l S
pe
ed
up
dimBlock = 512
Case 2
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
.5
1
2
4
8
16
32
64
128
256
Pa
ra
lle
l S
pe
ed
up
dimBlock = 512
Case 3
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
.25
.5
1
2
4
8
16
32
64
128
Pa
ra
lle
l S
pe
ed
up
dimBlock = 512
Case 4
Fig. 14. Multi-block parallel runs on Intel Xeon Phi: the parallel speedups as function of the
number of threads nthreads, the stride size δs, the block size dimBlock, and the grid size dimGrid.
a very narrow range, as shown in the results of different speed functions on the same
grid. On the other hand, in the last case, the front propagates through narrow curved
passages between multiple nonpenetrable barriers (F = 0), which could present a
major challenge to many iterative methods in terms of the overall efficiency.
Another important feature of our new algorithm is its resemblance to the original
sequential fast marching method. In the block-based algorithm, the data structures
for each block exactly follow the original single-block sequential version, except a
24 J. YANG
few global arrays for storing the activation statuses of each block in the marching
and exchange steps. It should be noted that the block activation mechanisms do
create two lists of blocks in Algorithm 2.6. But unlike the ordered lists in some other
approaches [14, 2], the elements in the lists do not possess any type of priority at all
in our algorithm and they are handled by different threads in orders defined by the
OpenMP loop scheduling mechanism. The main change is the addition of Algorithm
2.5, which actually shares the same major operations with Algorithm 2.2.
The restarted narrow band approach in the current study still closely follows the
original one proposed in [15], although with several simplifications. Depending on the
speed function and the choice of grid size, domain decomposition configuration, and
stride size, there are a various amount of refreshing computations in a multi-block run
that are extra to a single-block run. In the multi-block sequential results examined
earlier, since only one CPU core was involved, the CPU time differences among blocks
(i.e., load imbalance for parallel computing) do not affect the overall performance.
Therefore, the main contributor to the remarkable acceleration shown there is the
reduced heap size, as each block has its own heap. In terms of algorithm complexity,
the worst-case scenario isO(N logN) for a total ofN grid points. Now in a multi-block
setting with a total of nBlocks, we have nBlocks×( N
nBlocks
log N
nBlocks
)
= N log N
nBlocks
.
The savings of heap operations can be substantial for a large nblocks. Of course,
because each block is padded with ghost cells, the memory overhead extra to a single-
block run grows with nblocks too. Nonetheless, the case with many short heaps
apparently performs better than the one with one single long heap in most situations.
It should be noted that sometimes the overhead of refreshing computations due to
a complex speed function can be significant, as shown in the multi-block sequential
runs on grid dimGrid = 128 for Case 2. The effectiveness of our restarted narrow
band strategy in reducing excessive refreshing computations can be easily grasped by
comparing the results from a stride size around δs = 2∆h and those of δs = ∞. Of
course, sometimes δs =∞ may give a better performance, when the interdependence
between blocks is weak or simple, as shown in some results for Case 1 (F = 1). But
when a complex speed function results in a strong interdependence between blocks,
such as in Case 2, the importance of the restarted narrow band approach becomes
evident.
In the multi-thread runs, the tests with nthreads = nblocks encountered the
condition similar to that in [15] with a one-to-one relationship between threads and
blocks. In such circumstances, computations on some threads may become intermit-
tent and the load imbalance among threads cannot be avoided except for a perfectly
divided load like nblocks = 8 for Case 1. Of course, nthreads = nblocks is a unique
scenario, but it is conceivable that the block activation mechanisms introduced in this
study won’t work as expected if the number of blocks is equal or close to the num-
ber of threads. In general, our current approach relies on an adequately decomposed
domain to ensure nblocks  nthreads, such that the block activation mechanisms
can kick in meaningfully to skip blocks not engaged with the front. This is verified
by the comprehensive superior performance on finer grids with the same block size.
On the other hand, as mblocks grows, the memory overhead for padding ghost cells
also grows fast, as shown by the prevailing performance descent shown in the results
when the block size was reduced to dimBlock = 8. Fortunately, our results have
shown that the choice of dimBlock = 32 reasonably balances the trade-off between
load imbalance and memory overhead. It is interesting to note that as the number
of threads increases, the optimal block size seems to shift from 64 toward 32. This
could possibly be explained by the smaller heap sizes and more balanced loads with
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 25
dimBlock = 32 in spite of the memory overhead increase.
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
δs = 0.5∆h
δs = 1.5∆h
δs = 2.0∆h
δs = 2.5∆h
δs = 3.5∆h
δs = ∞
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
0
1
2
3
4
5
6
Pa
ra
lle
l E
ffi
ci
en
cy
dimBlock = 512
Case 1
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
0
0.5
1
1.5
2
2.5
3
3.5
4
Pa
ra
lle
l E
ffi
ci
en
cy
dimBlock = 512
Case 2
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
0
0.5
1
1.5
2
2.5
3
3.5
4
Pa
ra
lle
l E
ffi
ci
en
cy
dimBlock = 512
Case 3
1 2 4 8 16 32
nThreads
dimBlock = 8
1 2 4 8 16 32
nThreads
dimBlock = 16
1 2 4 8 16 32
nThreads
dimBlock = 32
1 2 4 8 16 32
nThreads
dimBlock = 64
1 2 4 8 16 32
nThreads
dimBlock = 128
1 2 4 8 16 32
nThreads
dimBlock = 256
1 2 4 8 16 32
nThreads
0
0.5
1
1.5
2
2.5
3
3.5
Pa
ra
lle
l E
ffi
ci
en
cy
dimBlock = 512
Case 4
Fig. 15. Multi-block parallel runs on AMD Ryzen Threadripper: the parallel efficiency as
functions of the number of threads nthreads, the stride size δs, the block size dimBlock, and the grid
size dimGrid.
5. Conclusions. A block-based fast marching method with superior sequential
and parallel peformance has been developed in this paper. The computational do-
main is decomposed into non-overlapping blocks that are padded with ghost cells. The
original sequential fast marching algorithm is essentially run on each block. An ex-
changing step is added to update ghost cells using data from neighboring blocks, such
26 J. YANG
dimGrid = 1024
dimGrid = 512
dimGrid = 256
dimGrid = 128
δs = 0.5∆h
δs = 1.5∆h
δs = 2.0∆h
δs = 2.5∆h
δs = 3.5∆h
δs = ∞
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
0
1
2
3
4
Pa
ra
lle
l E
ffi
ci
en
cy
dimBlock = 512
Case 1
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
0
1
2
3
Pa
ra
lle
l E
ffi
ci
en
cy
dimBlock = 512
Case 2
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
0
1
2
Pa
ra
lle
l E
ffi
ci
en
cy
dimBlock = 512
Case 3
1 4 16 64 256
nThreads
dimBlock = 8
1 4 16 64 256
nThreads
dimBlock = 16
1 4 16 64 256
nThreads
dimBlock = 32
1 4 16 64 256
nThreads
dimBlock = 64
1 4 16 64 256
nThreads
dimBlock = 128
1 4 16 64 256
nThreads
dimBlock = 256
1 4 16 64 256
nThreads
0
1
2
Pa
ra
lle
l E
ffi
ci
en
cy
dimBlock = 512
Case 4
Fig. 16. Multi-block parallel runs on Intel Xeon Phi: the parallel efficiency as functions of the
number of threads nthreads, the stride size δs, the block size dimBlock, and the grid size dimGrid.
that earlier computations can possibly be refreshed with any incoming front charac-
teristics. Both the marching and exchanging steps are placed in an infinite loop with
an exit criterion similar to that of the original sequential fast marching method, but
to be satisfied on all blocks. What distinguishes our block-based algorithm from a
na¨ıve low-performance implementation is the restarted narrow band approach, first
introduced in [15] and considerably simplified in this paper, with which the global
narrow band bound is replaced by an incremental one, increasing by a given stride
FAST MARCHING METHOD WITH SUPERIOR PERFORMANCE 27
in each restart. This strategy can greatly reduce excessive refreshing computations
through timely synchronized exchanges. In addition, depending on its value, an ac-
cepted point can be repeatedly updated later, and it may be excluded from the stencil
as an upwind point for updating a neighbor. Especially, a grid point initialized as a
boundary condition is separately labeled, such that it can be inserted into the heap
and treated like other points, but both its value and status won’t be possibly changed
like them. These minor modifications are fully consistent with the original sequen-
tial algorithm. In particular, simple block activation mechanisms are introduced for
both the marching and exchanging steps, so a block will only be included for run-
ning one or both steps when it is involved with the propagating front. Therefore,
such a multi-block algorithm mainly consists of two loops for carrying out the march-
ing and exchanging steps on two groups of blocks, which can be easily parallelized
using OpenMP on a shared-memory multi-core and/or many-core computer. Surpris-
ingly, the whole algorithm does not involve any data race conditions and is free of
OpenMP critical construct or lock synchronization. It is also interesting to note that
two advantageous features of the fast marching method are perfectly retained, i.e., the
monotonic increasing order of the solution and the narrow band formulation, which
makes our new algorithm an ideal benchmark parallel implementation of the original
sequential fast marching method. Detailed pseudo-codes are provided to illustrate the
modification procedure from the single-block sequential algorithm to the multi-block
one in a step-by-step manner.
Four point source problems with various speed functions typical in practical ap-
plications have been systematically tested on four grids ranging from 1283 to 10243
points on two desktop computers using up to 256 threads. Six stride sizes including
the extreme option of δs =∞ were examined and up to eight block sizes were evalu-
ated. The relative differences in the solutions of the new multi-block algoirthm have
been found to be within the order of machine accuracy from those of the baseline
sequential algorithm. Single-block runs have been performed on all grids with the
two multi-block solvers. The multi-block sequential solver showed an overhead of a
few percent, up to 5% on the coarsest grid, and its parallel counterpart showed an
additional overhead of about 1% in average. The number of restarts has been shown
to be directly proportional to the one-dimensional grid size and inversely proportional
to the stride size, thus is entirely different from the number of iterations in iterative
algorithms. Especially, the number of restarts for δs =∞ asymptotically approaches
the same linear relationship with the number of partitions in one dimension on all
grids, yet again verifying its non-iterative nature. Running with a single process,
our new algorithm enabled comprehensively improved performance for essentially all
tests compared with the baseline sequential algorithm, and generally the maximum
speedups (up to 5 on the finest grid) were obtained at a block size of 64. For the
multi-thread algorithm, substantial parallel speedups with all tests were observed,
especially at a block size of 32. In particular, the speedups were up to 50–80 on a
16-core/32-thread computer, and up to 100–220 on a 64-core/256-thread computer,
all on the finest grid. The optimal stride size for the multi-block algorithm was around
2∆h, same as what found in [15], and the performance was found to be very robust
with regard to the stride size. Furthermore, the block size and stride size, as the free
parameters in our new algorithm, barely showed any interdependence in both the se-
quential and parallel runs. The effects of other aspects such as computing platforms
and speed functions on the performance have also been discussed. The extensive
study in this paper has thoroughly demonstrated the striking efficiency, flexibility,
and applicability of the present algorithm.
28 J. YANG
Although beyond the scope of this work, it would be fairly straightforward to im-
plement many extensions such as higher-order upwind schemes, unstructured meshes,
anisotropic speeds, etc., in the framework of current block-based restarted narrow
band approach, because it still strictly follows the original sequential fast marching
method. It would also be interesting to conduct a side-by-side comparison of our al-
gorithm and other algorithms in the literature for solving the Eikonal equations with
regard to algorithm complexity and performance on shared-memory computers as a
future work.
REFERENCES
[1] M. Breuss, E. Cristiani, P. Gwosdek, and O. Vogel, An adaptive domain-decomposition
technique for parallelization of the fast marching method, Applied Mathematics and Com-
putation, 218 (2011), pp. 32–44.
[2] A. Chacon and A. Vladimirsky, A parallel two-scale method for eikonal equations, SIAM
Journal on Scientific Computing, 37 (2015), pp. A156–A180.
[3] M. Detrixhe, F. Gibou, and C. Min, A parallel fast sweeping method for the eikonal equation,
Journal of Computational Physics, 237 (2013), pp. 46–55.
[4] E. W. Dijkstra, A note on two problems in connexion with graphs, Numerische Mathematik,
1 (1959), pp. 269–271.
[5] T. Gillberg, A. M. Bruaset, Ø. Hjelle, and M. Sourouri, Parallel solutions of static
hamilton-jacobi equations for simulations of geological folds, Journal of Mathematics in
Industry, 4 (2014), pp. 1–31.
[6] M. Herrmann, A domain decomposition parallelization of the fast marching method, in Annual
Research Briefs, Center for Turbulence Research, Stanford University, Stanford, CA, 2003,
pp. 213–225.
[7] W.-K. Jeong and R. T. Whitaker, A fast iterative method for Eikonal equations, SIAM
Journal on Scientific Computing, 30 (2008), pp. 2512–2534.
[8] S. Kim, An O(N) level set method for Eikonal equations, SIAM Journal on Scientific Comput-
ing, 22 (2001), pp. 2178–2193.
[9] E. Rouy and A. Tourin, A viscosity solutions approach to shape-from-shading, SIAM Journal
on Numerical Analysis, 29 (1992), pp. 867–884.
[10] J. A. Sethian, A fast marching level set method for monotonically advancing fronts, Proceed-
ings of the National Academy of Sciences, 93 (1996), pp. 1591–1595.
[11] J. A. Sethian, Fast marching methods, SIAM Review, 41 (1999), pp. 199–235.
[12] J. A. Sethian, Level Set Methods and Fast Marching Methods: Evolving Interfaces in Compu-
tational Geometry, fluid Mechanics, Computer Vision, and Materials Science, Cambridge
University Press, Cambridge, 2nd ed., 1999.
[13] J. Tsitsiklis, Efficient algorithms for globally optimal trajectories, IEEE Transactions on Au-
tomatic Control, 40 (1995), pp. 1528–1538.
[14] O. Weber, Y. S. Devir, A. M. Bronstein, M. M. Bronstein, and R. Kimmel, Parallel
algorithms for approximation of distance maps on parametric surfaces, ACM Transactions
on Graphics, 27 (2008), pp. 1–16.
[15] J. Yang and F. Stern, A highly scalable massively parallel fast marching method for the
eikonal equation, Journal of Computational Physics, 332 (2017), pp. 333–362.
[16] L. Yatziv, A. Bartesaghi, and G. Sapiro, O(N) implementation of the fast marching algo-
rithm, Journal of Computational Physics, 212 (2006), pp. 393–399.
[17] H. Zhao, A fast sweeping method for Eikonal equations, Mathematics of Computation, 74
(2005), pp. 603–627.
[18] H. Zhao, Parallel implementations of the fast sweeping method., Journal of Computational
Mathematics, 25 (2007), pp. 421–429.
