Index Terms-Work-time optimal parallel algorithm, SIMD architecture, 1D EDT, 2D EDT, binary operations on GPU.
I. INTRODUCTION
G IVEN a binary image with n × n pixels each of them either white or black, Euclidean Distance Transform (EDT) computes the distance of each pixel to the nearest black pixel which is termed site or feature point. The Euclidean distance transform is useful for a variety of applications including image processing, computer vision, pattern recognition, shape analysis and computational geometry [1] - [3] . Obviously, the 2D EDT can be computed in O(n 4 ) time by an exhaustive brute-force searching algorithm. Many efficient algorithms have been proposed in the past. Breu et al. [4] proposed the first O(n 2 ) time (linear time) exact EDT algorithm based on Voronoi Diagram Intersections. The following two survey papers [5] , [6] compare and contrast many state-ofthe-art sequential approaches for solving the problem in 2D and 3D, mainly focusing on the computation of exact EDT.
Consider a parallel algorithm which solves a problem with input size n in T p (n) time using p processors. The amount of work W (n) performed by the algorithm is defined as the product p × T p (n). The algorithm is termed work-optimal if W (n) ∈ Q(T * (n)), where T * (n) is the running time of the fastest sequential algorithm for the problem. The parallel algorithm is termed work-time optimal [7] if it is work-optimal and, in addition, its running time T p (n) is shortest among all work-optimal algorithms under the same parallel model. A challenge of parallel algorithm design is to produce not only work-optimal but, indeed, whenever possible, work-time optimal algorithms. The modern massively parallel architecture [8] , [9] , which is composed of hierarchical memory and SIMD (Single Instruction Multiple Data) capability, creates the potential for algorithms to be transformed into efficient implementations. Especially, Graphics Processing Units (GPUs), enable solutions to problems impossible with previous computing approaches. When we design parallel algorithms for GPUs, we need to consider not only the complexity of the parallel algorithm itself but also the efficient utilization of the hierarchical memory and SIMD capability of the GPUs [10] , [11] . In this paper, we present a fully-parallelized work-time optimal algorithm for the 2D EDT, which can be implemented efficiently on modern SIMD architectures such as GPUs.
Early attempts for computing the EDT using graphics hardware include the work of Hoff et al. [12] . By rendering a right-angled cone for each feature pixel in the image, the approximation of the distance function is obtained. Then depth-testing graphics hardware is used to compute the distance map. Their method suffers from overdrawing and tessellation error. Sud et al. [13] proposed a method to use the bilinear interpolation equation to compute the distance vector on a polygon. Their method can compute highly accurate distance maps for complex models, but its complexity is dependent on the number of sites in the image. Therefore, it is not suitable for problems with many sites. Later, several efficient GPU-based algorithms which are either work optimal or time optimal have been proposed including JFA [14] , SKW [15] , FastGPU [16] , PBA [17] and Honda's algorithm [18] . However, none of these algorithms are worktime optimal.
In this paper, we present a fully-parallelized work-time optimal exact EDT algorithm which is suitable for implementation on modern SIMD architectures such as GPUs. Following the idea of PBA, the computation of the exact EDT is performed in a dimension reduction manner. The computation is done in one dimension (row wise) first, then in the second dimension (column wise). The algorithm consists of three steps. The first step of the algorithm runs in O(log 32 n) time and performs O(N) of total work on GPU. The second step performs O(N) of total work and has an expected time complexity of O(logn) on GPU. The third step runs in O(log 32 n) time and performs O(N) of total work on GPU. Clearly, this algorithm is a worktime optimal algorithm. As far as we know, this algorithm is the first fully-parallelized work-time optimal algorithm for the GPUs. Experiments demonstrate that this algorithm outperforms prior state-of-the-art GPU algorithms.
II. LITERATURE

A. Exact EDT
Many algorithms for the exact EDT have been proposed. Saito and Toriwaki [19] proposed a sequential algorithm based on dimensionality reduction, since the squared Euclidean distance value is separable, the distance transform can be computed along each principal direction. Recently, Torelli et al. [20] implemented Saito's algorithm on a cluster using MPI. However, Saito's algorithm has a time complexity of O(n 3 ) and it will result in a poor performance in practice. Meijster et al. [21] proposed a two-scan linear-time sequential algorithm which follows the same concepts as described in the Saito's algorithm. Breu et al. [4] also proposed a lineartime sequential algorithm with time complexity of O(n 2 ). Since each pixel of input image needs to be visited at least once, these linear-time algorithms are time optimal. Later, Maurer et al. [22] proposed a linear time sequential algorithm for a binary image in arbitrary dimension using dimensionality reduction strategy. Recently, Wang and Tan [23] proposed a method in which perpendicular bisector line is used to improve the locality of computation. They also extended it to an arbitrary dimension [24] . Lotufo and Zampirolli [25] showed a new way to compute the exact EDT based on grayscale mathematical morphology. In their algorithm, an erosion procedure is repeatedly applied to each image column until the column does not change.
To accelerate the computation of the EDT, various parallel algorithms have been proposed on different parallel models. The fastest parallel algorithm which runs in O(1) time using O(n 3 ) processors was proposed by Datta and Soundaralakshmi in [32] . Their algorithm was originally designed for Reconfigurable Meshes which require each processor capable of communicating with neighboring processors via interconnection buses. Because of the number of processors needed and the special requirement in the architecture, their method is not suited to modern SIMD processors. There are also many algorithms proposed for the PRAM (Parallel Random Access Machine) model. Lee et al. [33] use the dimensionality reduction approach to compute the exact EDT in O(log 2 n) time using O(N) processors on an EREW (Exclusive Read Exclusive Write) PRAM machine. Pavel and Akl [34] proposed an algorithm running in O(logn) time using O(N) processors on a EREW PRAM machine. However, these two algorithms are not work optimal. Considering the computation on each image row, Fujiwara et al. [35] To efficiently utilize hierarchical memory and SIMD capability of modern GPUs, a special SIMD-like programming paradigm has to be employed. However, the PRAM algorithms mentioned above are too complex to fit into such a paradigm. Therefore, there are few algorithms to compute the exact EDT on GPUs using either the sequential algorithms or the PRAM algorithms for exact EDT. Zampirolli and Filipe [29] proposed a GPU implementation of Lotufo's [25] mathematical morphological algorithm. However, in Lotufo's algorithm, an erosion procedure is repeatedly applied to each image column until the column does not change, and it is computationally inefficient. Later, Zampirolli and Filipe [16] proposed a rasterscan based GPU algorithm termed FastGPU algorithm which can efficiently utilize the hierarchical memory of GPUs. They show that the time complexity of the proposed GPU algorithm is O(n 3 / p) where p is the number of available processors. Man et al. [30] proposed a SIMD-like algorithm and implemented it on a Multi-core processor and a GPU, respectively. Since the computation on each image row is performed by single thread, their algorithm is with a higher time complexity of O(n).
Cao et al. [17] proposed an exact EDT algorithm for GPU termed Parallel Banding Algorithm (PBA). By solving several programming issues of GPU including synchronization cost, occupancy and the efficient utilization of texture cache, the PBA achieved a very good performance. As reported in their paper, the PBA outperforms most of the existing GPU algorithms. The main idea of PBA is to partition an image row into several bands and use a single thread to perform the computation on each band independently. Then merge the results of all bands to produce the final results. Their algorithm consists of three steps, in the first step, 1D EDT is computed along the band. In the second step, the bandwise computation is performed to obtain all closest sites for each image row. In the third step, the distance of each pixel to the closest site is computed using the results of step 2. However, as shown in their algorithm, they fail to fully parallelize the first and the third steps of the algorithm. The time complexity of the first step and the third step is O(m) and O(n), respectively, where m is the band size. Especially, the total work the third step performs is O(m K ) where K is the number of closest sites obtained in the whole image. Clearly, in the worst case, it will increase to O(m N) (N = n 2 ). Later, Leung et al. [31] extended PBA to compute the EDT on a large 3D surface by storing each binary pixel as a binary bit. However, they have not improved the complexity of PBA. We have compared Zampirolli's GPU algorithm [16] with PBA, and experiments show that, in most cases, Zampirolli's algorithm is about 3 times slower than the PBA. In [36] , Macedo et al. used the exact EDT to compute the shadow map of 3D virtual scenes, and they also pointed out that the PBA is faster than most of the existing EDT algorithms on GPU. [38] . Chamfer metrics are defined by local masks. The weights of these masks are chosen to minimize the deviation from the EDT. Chamfer DTs require two raster scans in the image. Tuan Q. Pham [39] parallelized the Chamfer distance transform by processing diagonal pixels simultaneously and implemented it on a quad core processors using OpenMP. Danielsson's algorithm generates the EDT in a similar way as the raster scanning of the Chamfer EDT. However, the propagated information is the absolute value of the relative coordinates of the nearest feature pixel, instead of only the relative distances. Therefore the method propagates two values in the masks, instead of one. Many improvements have also been proposed for Danielsson's algorithm [40] - [43] . Yamada [44] proposed a parallel version of Danielsson's algorithm by repeatedly applying a mask on every pixel until no pixel changes its value. This algorithm may be executed on all the pixels in parallel, but is computationally inefficient.
Several GPU algorithms which use Danielsson's vector propagation strategy to compute the approximate distance transform have been presented. Rong and Tan [14] proposed an algorithm termed Jump Flooding Algorithm (JFA) to compute Voronoi map and the EDT on GPUs. The JFA needs O(logn) parallel steps and in each step, the information of a closest site is diffused to 8 pixels in relative position. However, the JFA performs a sub-optimal total work of O(Nlogn) (N=n 2 ), it runs fast only if the image size is small. Later, Yuan et al. [26] improved the performance of JFA by storing a site's ID and coordinate separately. An extension of JFA algorithm [27] was proposed to compute Centroidal Voronoi Tessellation on 3D surfaces. However, these two papers have not improved the complexity of JFA algorithm itself. Schneider et al. [15] proposed a raster scan algorithm named SKW based on Danielsson's approach. The SKW allows concurrent propagation for pixels in the same row or column. Later, the SKW have been applied on different applications including 2D image processing and 3D volume rendering [28] . The SKW can be implemented on the GPU with O(N) of total work and the generated distance transform is close to exact. However, the SKW has a sub-optimal time complexity of O(n), and usually does not run faster than JFA. This is because it can only perform parallel propagation operation in one row (or column) at a time. Clearly, the serial nature of the propagation operation lays a restriction on further optimization of these algorithms.
While these approximate algorithms are adequate for many applications, there are cases for which the exact EDT is needed. For instance, the mathematical morphology dilation operator can be implemented as the threshold of a EDT, as presented in [45] . The occasional errors in the approximate EDT could lead to pixels missing from the dilated object. Thus, morphological closing, a dilation followed by an erosion with the same structuring element, could actually remove pixels from the original object, which is in contradiction with the basic properties of mathematical morphology. Satherley and Jones [46] proposed a vector propagation followed by a correction stage to compute an almost exact distance field.
Recently, Honda et al. [18] proposed a GPU algorithm which performs the vector propagation operation the same as SKW. A correction algorithm is then applied to correct errors caused by the vector propagation. Therefore, Honda's algorithm can compute the exact EDT of an image. Their algorithm also suffers from the sub-optimal time complexity of O(n) as SKW does. Their algorithm is about 2 times faster than PBA when 100 images are processed on GPU simultaneously. Their algorithm achieved this performance by increasing the occupancy of GPU. For single image, if the size of image is larger than or equal to 8k × 8k, their algorithm can achieve a speed up factor of 1.54 compared with PBA. However, if the size of image is smaller than 8k × 8k, their algorithm is slower than PBA. Table I shows the time complexity and total work of related GPU algorithms.
III. 1D EUCLIDEAN DISTANCE TRANSFORM
The 1D EDT is a fundamental operation in the 2D EDT. In this section we describe different parallel algorithms for the 1D EDT on EREW PRAM and GPU.
A. Preliminaries
The 1D EDT can be described as a 1D closest point problem. In 1D space, given a set P = {p 1 , p 2 , . . . , p n } of n points in which points are sorted by their coordinates
we assume there are m (0 ≤ m ≤ n) feature points within P and the distribution of these feature points are unpredictable. The closest point problem of P is to find the closest feature point for every point
If we use '1' to represent the feature point and '0' to represent the nonfeature point, we can obtain a binary sequence. Formally, let I : ⊂ Z 1 → {0, 1} be a binary sequence where = {1, . . . , n}, unless otherwise stated. Hence we have an object O including all the feature points:
Formal definition of the 1D closest point problem is given as:
The closest point problem of a sorted point set P in 1D space is the computation of a sequence C whose value at each point p is its closest feature point q from O, where q satisfies:
It is clear, the 1D closest point problem can be solved by a sequential algorithm in O(n) time. As shown in Fig.1 , all '0's of input sequence represent the non-feature points and all '1's represent the feature points. First we set V (i ) to be '∞' if the corresponding point is a non-feature point, otherwise set Observation 1: If a point q in point set P is the feature point, its closest feature point is itself.
Observation 2: For a non-feature point p in point set P, if a feature point q on the left side has been found and there exists no other feature points between p and q, then q is the closest feature point of p on the left side. The closest point on the right side can be found in the same way.
The first observation means that if a point q is the feature point, we have no need to check other points. The second observation implies that, if a point p is the non-feature point, the search of its closest feature point can start from its nearest point, once the closest feature point found, we have no need to check further points. As shown in Fig.2 , to find the closest feature point of p on the left side, we first check the nearest point p L , it is the non-feature point, then check q, and q is the feature point, then stop checking further points.
B. Parallel Algorithms on EREW PRAM
In this section we describe two parallel algorithms on EREW PRAM. Before describing the algorithm, we define an associative operator • −→ which is the basic operation in our algorithm.
Definition 2: For two variables a i and a j , the associative operator • −→ is defined as follows:
Definition 3: Group operation is defined as follows:
where we call the variables on the left side of operator • −→ as transmitter and the variables on the right side as receiver. An example of the group operation is shown in Fig.3 . The parallelization of the group operation is straightforward.
We describe only the parallel algorithms computing the closest feature point on the left side. The closest feature point on the right side can be computed in the same way. We describe two algorithms on the EREW PRAM, one is time efficient and another is work efficient.
The time efficient parallel algorithm can be described by a tree structure shown in Fig.4 . Each node of the tree is a group operation. In the lower level of the tree the receivers receive the information from neighboring transmitters and in the higher level, receive the information from transmitters further away.
Using n/2 processors, the proposed parallel algorithm can perform the computation in O(logn) time. However, from the tree, the total work of the algorithm is O(nlogn), and therefore it is not work-optimal. The pseudocode of the algorithm is given in Algorithm 1. Now we describe the work efficient parallel algorithm which performs O(n) total work on the EREW PRAM. This algorithm is similar to the parallel prefix sum algorithm proposed in [47] . But here we extend it to calculate 1D EDT. It consists of two phases, the reduction phase and the down-sweep phase, each phase can be described by a tree structure, see To build a balanced binary tree for easy description, we add some empty nodes in the down-sweep tree where no operation is executed. Readers can refer to [47] for more details.
Algorithm 1 Time Efficient Parallel Algorithm
This algorithm can complete the computation in O(logn) time. From the tree structures, the total work of the algorithm is O(n). Thus it is not only time efficient but also work Algorithm 2 Work Efficient Parallel Algorithm 
C. Super Efficient Parallel Algorithm for GPU 1) Multi-Levels of Parallelism on GPU:
CUDA (Compute Unified Device Architecture) enabled GPUs expose three levels of parallelism to users [48] . The first level is represented by Warp. The warp is a group of 32 contiguous threads which execute in SIMD fashion. Threads within a warp are referred as Lanes. Each thread of a warp can have its own local memory called Registers. Other threads have no access to these registers. Threads can exchange data within a warp using the Shuffle intrinsic.
The second level is represented by Thread Blocks, each of which can hold up to 1024 threads in modern GPUs. All threads in a block can access on-chip memory called Shared Memory, which allows them to exchange data at L1-cache speed. However, such data exchanges typically require synchronization. CUDA provides a synchronization intrinsic for this purpose. The shared memory consists of several banks and each bank cannot be accessed by multiple threads simultaneously. This Bank Conflict is an important programming issue in practice.
The third level is represented by Grid, which consists of a number of thread blocks. Threads from different blocks can only communicate via an off-chip memory called Global Memory. If threads from the same warp simultaneously access words in the global memory that lie in the same aligned 128-byte segment, the hardware merges these 32 reads or writes into one coalesced memory transaction that is as fast as accessing a single word. There is no explicit grid-wide synchronization provided. Communication through the global memory is supported by a shared L2-cache. However the size of this cache is very limited.
The following built-in variables are used in CUDA programming relating the thread hierarchy of CUDA:
• warpSize returns the number of threads in a warp.
• threadIdx.x, threadIdx.y, threadIdx.z returns the thread ID in the x-axis, y-axis, and z-axis of the thread which is belonging to a particular CUDA block.
• blockDim.x, blockDim.y, blockDim.z returns the number of threads in a CUDA block in the x-axis, y-axis, and z-axis.
• blockIdx.x, blockIdx.y, blockIdx.z returns the block ID in the x-axis, y-axis, and z-axis of the block which is belonging to a particular CUDA grid. 2) Super Efficient Parallel Algorithm for GPU: Algorithm 2 demonstrates that the 1D closest point problem can be solved similar to the parallel prefix sum algorithm. As shown in Algorithm 2, each element of the sequence receives information from all elements on the left side (or right side). However, Observation 2 shows that, for the 1D closest point problem, each element has no need to receive information directly from all elements on the left side (or right side). The search of its closest feature point can start from its nearest point, once the closest feature point is found, we have no need to check further points. For the 1D closest point problem, we can design a parallel algorithm simpler than the parallel prefix sum algorithm on the GPUs.
Because of the importance of binary operations [49] , CUDA provides some warp-level binary operations for improving GPU's efficiency. These operations allow the threads of a warp to compute cooperatively. The binary operations used in this work are shown as follows. Using these operations we can design a super efficient GPU algorithm for the 1D closest point problem.
• int __ballot (int p) returns a 32-bit integer in which bit k is set if and only if the predicate p provided by the thread in lane k of the warp is non-zero.
• int __clz (int x) counts Leading Zeros: Returns the number of consecutive zero bits beginning at the most significant bit of the 32-bit integer x.
• int __ffs (int x) finds the position of the first (least significant) bit set to 1 in x, where the least significant bit position is 1. • int __shfl (int x, int srclane) copy variable x from the thread indexed by sr clane. We use a simple example to explain the warp operations for computing the closest point on the left side, see Fig.7a . For easy description, we assume warpSize is 8 and Integers just consist of 8 bits. First, each thread reads the corresponding input point, and votes the corresponding bit of a integer variable accordingly, see Fig.7a (readers can imagine that all threads are now operating on the same variable). Each thread sets the corresponding bit of the integer variable as 1 if the corresponding point is a feature point, otherwise set the bit as 0. The voting procedure can be implemented using the ballot() intrinsic of CUDA. The ballot() intrinsic can pack all voting bits from all threads in a warp into a single 8-bit integer variable and returns this variable to every thread. Now every thread holds a copy of the voting variable. Since the left-side closest point must locate on the left of the current point, the higher (warpSize-lane-1) bits of the variable should be masked with 0. For example, for thread T 3 , its lane is 3, thus the higher 4 bits of the variable are masked by 0. Then clz() intrinsic is used to count the leading zeros in the masked value. It is clear the lane of the thread which holds the closest point can be computed by (warpSize-clz()-1). For thread T 3 , intrinsic clz() returns 6 leading zeros, therefore its closest point is held by thread T 1 . Each thread holding a feature point writes the index of the feature point into the corresponding shared memory place, see Fig.7b , the first row. All threads then read the corresponding shared memory place to obtain the leftside closest point according to the thread lane obtained from previous step. CUDA provides some special intrinsics such as shfl() to support the register-level communication between threads of a warp. Using this intrinsic, we can avoid the use of shared memory. Access to the shared memory is slower than access to the register memory. Each thread obtains the closest feature point using the shfl() intrinsic which performs communication between the current thread and the thread which holds the closest feature point. The closest feature point on the right side can be computed by masking the lower (lane+1) bits with 0 and using ffs() instead of clz().
According to Observation 2, each position has no need to receive information from all positions individually on the left side (or right side). Therefore the warp operation can be used in each level of the computing tree of the reduction phase, see Fig.8 . The computing tree is a balanced multi-branch tree. Each parent node has multiple children, where the number of children is equal to the warp size, i.e. it is 32. If we assume all warps of all CUDA blocks can be executed simultaneously, the depth of the tree should be log 32 n. As described in the previous section, we also add some empty nodes in the downsweep tree where no operation is executed in the empty nodes, see Fig.9 . Initially, the empty node is inserted at the root of the tree. On each level, each child node is generated by applying the group operation • −→ to the values from the reduction phase and the values from the parent node. The size of each group is 32, i.e. the warp size.
Algorithm 3 shows the operations inside a CUDA block, readers can extend it for a CUDA grid. The number of threads each CUDA blocks holds is 1024. The code is for finding the closest feature point on the left side. By reading the input data in the opposite direction or using the ffs() intrinsic instead of clz(), the code can be changed for finding the closest feature point on the right side.
If we assume all warps of all CUDA blocks can be executed at the same time, the time complexity of the algorithm is O(log 32 n) and the total work is O(n).
The final closest point is computed by selecting the minimum between the left-side closest point and the right-side closest point. The algorithm needs to pass through twice, once for the computation of left-side closest point and another for the right-side closest point.
One-Pass Closest Point We now introduce a modification that allows us to make just one pass. The right-side closest point can be computed from the left-side closest point directly using this approach. Instead of the mask (Fig. 7a) being the higher (warpSize-lane-1) bits we mask the higher (warpSizelane) bits of the voting variable.
Step One: Figure 10 , row LN Each thread writes the thread index of its closest left neighbor that is not itself using _clz(X & mask) where mask is the higher (warpSize-lane) bits.
Step Two: Figure 10 ,
row RN Each thread checks if it is a black pixel. If it is, get i such that i = L N[thread] or i = 0 if L N[thread] = ∞. Now write its thread number to R N[i ].
For example, for thread T 7 , its left-side closest point is stored in L N [7] = 4. Now T 7 writes its own index 7 to R N[L N [7] ] i.e., R N[4] = 7. There is at most one thread which is holding a black pixel and its nearest point according to LN is at '∞'. This thread will write its own index to R N[0] (see thread T 2 ). Step Three: Figure 10 , row CS Each thread can now evaluate its closest point to both the left and right. If the pixel is black, its closest point is itself. For white pixels, the closest left point is given by L N[thread] . The closest right point
For example, for thread T 8 , its left-side closest point is stored in L N [8] , and its right-side nearest point is stored in R N[L N [8] ]. For thread T 1 , its left-side closest point is at '∞', and its right-side closest point is stored in R N [0] . C S is calculated from the decision on which of the left and right points is closest per thread. The algorithm requires only simple instructions per thread and can be independently processed for each thread using the warp instructions.
IV. 2D EUCLIDEAN DISTANCE TRANSFORM
A. Definition
Given a 2D binary image, its Euclidean distance transform computes the distance of each pixel to the closest black pixel termed site. Formally, let I : ⊂ Z 2 → {0, 1} be a 2D binary image where the domain is convex and, in particular = {1, . . . , n}×{1, . . . , n}, unless otherwise stated. By convention, 0 is associated to white pixel, and 1 to black pixel. Hence we have an object O including all the black pixels:
Formal definition of the 2D EDT is given as: Definition 4: Euclidean Distance Transform of a 2D binary image is the transformation that generates a map D whose value in each pixel p is the smallest distance from this pixel to O:
where the distance metric is Euclidean distance.
B. Computation of Closest Site
The closest site of each pixel in the same column can be obtained by 1D EDT algorithm presented above. The results of 1D EDT provide a set of candidate sites to each row of the image. However not all of them are the closest site of the row. As shown in Fig.11 , each circle dot represents the closest site (black pixel) of each pixel in the same column. By drawing voronoi diagram (blue line) of these black pixels, the dominant We first consider a general case. As shown in Fig.12a , there are three sites a(i 1 
We draw perpendicular bisector line l ab of a(i 1 , j 1 ) and j 2 ) and c(i 3 , j 3 ) . The line l ab intersects with the image row j at point p ab (v, j ) has no dominant interval on the image row. We delete the site b(i 2 , j 2 ) from the candidate set.
We remember that now we are computing the closest site of each pixel of a binary image where all pixels are with integer coordinates. If both intersection points p ab (v, j ) and p bc (u, j ) lay between successive two pixels, see Fig.13 , no matter what position relation both intersection points hold, the site b(i 2 , j 2 ) should be deleted from the candidate set, even v < u, since there is no pixel where its closest site is b(i 2 , j 2 ).
Rule 1: Given an image row j and three candidate sites
The perpendicular bisector line l ab of a(i 1 , j 1 ) and b(i 2 , j 2 ) intersects with the image row j at point p ab (v, j ) and the perpendicular bisector line l bc of b(i 2 , j 2 ) and c(i 3 , j 3 ) at point p bc (u, j ). If v ≥ u then site b(i 2 , j 2 ) has no dominant interval on the image row and it can be deleted from the candidate set.
All closest sites of a row of the image can be obtained by stack operations. The push and pop operations of the stack is based on Rule 1.
Let p(v, j ), q(u, j ) be two pixels of image row j where v < u . If site c(i 1 , j 1 ) and c (i 3 , j 3 ) are the closest site of p and q respectively, then we have i 1 ≤ i 3 . The order of all dominant intervals is exactly same with the order in which the corresponding closest sites appear. Based on this, a binary search operation can be used to reduce the amount of stack operations. As shown in Fig.14a, there 5 sites c 1 , . . . , c 5 already in the stack and now we want to add c 6 to the stack. We first check the top of the stack, that is c 5 , whether it can be deleted from the stack or not, see Fig.14b . If c 5 can be popped from the stack, then the forward binary search is used to find the next site c 3 , and check it, see Fig.14c . If c 3 can be popped from the stack, then the forward binary search is applied again to find the next site to check (it means all candidate sites between c 3 and c 5 can be popped from the stack without checking). Otherwise the backward binary search is applied to find the next site to check. The combination of the binary search and stack operations is applied until there is no site can be popped further, see Fig.14d .
Rule 2: Assume there k sites c 1 , c 2 , . . . , c k already exist in the stack and now we are adding a new site c h to the stack. If the top of the stack, c k , can be popped from the stack, the next site to check is c k/2+1 whose index can be obtained by the binary search.
However, the performance of stack operations with the binary search depends on the number of closest sites among all candidate sites. If the number of closest sites among all candidate sites is small, then the stack operations with the binary search can achieve a high performance. Otherwise, the performance of stack operations will decrease significantly because a large number of binary searches will be performed.
Now we consider the merge of two independent stacks one named S 1 and another named S 2 . The merge of two stacks is an important operation in the second step of our parallel algorithm. We add sites from the bottom of S 2 (we call it the sending stack) to the top of S 1 (we call it the receiving stack). It is clear, during merging procedure, once there are two sites on the top of S 1 that are from S 2 , and neither of these two sites can be popped from S 1 further, then we can add the rest of S 2 to S 1 without checking. However, in practice, data movement is with heavy cost. Therefore, to avoid moving data, a linked list can be used to connect two stacks to create a new larger stack, as proposed in [17] . Then the rest of S 2 has no need to be moved to S 1 .
Rule 3: We merge two independent stacks S 1 and S 2 by adding sites from the bottom of S 2 to the top of S 1 . If there are two sites from S 2 in the top of S 1 , and neither of these two sites can be popped from S 1 further, then we can add the rest of S 2 to S 1 without checking.
C. Parallel Algorithm for 2D Euclidean Distance Transform
We describe a fully-parallelized work-optimal 2D EDT algorithm in this subsection.
Step 1: Compute the closest site of each pixel in the same column using the 1D EDT algorithm. The results of 1D EDT will be stored in a 2D array named R 1D .
Step 2: Compute closest sites for each row of the image using the stack operation.
Step 2.1: The creation of the stacks. Partition each row of R 1D into n/3 groups each with 3 elements. Each processor performs the stack operations to the corresponding group following Rule 1.
Step 2.2:
The merge of the stacks. Merge every two stacks using a processor following Rule 2 and 3. In the merging procedure, we add sites from the bottom of one stack to the top of the another. Repeat the merging procedure until one stack left.
Step 3: For each pixel, find the closest site and compute the distance between them.
Step 3 
D. GPU Implementation
Considering the coalesced access of the global memory of the GPU, Step 1 is implemented along each row of the input image. The One-pass algorithm described in Section III-C.2 is employed. If the shared memory is capable of storing the whole image row, then the global memory just needs to be accessed twice, one for read and another for write.
In
Step 2, the output of Step 1 is partitioned into many tiles and a CUDA block is assigned to a tile, see Fig.16a . Each CUDA block reads the global memory along column and creates the stacks in the shared memory. It is clear the access to the global memory is coalesced. The creation of stacks in the shared memory is also without bank conflicts. Each CUDA block writes the created stacks back to the global memory in a transposed location, see Fig.16b . For example, CUDA block{0, L} reads the left-bottom corner of the input array, see Fig.16a , and writes the created stacks back to the righttop corner of the output array, see Fig.16b . For this specific step, the read of the shared memory has a bank conflict. However, it still can achieve a better performance, since an explicit matrix transpose [30] is avoided. The binary search on each stack is not applied. Since experiments show that, in most cases, the use of binary search will lead to a poor performance. To make a trade-off between the creation cost and the merging cost of stacks, the size of initial stack should be larger than 3. The merge of stacks are implemented in global memory directly. For each stack, we use two pointers, one indicates the bottom of the stack and another the top of the stack. During the stack merging procedure, we just move these two pointers. The merge of two stacks terminates once two sites are added to the receiving stack, and are not popped from the receiving stack further, see Rule 3. This avoids the addition of the rest of the sending stack to the receiving stack, as the GPU is sensitive to memory operations.
Step 3, one thread warp is assigned to one stack to compute the perpendicular bisector lines, where the stack is indicated by a pair of pointers (remember that the final stack consists of many small stacks each of them is indicated by a pair of pointers, here we assign a thread warp to a small stack). The 1D array mentioned in Step 3.1 is allocated in the shared memory. Finally we transpose the results of Step 3 using the shared memory following the algorithm of [50] . Now we give a brief discussion on how to implement the algorithm outside CUDA, for example in OpenCL (Open Computing Language) [51] or Vulkan [52] . The main programming issues would be in the GPU implementation of Step 1. OpenCL provides equivalent functions of clz() and shfl() of CUDA, but the current version does not provide an equivalent function to ballot(). User can emulate ballot() using the atomic_or(oper and1, oper and2) function of OpenCL, which performs bit-wise or-operation on two operands oper and1 and oper and2 and stores the result in oper and1. These two operands are unsigned integers in our work. The oper and1 should be in Local Memory of OpenCL (which is equivalent to the shared memory of CUDA). For each thread, the corresponding bit of oper and2 is set to 1 if the current thread holds a black pixel, otherwise 0. Other bits of oper and2 are set to 0. Since each thread needs to wait for all other threads to finish the atomic operation, there should be a synchronization between atomic_or() and clz(). Vulkan is a graphics and compute API that provides low-overhead, crossplatform access to modern GPUs. Latest version of Vulkan is integrated with SPIR-V (Standard Portable Intermediate Representation) [53] which is an intermediate language for parallel compute and graphics by Khronos Group. SPIR-V (version 1.3) provides a corresponding function to ballot(). SPIR-V Extended Instructions for GLSL (OpenGL Shading Language) [54] provides a function FindUMsb() which finds the most significant bit in a unsigned integer. SPIR-V also provides an equivalent of the shfl() function.
E. Complexity Analysis
We now analyze the complexity of the GPU implementation of the algorithm. We assume all CUDA warps of all CUDA blocks can be executed simultaneously.
Fact 1:
Step 1 takes O(log 32 n) time and O(N) of total work on GPU.
Proof: The height of the computing tree shown in Fig.8  and 9 is log 32 n, it yields the time complexity claimed. In each level of the computing tree, the number of threads computing is n/32 m where m (0 ≤ m ≤ log 32 n − 1) is the height of current level of the tree. Therefore the total work of Step 1 is O(N)(N = n 2 ).
Fact 2: The expected time complexity of Step 2 is O(logn).
The total work Step 2 performs is O(N).
Proof: The height of the merging tree is O(logn). If the distribution of closest sites among candidate sites allows all processors to perform the same amount of work, then
Step 2 will run in O(logn) time. It yields the time complexity claimed.
For an image row, suppose there are k(k ≤ n) candidate sites obtained and k non (k non < k) sites among them are not the closest sites of the image row. These k non sites will be popped from the stacks on each level of the merging tree. Suppose the number of sites which will be popped from the stacks on i th (0 ≤ i ≤ logn − 1) level of the merging tree is k i non , we have
When we add (push operation) one site, which is from the sending stack, to the receiving stack, it may cause one or more pop operations on the receiving stack. On the other hand, we remember that the merge of two stacks will terminate once there are two sites in the receiving stack that are from the sending stack and they cannot be popped from the receiving stack further. It means that, if k i non pop operations are executed in i th level of the merging tree, the number of push operations executed is at most k i non + 2 × n/2 i , where n/2 i is the number of threads computing in i th level of the merging tree. These threads will add (push operation) at least two sites into the receiving stack. Therefore the total number of push operations in all levels of the merging tree is:
Step 3 takes O(log 32 n) time and O(N) of total work on GPU.
Proof: The intersection points between the image row and the perpendicular bisector lines of each pair of closest sites can be computed in O(1) time and O(n) of total work on GPU. The right-side closest point problem can be solved in O(log 32 n) time and O(n) work on GPU (as proved in Fact 1). It yields the complexity claimed.
F. Evaluation
The evaluation is performed on a machine with an Intel Core i7-6700 CPU (3.40GHz) and a NVIDIA GeForce GTX1080 graphics card with 8GB of video RAM. The programming environment is Visual Studio 2015 with CUDA 8.0 running on Windows 10. 
1) Comparison With PBA Step by
Step: Both PBA and our algorithm consist of the same steps. We compare two algorithms step by step varying the density of sites. Examples of input images are shown in Fig.17 . We use the original source code of PBA which is available online at PBA's project web site for comparison. Table II shows the running time of each step of two algorithms. The size of input image is 8k × 8k. As shown in the table, the speedup factor of Step 1 is always more than 2, since the time complexity of Step 1 of our algorithm is only O(log 32 n). Also, Step 1 of our algorithm is implemented by the one-pass approach which just needs to go through the data once, see Section III-C.2 and Section IV-D. To access the global memory with coalescing, a matrix transpose is needed in PBA following Step 1. However, our algorithm has no need to perform this transpose since the transpose is hidden in
Step 2, see Section IV-D. There is no significant difference between two algorithms in Step 2, our algorithm is slightly faster than PBA. The performance of PBA is dominated by
Step 3 and the main difference between two algorithms also appeared in this step. Since, the PBA needs O(n) time in this step, versus O(log 32 n) time of our algorithm. Finally, both algorithms perform the matrix transpose again to get correct results.
2) Comparison With PBA Varying Geometric Structure of Input Image: Both PBA and our algorithm are very sensitive to the content of the input images. Therefore we use images with different geometric structure to test them. Images used in experiments are shown in Fig.18 . The first three images contain different geometric shapes. These three images are chosen since geometrical characters of the input image may impact the performance of algorithms. The last three images contain real objects. We choose these three images since which have been universally used as impartial benchmarks for image analysis algorithms. The size of the images is 1k × 1k. Our algorithm always outperforms the PBA, see Fig.19 , no matter how complex the input images are. 
3) Comparison With Other Algorithms:
In the absence of source code availability from Honda [18] , we compare our work with JAF, PBA and Honda's algorithm using the data presented in [18] which was created by testing the various approaches on the same GPU (NVIDIA GeForce GTX 1080). The following four images are used in the experiments: an image with 0.01% of black pixels (the location of black pixels is chosen randomly), an image with 1% of black pixels, an image with 50% of black pixels and the image of Lena. The size of image varies from 512 × 512 to 16k × 16k. As shown in Table III , for smaller images, the JFA can achieve a better performance comparing with PBA and Honda's algorithm. The PBA is faster than JFA and Honda's algorithm for the images with the size smaller than 8k × 8k. Honda's algorithm can achieve a better performance for the images with size larger than or equal to 8k × 8k. Our algorithm outperforms all three algorithms in every case, see Table III . The maximum speedup factor is 4.67, 2.86 and 6.64 compared with JAF, PBA and Honda's algorithm, respectively.
V. CONCLUSION
In this paper, we present a fully-parallelized work-time optimal exact EDT algorithm. Compared to the existing PRAM algorithms and other algorithms, this algorithm is suitable for implementation on modern SIMD architectures such as GPUs. As a fundamental operation, 1D EDM is efficiently parallelized first. Three different algorithms are proposed for the 1D EDT on the EREW PRAM and GPU. Specifically, the GPU algorithm for the 1D EDT, which uses CUDA binary functions such as ballot(), ffs(), clz() and shfl(), runs in O(log 32 n) time and performs O(n) work. Using the 1D EDT as a fundamental operation, we propose a fully-parallelized work-time optimal 2D EDT algorithm. This algorithm consists of three steps. The first step of the algorithm runs in O(log 32 n) time and performs O(N) of total work on GPU. The second step performs O(N) of total work and has an expected time complexity of O(logn) on GPU. The third step runs in O(log 32 n) time and performs O(N) of total work on GPU. Clearly, this algorithm is a worktime optimal algorithm. As far as we know, this algorithm is the first fully-parallelized work-time optimal algorithm for the GPUs. Experiments show that this algorithm outperforms most of the state-of-the-art GPU algorithms. Currently, this algorithm is the fastest 2D EDT algorithm with the lowest complexity on GPUs.
