Connected component labeling is an important but computationally expensive operation required in many fields of research. The goal in the present work is to label connected components on a 2D binary map. Two different iterative algorithms for doing this task are presented. The first algorithm (Row-Col Unify) is based upon the directional propagation labeling, whereas the second algorithm uses the Label Equivalence technique. The Row-Col Unify algorithm uses a local array of references and the reduction technique intrinsically. The usage of shared memory extensively makes the code efficient. The Label Equivalence algorithm is an extended version of the one presented by Hawick et al. (2010) [3] . At the end the comparison depending on the performances of both of the algorithms is presented.
Introduction
Connected component labeling is an important problem appearing in different fields of research. Although one can consider this problem as a general one, namely arbitrary graph component labeling or coloring, often the specific task of labeling connected components on a grid is of great interest. This is needed for example in computer vision as a part of segmentation, namely separating the objects from the background. It reduces the image to a binary representation, in which objects are represented by 1's and the background by 0's. Another field of application are cellular automata (CA) models used for different kinds of simulation in physics, mathematics and biology.
Since the early 1970s, numerous approaches for connected component labeling have been introduced [4, [8] [9] [10] . Most of these approaches are suitable for sequential processing, but also some parallel algorithms have been developed [5, 1] .
The use of GPUs with interfaces as CUDA [6] or OpenCL [7] opens a new perspective for many data processing approaches. The problem of graph component labeling with GPUs has already been addressed by Hawick et al. [3] .
In the current work, we present two algorithms for connected component labeling on a 2D binary grid. For our implementations, we use CUDA and measure the performances of the two algorithms on an NVIDIA TESLA C1060. The first algorithm for directional propagation labeling uses the reduction technique. For the second algorithm, we improve and extend the implementation by Hawick et al. [3] , which allows one to reduce the total memory usage and simplifies the original procedure. We demonstrate that our implementation can be easily extended to compute different connected component characteristics such as area or perimeter. This is of interest, e.g. in cases where binding energies of cluster atoms change with cluster sizes.
The paper is organized as follows. In Section 2, the two algorithms are described. Section 3 presents the comparison of the performances. In Section 4, we conclude the paper.
Algorithm description
Connected component labeling in our framework is the assignment of a unique label to each non-zero element on a 2D grid in such a way that all non-zero neighbors get the same label. In this work we consider 2D cases with four neighbors (north, south, east and west).
Row-Col unification
The first algorithm is similar to a ''kernel C'' algorithm described in [3] . This method implements the directional propagation labeling. In the initial ''kernel C'' approach, each thread is responsible for the whole row or column. In our implementation, we intrinsically apply the reduction technique, which is well known in parallel computing [2] . This technique allows one to obtain integrated characteristics of a bunch of data, for example, the sum of array elements. Additionally, in our implementation we use shared memory and a local array of references. In the following section, we describe the main steps of the algorithm.
The main idea of the algorithm is the propagation of the label with the smallest value. The procedure has to initialize all nonzero elements with unique identifiers which are equal to the value of the corresponding index of the element in the label array. Thereafter, the smallest label value propagates along all rows and then columns (see Fig. 1 ). Making several iterations of such a label propagation procedure, the algorithm finally marks all elements in each connected component with the smallest label value of this connected domain. for other sizes was not done due to the fact that this algorithm is much slower than the second one (see Section 3). The algorithm is iterative and consists of two steps. The host part is described in pseudo-code in Algorithm 1. First, the label array (mapLabels Dev) is initialized with unique identifiers. Then, two unification procedures for rows (UnifyRow) and columns (UnifyCol), respectively, are called within the while loop. Each procedure requires two parameters: the first one is the label array mentioned above; the second one is an integer indicator for the loop termination. If UnifyRow or UnifyCol do not set this indicator to 1, the while loop is terminated. This is possible due to the fact that the operation of value assignment does not produce any synchronization problems for CUDA devices in this case. The main routine kernel is described in Algorithm 2. It is identical for both row and column unification. The procedures for row and column processing differ from each other only in the way they access the global memory. The grid for this kernel is constructed in the following way: threads from the same block access the same row or column. The layout of the global memory is orga-nized in such a way that the access for rows is coalesced whereas for columns it is not. In the routine, the label array is copied first from the global to the shared memory, where also the local array of references is allocated. We add one more element to the front of the label and reference arrays to avoid unnecessary conditioning during the update procedure. Secondly, we apply an initial unification procedure which is the first step of reduction. This is described in Algorithm 3. Each two neighboring elements are processed in each thread. If both elements in the label array are non-zero, then the one that has the smallest value is propagated, i. e., this value is assigned to both array elements. The references of these elements are initialized according to the indices of the elements: the one with the larger index value references the one with the smaller index value which has a reference to itself. This is shown in Fig. 2 .
The ContinueInd indicator is set to 1, if two neighboring nonzero elements that are not equal to each other are found.
Then, a cascade of reduction and update steps is executed. Each subsequent step in the cascade requires half the numbers of active threads for the UnifyRowL reduction procedure compared to the previous step. However, it uses all threads in a block for an Update procedure. In Fig. 3 , an example of one (third) step of reduction is shown. As mentioned before, in the first reduction step each thread processes two neighboring elements. In the subsequent steps, the number of neighboring elements processed by each thread doubles, i.e., in step 2 the number of elements is 4, in step 3 the number of elements becomes 8 and so on.
In Fig. 3 , three groups of labels and references which are processed by three threads are shown. The separation of groups is indicated by thicker lines. Each thread checks the values of the two central elements of the corresponding groups (yellow). If they are non-zero, the minimum value of these two elements is assigned to the element which is referenced by the element with the smaller index value from the processed pair. The references are taken from the reference array. The references of the processed pair are also updated, i.e., the element with the larger index value receives the same reference as the element with the smaller index value. The updated values in both arrays are marked in red. A detailed description of the UnifyRowL procedure is given in Algorithm 4. After the procedure UnifyRowL is finished, all threads in a block must be synchronized and the Update function is called in order to complete the reduction step. Here, all threads update both labels and references for two processed neighboring elements, i.e., each element obtains the corresponding label and reference from the element to which it currently points. Algorithm 5 summarizes the Update procedure. return The weak point of this procedure is the bank conflicts which lead to the serialized requests for the reading of the value needed for the update. This gives rise of the performance drop for the whole procedure. 
Hoshen-Kopelman or Label Equivalence
The second presented algorithm is similar to the well-known Hoshen-Kopelman [4] algorithm. Recent publications by Suzuki et al. [10] and Wu et al. [11] give a nice description of the label equivalence procedure for the labeling of connected components in a binary image in the sequential case.
The parallel version of the Label Equivalence algorithm for GPUs has been presented by Hawick et al. [3] .
According to their description, the multi-pass algorithm consists of three phases which are repeated in a loop: scanning, analysis, and labeling. The first phase constructs a forest of references, the second one connects all references in each tree to the root, and the third one assigns the corresponding labels according to the references.
Our implementation is significantly improved compared to the algorithm presented by Hawick et al. [3] in terms of memory consumption: there is no need for an additional reference array. Apart from that, it requires less steps: the labeling phase is omitted. Other features are the usage of padding which allows us to avoid extra conditioning for the border elements. Atomic operations which slow down the computation dramatically due to their synchronous nature are also omitted in the scanning phase. This is possible due to the iterative nature of the algorithm: if collision happens, it will be resolved during the next iterative step. Fig. 4 demonstrates the first step of the algorithm as well as usage of the padding. Here, the left picture shows initial labels with the value equal to the array index, starting from the left top added element. The right one shows labels after the first step of the algorithm. Considering labels as references one can find the forest consists of 8 trees (for convenience they are depicted with different colors). After the second step values of all the labels in those trees will be the same and equal to the corresponding root's label value, i.e., 20 for the gray region, 31 for the orange, 41 for the blue etc.
Listing 1 shows the initialization procedure. 'UINT' stands for the 'unsigned int' type. 'SIZE*' and 'SIZE*PAD' are definitions of macros for the size of the working area and the whole area including padding, respectively. The label array must be filled with a binary image. The procedure initializes all non-zero elements with unique identifiers which are equal to the value of the corresponding index of the element in the label array. This allows us to use these values later as references.
After initialization of the label array, the main loop for the iterative labeling of connected components starts. Two functions are called on each step: ''Scanning'' and ''Analysis''. The first function represents the first phase of the algorithm, namely the linking of the elements. This is done by choosing the smallest nonzero label value within five elements: one central and its neighbors (see Listing 3). As labels with zero value are ignored during the procedure, constructed trees never connect to the background and, therefore also with each other. As a consequence of the locality of the procedure, the algorithm works for any number of disconnected components.
The indicator IsNotDone is only set to 1, if the label value is changed. This is done in order to stop execution of the algorithm when no further iterations are needed.
The second function represents the relabeling phase of the algorithm. Here, in a while loop each thread takes a sequence The loop is terminated when the label with the value that coincides with the index of that element is found (see Listing 2). This procedure works remarkably fast due to the fact that the steps which are performed previously make the later ones shorter. Fig. 5 gives an example of such a situation. Here, the root element is shown in gray. All labels are linked to it if they are considered as references. For example, if for an element with index 6 (in blue) the relabeling was already done. Then, for an element with index 10 the sequence is shorter, because after processing element 6 the first element will be taken directly.
Benchmarks and discussion
To measure the performance of the two algorithms presented before several tests have been done. All tests presented in the following section are run on an NVIDIA TESLA C1060. Table 1 summarizes the results for images with such a size and different topological and occupational settings of the image. Under topological settings we understand different shapes of the connected components, whereas occupational settings refer to the fraction of the connected components to the image size. Here, ''RC'' stands for the ''Row-Col Unify'' and ''LE'' for ''Label Equivalence'' algorithms. In Table 1 , three different cases of the image filling are considered: spiral (see [3] for the details), random, and blob with random. The first one is traditionally considered to be a particularly complex case, whereas the second and the third ones are of great interest for tasks like cellular automata (CA).
For the CA procedure, one usually starts with a random distribution of the occupied cells and follows their dynamics which often results in some blobs of a certain size and a randomly distributed small noise. We compare two cases for the random cell distribution with occupation ratios of 0.5 and 0.1, which means that half and 10% of the image cells are occupied, respectively.
For the ''blob with random'' distribution we consider four cases with different blob sizes (C1-blob radius 10, C2-20, C3-50, C4-100). The occupation ratio is kept fixed at 0.5. Half of the occupied cells are assigned to blobs and the rest are randomly distributed. This choice of fixed occupation ratio and different blob sizes result in different numbers of blobs for different cases.
For the ''Label Equivalence'' algorithm, the performance of the extended version of the procedure (named in Table 1 ''SZ'') as well as that of the original one (named in Table 1 ''NSZ'') is measured.
To demonstrate the flexibility of our extension of the ''Label Equivalence'' algorithm in addition to the pure labeling the size of each connected component is also calculated, which is needed e.g. for calculating cluster-size dependent binding energies in CA applications. For that purpose another array _CSize is used. This array must be initialized with 0's in non-occupied cells and with 1's in occupied cells. The only part which has to be modified is the ''Analysis'' function. Needed modifications are shown in Listing 4. The presented code has to be added after the assignment of the label at the end of the function. Here, after reconnecting the element with the root of the tree of references the size of the root is incremented by the size of the current element, which afterwards is set to 0. In order to synchronize the summation atomic operations are used. As a result of calculations one gets an array with connected component sizes stored in the element with the smallest index for each component. Other extensions to the algorithm can be done in a similar manner. Table 1 shows that the ''Label Equivalence'' algorithm is faster in all cases. Blob sizes do not affect the speed of the second algorithm. The occupation ratio, on the other hand, influences the speed significantly.
The case with a spiral distribution is extremely difficult to handle for the first algorithm and very easy for the second one. It is interesting to mention that our modification of the ''Label Equivalence'' algorithm needs only 3 iterations for such a distribution regardless of the size of the map. This shows that the number of iterations needed for the second algorithm depends strongly on the topology of the connected component and only weakly on their size. Table 2 shows the change of the performance of two modifications of the ''Label Equivalence'' algorithm with respect to different occupation ratio for the case of 2048 × 2048 grids. Here, the maximum number of iterations, and therefore calculation time, is needed for the case of an occupation ratio of 0.5-0.6. For other occupation ratio values the time consumption is less, although for the case of high occupation ratio it drops slower than for the lower ones. The reason for this is the first condition checking for the nonzero elements. Table 3 shows the performance of the ''Label Equivalence'' algorithm with different sizes of the map. The execution time increases linearly with the total number of cells. Table 4 shows the performance of the second algorithm with respect to the block size of the kernel. A significant drop of the performance occurs if the block size gets smaller than the warp size (32 in our case). The result is expected because the multiprocessor schedules and executes threads in groups of 32 parallel threads (see [6] ).
Conclusions
Two type of iterative algorithms for labeling connected components on a 2D binary grid were described. The first one is a ''Row-Col Unify'' algorithm which implements the directional propagation labeling technique into CUDA. The second one is the modified version of ''Label Equivalence'' implemented by Hawick et al. in [3] .
It was shown that there are two major advantages of the ''Label Equivalence'' algorithm over the ''Row-Col Unify'' one. The first advantage is the simpler implementation which leads to much less instructions. The second one is concerned with a reduced number of iterations needed for the procedure to expand the smallest label on a whole connected component. In the case of a spiral distribution of occupied cells for a 1024 × 1024 grid the number of iterations needed was 514 for the ''Row-Col Unify'' algorithm and 3 for ''Label Equivalence''. This demonstrates that the productivity of the second algorithm depends on the topology weaker than the productivity of the first one.
In general, the second algorithm is 15 ∼ 35 times faster compared with the first one. Another advantage of the ''Label Equivalence'' algorithm is its capability to be easily extended to calculate additional integral characteristics of each connected component, like size, perimeter or area.
Appendix. Label Equivalence procedure listings
See listings. 
