Connected Component Labeling (CCL) is a fundamental algorithm in computer vision, and is often required for real-time applications. It consists in assigning a unique number to each connected component of a binary image. In recent years, we have seen the emergence of direct parallel algorithms on multicore CPUs, GPUs and FPGAs whereas, there are only iterative algorithms for SIMD implementation. In this article, we introduce new direct SIMD algorithms for Connected Component Labeling. They are based on the new Scatter-Gather, Collision Detection (CD) and Vector Length (VL) instructions available in the recent Intel AVX512 instruction set. These algorithms have also been adapted for multicore CPU architectures and tested for each available SIMD vector length. These new algorithms based on SIMD Union-Find algorithms can be applied to other domains such as graphs algorithms manipulating Union-Find structures.
INTRODUCTION
Connected Component Labeling (CCL) is a central algorithm between low-level image processing (filtering and pre-processing) and high-level image processing (scene analysis, pattern recognition and decision). CCL consists in providing a unique number to each connected component of a binary image. There are several applications in computer vision that use CCL such as Optical Character Recognition, motion detection or tracking.
CCL algorithms are often required to be optimized to run in real-time and have been ported on a wide set of parallel machines [1] [13] . After an era on single-core processors, where many sequential algorithms were developed [7] and codes were released [2] , new parallel algorithms were developed on multicore processors [15] [6] .
The majority of CCL algorithms for CPUs are direct, by opposition to iterative ones -where the number of iterations (the number of image pass to process it) depends on the image structure -and require only 2 passes thanks to an equivalence table.
The first algorithms designed for GPUs [10] [9] [5] and SIMD [19] [12] were iterative. Furthermore, the number of iterations required to reach a solution cannot be predicted beforehand and can reach a large number. In recent years, some direct algorithms for GPUs were designed and are better suited for real-time image processing in embedded systems [11] [3] [16] [8] .
In this article, we introduce a new direct CCL algorithm for SIMD processors which relies on scatter-gather operations. These algorithms are only available using Intel AVX512 extension and the upcoming ARM/Fujitsu SVE. Section 2 explains algorithmic design and the required SIMD instructions. Section 3 describes the classic Rosenfeld algorithm [18] and its derivatives, as well as concurrency issues that can appear using SIMD. Section 4 introduces two new SIMD algorithms with implementation details. The first one is based on a classic pixelbased approach, while the second one is based on sub-segments and uses bit manipulation and conflict detection to reduce memory accesses. Section 5 presents the benchmark protocol for reproducible experiments, the results and their analysis.
ALGORITHMIC DESIGNS TO EXPLOIT ARCHITECTURES
As clock frequencies of modern processors are expected to stay near their current levels, or even to get lower, the primary method to improve the computational capability of a chip is to increase either the number of processing units (cores) or the intrinsic parallelism of a core (SIMD). In this work, we target the most recent Intel architecture with the AVX512 SIMD extension set. The AVX512F extension has added the scatter instruction for each available vector size (128, 256, 512) which is required to develop an SIMD Rosenfeld algorithm. We have used the instructions compress and expand which are available in the AVX512VL extension (algo 8). From AVX512CD, we also used the conflict detection (CD) and lzcnt (count leading zeros) instructions (algo 11).
Each of these instructions are required to implement efficient SIMD CCL algorithms and are only available in the recent Intel AVX512 extension set. Developing new algorithms that use the architecture efficiently requires heavy algorithmic modifications and can only be done efficiently if the correct instructions are available. Due to this complex process, it is impossible for a compiler to provide any vectorization support. Domain Specific Languages (DSLs) such as Halide [17] would also require additional modifications from the DSL and the application programmer due to the new available instructions.
In this context, we developed a set of AVX512 C++ templates code that can be instanciated in 128-bit, 256-bit and 512-bit in order to evaluate the impact of SIMD width on the performance. The required sub-sets of AVX512 are AVX512F, AVX512CD and AVX512VL. We have further extended our portable template code to support the new ARM SVE instruction set but the code has not been tested in the scope of this paper. 
CLASSIC ALGORITHMS
Usually, CCL algorithms are split into three steps and perform two image scans (like the pioneer algorithm of Rosenfeld [18] ). The first scan (or first labeling) assigns a temporary/provisional label to each connected component (CC) and some label equivalences are built if needed. The second step solves the equivalence table by computing the transitive closure (TC) of the graphes (in fact a forest) associated to the label equivalences. The third step performs a second scan (or second labeling) that replaces temporary labels of each CC by its final label. Figure 1 defines some notations and gives an example of a classic Rosenfeld algorithm execution. Let p x , e x , the current pixel and its label. Let p a , p b , p c , p d , the neighbor pixels, and a, b, c, d, the associated labels. T is the equivalence table, e a label and r its root. The first scan of Rosenfeld is described in algorithm 3, the transitive closure in algorithm 4, while the classical union-find algorithms are provided in algorithms 1 & 2. In the example ( Fig. 1 ), we can see that the rightmost CC requires three labels (1, 2 and 4) . When the mask is in the position seen in figure 1 (in bold type), the equivalence between 2 and 4 is detected and stored in the equivalence table T . At the end of the first scan, the equivalence table is complete and applied to the image (like a Look-Up- Table) . 
Algorithmic and SIMD optimisation
Decision Tree (DT) based algorithms for CCL [20] have been proved to be very efficient to enhance scalar implementations. The DT reduces the number of Union and Find function calls to the strict minimum, i.e. when there is an equivalence between two different 
if (r a 0 and r a e x ) then U nion(e x , r a , T ) DT is more efficient than path-compression as it reduces the number of memory accesses. Considering the classical implementation of the Rosenfeld algorithm (algo. 3), there are four calls to Find and one to Union. The calls to find in the union can be omitted because the input labels are already equivalence trees' roots. Using a DT (algo. 5), there are at most one call to Union and two calls to Find as we have at most one equivalence between labels.
If we consider the extended topology of the mask in the SIMD Union Find (Fig. 4 ), there are, for an SIMD vector of cardinality card, at most card/2 + 1 different labels, card/2 calls to Union, and thus, card calls to Find, with card/2 − 1 which are redundant. If we take a vector with a cardinal of 8, we get 1/2 Union per pixel and one call to Find. In the classical Union Find approach, we first perform a call to U nion(e 0 , e 1 ) which calls Find(e 0 ) and Find(e 1 ). Then, we call U nion(e 1 , e 2 ) which calls Find(e 1 ) and Find(e 2 ), and so on, until a call to U nion(e 3 , e 4 ) which calls Find(e 3 ) and Find(e 4 ). As there can be more than one Union in an SIMD vector, concurrency issues can appear. 
Algorithmic and SIMD concurrency issue
Depending on the neighboring labels of a pixel, concurrency issues can appear while performing simultaneous union operations. In the second row, there is one CC on the left with twice the same equivalence (T [4] ← 1, T [4] ← 1) which is redundant but not an issue. However, on the right happens a concurrency, as we have 4 ≡ 1 and 4 ≡ 2. A possible correction could be 2 ≡ 1 (T [4] ← 1, T [2] ← 1). Note that this problem can happen starting from a 3pixel wide SIMD register if the two patterns are merged. The SIMD implementation of Union Find has to address these concurrency issues, which is -by far -non trivial to solve and to implement efficiently in SIMD.
NEW ALGORITHMS 4.1 SIMD Union-Find
The biggest challenge when designing an SIMD CCL algorithm is to design a fast and concurrency-free union-find algorithm to manage equivalences. The union algorithm must take into account the conflicts from multiple equivalence tree roots involved in simultaneous merge operations. Figure 6 shows an example of such complex merges. The top row shows the evolution of the root label pointers and the bottom row shows the pending merge operations in black and the finished merge operations in grey. Algorithms presented in this section and the followings are written for a cardinality of 8 (for the sake of clarity) but can easily be extended to 4 or 16 elements. Unmasked equalities and inequalities between vectors are tested using the intrinsics cmpeq_epi32_mask and cmpneq_epi32_mask, but are written as mathematical comparison to be more readable. Masked comparisons are expressed using their corresponding intrinsic.
Algorithm 6 uses gather loads to find the roots of all labels in ì e. The loop must run until all roots have been found, so the number of iterations is equal to the maximum distance between involved labels and their roots. 
Algorithm 7 is inspired by the Playne-Equivalence reduce function designed for parallel CCL on GPU [16] . The main difference with a GPU algorithm is that we have to consider the parallelism within an SIMD vector instead of the parallelism between GPU threads in memory. To solve the concurrency issue the same way the GPU does, we introduce the VecScatterMin function that emulates the behavior of the CUDA atomicMin function. This function takes two vectors ì idx and ì val and tries to perform the operation:
. It then returns the old value of T [ ì idx] if the operation succeeded or the current value if another vector element has written it first. Using this function, only one value is written to a memory address at a time, but it is always the minimum value of the concurrent store operations. This allows the VecUnion operation to retry until all involved equivalence trees have been merged. Because of the pixel topology, there can be at most card/2 simultaneous equivalence tree merges. In practice, the merge vectors are very sparse, allowing us to reduce the numbers of operations needed by compressing the vectors. The VecScatterMin function is described in Algorithm 8. 
SIMD Rosenfeld pixel algorithm
Like its scalar counterpart, the SIMD Rosenfeld pixel algorithm (v1) is a two pass direct CCL algorithm. In order to simplify the algorithm as well as improving the memory footprint and the performance, we embed the equivalence table into the image. The label creation can now easily be done in parallel as the new label is equal to the linear address of the pixel plus 1 to differentiate the background: i × w + j + 1, where (i, j) are the pixel coordinates and w the width of the image. This bijection also allows for a faster relabeling as it can be done during the transitive closure step. 
Algorithm 9 describes the processing of a pixel vector during the first pass. The pixels are processed in a sequential natural reading order. Neighbor vectors ì a, ì b, ì c, ì d and the current pixel ì x can be obtained by doing aligned loads or by register rotation. Border and corners cases can be handled by setting the out of image pixel vectors to zero. We start by constructing unaligned vectors ì ab and ì bc from ì a, ì b, ì c by doing some element shifting (lines 2 and 3). We also compute the bitmask m corresponding to ì
x: a bit is set to 1 for a foreground pixel and to 0 for a background pixel (line 4). The next step is to initialize the labels in ì
x as shown in figure 8 . Each pixel either points to itself or to its left neighbour (lines 6 to 9). We can now compute ì dx from ì d and ì x and propagate the neighbor labels into ì
x (lines 10 and 11). The vec_maskz_min + n function can be implemented using the property of unsigned integers overflow (−1 = MAX_UINT) and the maskz_min_epu32 intrinsics. Finally, ì x can be stored to memory (line 13) and we can call the VecUnion function described in subsection 4.1 (lines 15 to 18). As previously said, only the stairs and concavity patterns can lead to an union operation. The other configurations are handled with the label propagation step. In our implementation, the equivalence table pointer is moved by 1 pixel to account for the +1 in the labels and simplify memory accesses.
The second pass is the simultaneous transitive closure and relabeling. In this pass we don't need the neighbor information making the loading pattern straightforward. For each vector of pixel ì e, we find the corresponding equivalence tree root and write it back to memory. Processing the pixels in the same order as in first pass allows the algorithm to capitalize on previous iterations to find the roots faster. We can compute the true number of label n, using the popcnt_u32 (population count in a 32-bit integer) instruction and some vector comparisons. We use the property that by definition a root label points to itself. Algorithm 10 describes this process. Figure 7 represents the execution of the SIMD Rosenfeld pixel algorithm on a 12 × 4 image and SIMD register of size of 4. The outlined area shows the steps of algorithm 9. The VecUnion operation is not detailed here but the pattern is similar to the one in figure 6 . The modifications in the image from the scan (9 -> 4, 18 ->4) in figure 7 are due to the equivalence table being embed in the image. 
SIMD Rosenfeld sub-segment algorithm
The SIMD rosenfeld pixel algorithm works well for low and medium image densities but for high image densities the performance collapses due to the pixel-recursive labels leading to more iterations in the VecFind while loop. To address this issue at the cost of a few more operations, we introduce the SIMD Rosenfeld sub-segment algorithm (v2). The only difference with the SIMD Rosenfeld pixel x , the current vector of pixels in (i, j), w , the width of the image 1 // Shuffles :
x ì 0 // mask 5 // x labels initialization and min labels propagation : 
algorithm lies in the way we produce new labels (Fig. 8) .
We define a segment as a sequence of same value pixels. A subsegment is a segment bounded by the size of a vector. We use the conflict detection instructions from AVX512CD to compute a conflict-free subset ( ì c f ss) which allows us, with the count leading zero instruction (lzcnt_epi32) and some bit manipulation, to retrieve the index of the first element of a sub-segment. By making the pixel point directly to the sub-segment start, we can reduce the number of VecFind iterations to only 1 jump per sub-segment. Algorithm 11 describes theses changes and figure 9 shows the key states of the sub-segment start computation code. Lines 14 to 16 of algorithm 11 are optional for the correctness of the algorithm but improve the performance. x , the current vector of pixels in (i, j), w , the width of the image 1 // Shuffles :
x ì 0 5 // x labels initialization and min labels propagation : We use OpenMP for the parallel implementation and make the assumption that the memory model is NUMA with shared memory between processors. SIMD algorithms have an increased pressure on memory bandwidth which tends to reduce the multicoreparallelism efficiency if the application is not compute bound.
Parallel SIMD algorithms
The approach we follow in our parallel implementations of the SIMD Rosenfeld pixel and sub-segment algorithms is based on a divide-and-conquer method described in [4] . The image is split into p sub-images, with p the number of cores. This parallel algorithm minimizes the number of merges required by taking a pyramidal approach but diminishes the number of active cores at a given time. It needs log 2 (p) steps to complete the merge. Each step is fully parallel and does not require atomic instructions to update the equivalence table as this scheme merges borders of disjoint sets (Fig. 10) .
First, each processor core takes a sub-image horizontal strip and applies the first pass of either algorithm. Except for the modified loop indexes, there is no difference between the sequential and parallel code. Then, the next step consists in applying a pyramidal merge. As we divide the image by the number of cores available p, the total number of merges required is equal to p −1. For each merge step, we divide the number of active cores doing a two way merge until we have only one core left. On the border line between two subimages, we apply a merging pass on each column. For each SIMD vector of pixels, we have at most two calls to VecUnion. Algorithm 12 describes the body of the border merge loop. Finally, we apply the second pass which does not require any code modification compared to the sequential version.
Algorithm 12: SIMD Border merging
Input: T , the image / equivalence table, ì a, ì b, ì c, ì d , four vector of neighbor labels, ì
x , the current vector of pixels in (i, j), w , the width of the image 1 // Shuffles :
x ì 0 6 // Equivalence trees simultaneous merges :
BENCHMARK 5.1 Benchmark protocol
In our benchmark protocol, we aim to study the behavior of CCL algorithms in the widest range of configurations available, and to see how they compare to each other. For reproducible results, MT19937 [14] was used to generate images of varying density (d ∈ [0% − 100%]) and granularity (д ∈ {1 − 4}). The image density is the ratio of black and white pixels in the image: a black image (with zero pixel set) has a density of 0% and a white image (with all pixel set) has a density of 100%. An image of granularity д is composed of д ×д macro-pixels set to 1 or 0. Granularity increases the local homogeneity of an image. Natural and structured images lead to processing times roughly equal to those of random images with д = 4, while д = 1 provides difficult and unstructured configurations to stress the algorithms and especially those manipulating segments or run-length's. This benchmark methodology for CCL algorithms was previously used in [12] , [3] , [4] , [8] .
We tested the performance of the available SIMD vector length (128, 256, 512) in single and multicore on a dual socket Intel Xeon(R) Gold 6126 running at 2.6 Ghz (turbo-boost off). Before any testing, it was unclear whether the new 512 vector length would have a positive impact on the performance due to the additional stress on the memory bandwidth and the frequency throttle. This is especially true while fully exploiting SIMD and multicore due to bandwidth saturation. Our results are summarized in table 1, 2 and figures 11, 12 and are discussed in the following section. The performance of these new algorithms is compared to the classic pixel-based algorithms with DT (Rosenfeld and Rosenfeld+DT) and to the fastest run-length based segment labeling algorithm (LSL ST D and LSL RLE ) [4] . Figure 11 shows the execution time (in cycles per pixels) on images of varying densities. We observe that for both versions, there is a bump around 45% which corresponds to the percolation threshold in 8-connectivity. We can also see that the main difference in performance between the pixel and sub-segment algorithms happens for higher densities. The pixel version being slower because of the recursive pixel labels leading to a longer while loop in VecFind (as seen in Sec. 4.3). This performance difference grows with the number of cores due to the added cost of find operations in the border merging step. faster than the LSL versions. In the multi-threaded case, this ratio grows to ×3. Compared to LSL, the SIMD versions are faster only for g=1, which is the worst case for full-segment labeling like LSL (the strategy to save memory accesses, especially with the RLE version becomes more profitable than in the mono-threaded case where the pressure on the memory is lower). SIMD Scalability (128 / 256 / 512) In the single-threaded case, doubling the SIMD size provides a speedup around 1.5. In multithreaded case, this ratio still exists but only for 128 / 256 registers. For 256 / 512 registers, the ratio drops to 1.2, due to bandwidth saturation.
Results analysis
Thread scalability: Depending on the SIMD size, the scalability of the algorithms varies in the interval ×10−×13 for 512-bit registers and up to ×15 − ×18 for 128-bit registers. That is an efficiency between 40% (for 512-bit registers) up to 75% (for 128-bit registers). Considering all the instructions for the control-flow and the label propagation within a register, for such an irregular algorithm, we consider the results acceptable.
Performance with image size: The best performance is achieved when the image fits in the cache for all algorithms. (see Tab. 3). For a granularity equal to 4, LSL RLE is still the fastest algorithm. On lower granularities, the new algorithms outperform Rosenfeld+DT by a factor ×2.3 up to ×2.8 for g=4, and from ×2.6 up to ×3.4 for g=1. We also observe that, for larger images, the pixel-based algorithm SIMD v1 becomes faster than the sub-segment-based algorithm SIMD v2. These two new algorithms are specially well-suited to very complex / un-structured / quasi-random images. 
CONCLUSION
This paper presented two new CCL direct algorithms for SIMD multi-core architectures. The work is motivated by a large number of applications in computer vision and could also be of interest for the graph community with a new Union-Find SIMD algorithm. The use of SIMD instructions in CCL and more generally in computer vision is scarce due to the complexity of the code vectorization process. We have developed a performance efficient and portable SIMD algorithms for the Rosenfeld algorithm that outperform scalar pixel-based algorithms by a factor of ×2.3 up to ×3.4.
We found out that the new optimized SIMD algorithms are performing better than the State-of-the-Art algorithms on small images of fine granularity and on single core CPU. While it does not scale as well as run-length based algorithms, a lot of applications which do not require big images but instead search for high throughput could benefit from it. In future work, we plan to test our algorithms on new ARM/Fujisu SVE architectures.
