Abstract: Optimizing connected component labeling is currently a very active research field. The most effective current algorithms, although close in design, are based on different memory/computation trade-offs. This paper presents a review of these algorithms and a detailed benchmark on several Intel and ARM embedded processors, which makes it possible to focus on their advantages and drawbacks and to highlight how processor architecture impacts them.
INTRODUCTION
Binary Connected Component Labeling (CCL) algorithms deal with graph coloring and transitive closure computation. CCL algorithms play a central part in machine vision, because they are often a mandatory step between low-level image processing (filtering) and high-level image processing (recognition, decision). As such, CCL algorithms have numerous applications and derived algorithms, such as convex hull computation, hysteresis filtering, and geodesic reconstruction.
Designing a new algorithm is challenging, given both the overwhelming literature and the performance of the best existing algorithms. The goal might be a faster algorithm on some class of computer architecture, a minimal number of over-created labels, or the smallest theoretical complexity. Yet another goal is predictability. Given the current state of computing technology, reaching decent performance in practice requires CCL algorithms to take into account two features of current General Purpose Processors (GPP): the processor pipeline and the cache memories. That amounts to minimizing conditional statements (such as tests and comparisons) to reduce the number of pipeline stalls, and limiting random, sparse (typically vertical) memory accesses to lower cache misses.
Embedded processing applications continuously demand that bigger images be processed in less time while consuming as little energy as possible. That is why we focused on mobile processors in this study.
As an intermediate-level algorithm, CCL processes the output data coming from low-level algorithms (binary segmentation, ...) and provides abstract input data to other intermediate- or high-level algorithms. Usually, such abstract data, also called features, are the bounding rectangle (for target tracking) and the first-order statistical moments (surface, centroid, orientation, ...). So, although a standalone CCL algorithm can be considered as a first step, the pair "CCL + features computation" is the procedure that should actually be evaluated in the end.
Our contribution consists of three elements:
• an enhanced benchmark that incorporates random images with different granularities, which can be seen as a bridge between classical random images of varying density and database images,
• a performance benchmark with all state-of-the-art algorithms on embedded general purpose processors from Intel and ARM,
• an analysis from the energy point of view.
In the remainder of this paper, we describe modern algorithms, then the benchmark procedure and hardware. We then present the results on Intel and ARM architectures and finally provide a comparison from the energy point of view.
I. CONNECTED COMPONENT LABELING ALGORITHMS
Historical algorithms were designed by pioneers like Rosenfeld [14], Haralick [4], and Lumia [10], who designed pixel-based algorithms, and Ronse [13] for run-based algorithms. Modern algorithms derive from the historical ones and improve on them by replacing some components with more efficient ones. An extensive bibliography can be found in [5] and [16].
Except for the Contour Tracing algorithm [1], which is elegant but inefficient, all modern algorithms are two-pass (or fewer) algorithms; none is a data-dependent multi-pass algorithm. They share the same three steps:
• first labeling, which assigns a temporary/provisional label to each pixel and builds label equivalences,
• label equivalence solving, which computes the transitive closure of the graph associated with the label equivalence table,
• second labeling, which replaces each temporary label by the final label (usually the smallest one of the component).
They differ on three points: the mask topology, the number of tests for a given mask to find out the label, and the equivalence management algorithm.
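The three shared steps can be illustrated with a minimal sketch of a pixel-based, two-pass labeling in 8-connectivity. This is not the implementation of any of the benchmarked algorithms: the function name `two_pass_label`, the list-of-rows image representation, and the simple equivalence table without path compression are assumptions made for illustration only.

```python
def two_pass_label(image):
    """Label the 8-connected foreground components of a binary image.

    `image` is a list of rows of 0/1 values (hypothetical representation).
    Returns (labels, provisional_count).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    parent = [0]  # equivalence table; index 0 is reserved for the background

    def find(x):
        # Climb up to the root of the equivalence class (no compression here).
        while parent[x] != x:
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smallest label as the root of the merged class.
            parent[max(ra, rb)] = min(ra, rb)

    # Step 1: first labeling with the 4 already-visited neighbors.
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            neighbors = []
            if x > 0 and labels[y][x - 1]:
                neighbors.append(labels[y][x - 1])
            if y > 0:
                for dx in (-1, 0, 1):
                    nx = x + dx
                    if 0 <= nx < width and labels[y - 1][nx]:
                        neighbors.append(labels[y - 1][nx])
            if not neighbors:
                # No labeled neighbor: create a new provisional label.
                parent.append(len(parent))
                labels[y][x] = len(parent) - 1
            else:
                smallest = min(neighbors)
                labels[y][x] = smallest
                for n in neighbors:
                    union(smallest, n)

    # Step 2: equivalence solving. Since parent[i] <= i always holds,
    # a single ascending pass makes every entry point to its root.
    for i in range(1, len(parent)):
        parent[i] = parent[parent[i]]

    # Step 3: second labeling with the final (smallest) label.
    for y in range(height):
        for x in range(width):
            labels[y][x] = parent[labels[y][x]]

    return labels, len(parent) - 1
```

On a U-shaped component, the two vertical branches first receive two distinct provisional labels, which are merged when the bottom row is reached; the returned count of provisional labels is then larger than the number of final labels.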
Using the Rosenfeld mask (fig. 1), only two basic patterns trigger label creation (fig. 2), whatever the connectivity (here 8-connectivity). The first one is the stair. It is responsible for the unnecessary provisional labels created by pixel-based algorithms like Rosenfeld's. The second one is the concavity. Because CCL only sees a local neighborhood, label creation cannot be avoided in this case.
As figures 4 and 5 show, the execution time is not directly correlated with the total number of final labels, but with the number of stairs and concavities, which generate equivalence building. So, one way to improve CCL algorithms is to widen the label mask. That leads to block-based algorithms (fig. 1) like HCS_2 [7] and Grana [3], which respectively compute 2 and 4 labels from a 6-pixel and a 16-pixel neighborhood. Conversely, RCM [8] introduces a mask with only 3 neighbors in order to reduce the number of tests. Grana's mask can detect some concavities and avoid label creation if these concavities are small enough to fit entirely in the mask. But the only way to prevent label creation from stairs is to use a run-based algorithm like HCS [6] or Light-Speed Labeling (LSL) [9], which first detects the pixel adjacency in the neighborhood before assigning a label to the run. LSL uses a tricky line-relative labeling to generate an RLC coding and directly find adjacent runs on the previous line, whereas HCS has to perform a test on every pixel to decide whether to continue propagating a label or to record an equivalence.
The second way to improve algorithm efficiency is to reduce the number of tests. A decision tree (DT) [16] reduces the average number of neighbors to test in order to find the value to assign to the current label, based on the mask topology. For pixel-based algorithms, it decreases the complexity of 8-connectivity to that of 4-connectivity. For block-based algorithms, a DT is mandatory. Another way to reduce complexity is to perform path compression (PC) [2]. It is a step added to the Union-Find algorithm that performs a transitive closure while climbing up to the root of the equivalence. It has been proven that PC makes the Union-Find complexity grow with the inverse of the Ackermann function [15].
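Path compression can be sketched as follows. The class `UnionFind` and its method names are hypothetical and only illustrate the principle. This sketch links roots by smallest label, as is common in CCL, rather than by rank, so the amortized bound usually quoted for Union-Find with both union by rank and path compression is not claimed for this exact variant.

```python
class UnionFind:
    """Equivalence table with path compression (illustrative sketch)."""

    def __init__(self):
        self.parent = []

    def make_label(self):
        # A new label starts as the root of its own class.
        self.parent.append(len(self.parent))
        return len(self.parent) - 1

    def find(self, x):
        # First climb: locate the root of the class.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second climb: make every visited node point directly to the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Assumption: the smallest label becomes the root.
        small, large = min(ra, rb), max(ra, rb)
        self.parent[large] = small
        return small
```

After a chain of equivalences such as 4 → 3 → 2 → 1 has been built, a single `find(4)` rewrites the table so that labels 2, 3, and 4 all point directly to 1, which shortens every later lookup.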
Finally, the third point to improve is the equivalence management algorithm. Rosenfeld's algorithm uses Union-Find algorithm and the associated table to store the equivalences. An alternative approach with three tables has been proposed by [5] 
II. BENCHMARKS
We present here the images and processors used for benchmarking. We also provide a qualitative analysis of temporary label creation.
A. Random images generation and qualitative analysis
Depending on the OS and the compiler, the pseudo-random number generator embedded in the C library (libc) can change, so providing the seed is not enough for reproducible experiments. For that reason, the Mersenne Twister MT19937 [11] has been chosen, with seed = 0.
Papers usually evaluate CCL performance first with random images (varying the pixel density from 0% to 100%) as a hard-to-label benchmark, and second with an image database. But a database can be biased and may then favor some algorithms. In order to analyze the behavior of algorithms depending on some image properties (the size of connected components and the size of the smallest element compared to the algorithm's neighborhood scale), we decided to extend random images by changing the pixel granularity. The initial random image has a granularity of 1. We then create g-random images whose blocks of pixels have a size of g × g, with g ∈ [1 : 16]. The symmetrical shape of these blocks ensures an equitable treatment of the different algorithms. All the random images are 1024 × 1024 (width × height).
Figure 4 provides the temporary label distribution for granularity g ∈ {1, 2, 4} for pixel-based, run-based, and Grana's algorithms (red, magenta, and blue). The numbers of final labels (green), concavities (cyan), and stairs (orange) are also provided.
[Figure: Distribution of labels, concavities and stairs versus density for granularity g ∈ {1, 2, 4}]
First, if we compare the run-based and pixel-based label distributions, we can see that the run-based curve always has the same behavior (close to the final label curve), contrary to the pixel-based curve. The reason is that the number of concavities remains proportional to the number of final labels from one granularity to another. For g ≥ 2, the number of stairs becomes larger than the number of concavities, which increases the gap between the numbers of labels of pixel-based and run-based algorithms. That is why run-based algorithms have a better execution time as g grows: they avoid more and more label creation.
Concerning Grana's algorithm, it generates nearly the same number of temporary labels as pixel-based ones for g = 1. For g = 2, it comes closer to run-based algorithms, as its wide mask avoids many temporary labels. But for g ≥ 4, its wide mask does not avoid label creation, as 4-pixel-wide stairs and concavities extend beyond the pixel's neighborhood.
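A g-random image generator can be sketched as below. The function name `g_random_image` and its parameters are hypothetical, and the sketch assumes that each g × g block is drawn independently and set to foreground with a probability equal to the density. Python's `random` module is based on MT19937, but its seeding procedure is not guaranteed to match that of a C implementation, so the images produced here should not be expected to be bit-identical to those of the benchmark.

```python
import random


def g_random_image(size, density, g, seed=0):
    """Build a size x size binary image made of g x g constant blocks.

    `density` is the foreground probability in [0, 1] (assumption: blocks
    are drawn independently). Returns a list of rows of 0/1 values.
    """
    rng = random.Random(seed)  # Mersenne Twister generator, fixed seed
    blocks = (size + g - 1) // g  # number of blocks per row, rounded up
    coarse = [
        [1 if rng.random() < density else 0 for _ in range(blocks)]
        for _ in range(blocks)
    ]
    # Expand every coarse cell into a g x g block of identical pixels.
    return [
        [coarse[y // g][x // g] for x in range(size)]
        for y in range(size)
    ]
```

With `g = 1` this reduces to the classical random image of a given density; larger values of `g` produce coarser structures for the same density.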
B. Image database
The Standard Image Data-Base (SIDBA) has been used for natural image labeling. Gray-scale images have been automatically binarized with the Otsu algorithm [12]. For both random images and natural ones, we provide the cpp (cycles per pixel) of each algorithm, with features computation. The features extracted for each component are the bounding box ([x_min, x_max] × [y_min, y_max]) and the first statistical moments (S, S_x, and S_y).
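The extracted features can be sketched as a single scan over a labeled image. The function name `compute_features` and the dictionary layout are hypothetical. The sketch assumes that S is the pixel count (surface) of a component and that S_x and S_y are the sums of the x and y coordinates of its pixels, so that the centroid is (S_x / S, S_y / S).

```python
def compute_features(labels):
    """Accumulate bounding box and first moments for each non-zero label.

    `labels` is a list of rows of integer labels, 0 being the background.
    Returns a dict: label -> dict of xmin, xmax, ymin, ymax, S, Sx, Sy.
    """
    features = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab == 0:
                continue
            f = features.get(lab)
            if f is None:
                # First pixel of this component.
                features[lab] = {
                    "xmin": x, "xmax": x, "ymin": y, "ymax": y,
                    "S": 1, "Sx": x, "Sy": y,
                }
            else:
                f["xmin"] = min(f["xmin"], x)
                f["xmax"] = max(f["xmax"], x)
                f["ymin"] = min(f["ymin"], y)
                f["ymax"] = max(f["ymax"], y)
                f["S"] += 1   # surface (pixel count)
                f["Sx"] += x  # sum of x coordinates
                f["Sy"] += y  # sum of y coordinates
    return features
```

This version runs as a separate pass after labeling; an on-the-fly variant would instead accumulate the same quantities per provisional label during the first labeling and merge them when equivalences are solved.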
C. General Purpose Processors
In order to evaluate the impact of the architecture on the execution time, we selected two mobile processors from Intel, PenrynM (U9300, 1.2GHz, 10W) and HaswellM (4650U, 1.7GHz, 15W), and two embedded processors from ARM, CortexA9 (OMAP4460, 1.2GHz, 1.2W) and CortexA15 (Exynos5250, 1.7GHz, 1.7W). HaswellM and CortexA9 were chosen for the curves and tables; however, all the SIDBA results are reported in figures 9 and 10. Executable codes were generated with Intel ICC v14.0.1 and gcc-arm 4.6.3.
III. RESULTS AND ANALYSIS
a) Density behavior: Figure 5 shows that the algorithm curves, except that of HCS_2, are symmetrical about their maximum value. The abscissas of the maximum values lie in the [45%; 55%] range, depending on the algorithm. Concavities and stairs (fig. 4) lead to temporary label creation and label merging. They also increase the probability of having more tests to perform in the decision tree (e.g., a stair forces a traversal of the whole DT graph down to the label creation node "+1", figure 3) and, in doing so, increase the cpp.
One can observe that when the number of stairs and concavities decreases (g gets higher), the density curves tend to flatten. As described in [6], the HCS_2 algorithm makes no use of a decision tree, so it needs to load the neighborhood's labels for each pixel to label. As a result, it is not able to reduce the cpp when the density grows above 50%.
b) Granularity influence: Table I and figure 5 describe the behavior of the algorithms on images of different granularities. The main trend is that when g grows, the cpp drops: first quickly, [×0.49; ×0.69] for g ∈ {1, 2}, and then slowly, [×0.30; ×0.76] for g ∈ [2:16]. One can notice that LSL_RLE is the most accelerated when granularity grows, while LSL_STD is the most regular. This comes from their construction, as explained in [9]. LSL_RLE is inefficient for g = 1 because of its run-length encoding kernel. RCM is efficient, but only for g = 1; this is due to the smaller number of tests it performs compared to Rosenfeld, which is an efficient strategy on unstructured data.
(table II and fig. 6). This is mostly due to on-the-fly features computation (FC), which makes the last relabeling scan unnecessary [9]. FC is almost always faster than doing the second labeling, especially for LSL_RLE, where run-length coding speeds up FC. In fact, the addition of FC accelerates the LSL algorithm. LSL_STD is first for g ∈ [1:2] and LSL_RLE is first for g ∈ ]2:16]. For structured data (higher g values), the algorithm ranking becomes (first to last): LSL_RLE, LSL_STD, Grana, Suzuki, Rosenfeld, HCS, HCS_2, and RCM.
FC equally increases the cpp of all the other algorithms, by an amount that depends on g (even though the number of pixels is constant, as density is constant): from 4.3 cpp for g = 16 up to 9.0 cpp for g = 1 on HaswellM (18.8 cpp up to 25.6 cpp on CortexA9). These variations are explained by the structure of the image (fig. 4). If granularity is low, there are more labels than if granularity is high. So sparser memory accesses happen, leading to different numbers of cache hits and cache misses.
d) Architecture influence - HaswellM/CortexA9: The most noticeable differences between HaswellM and CortexA9 are the increase in cpp and the change in the ranking of the algorithms. Table III highlights CortexA9 for each algorithm. The lowest value means that the algorithm is comparatively less slowed down on CortexA9 than the others. One can remark that LSL_RLE, HCS, Grana, and LSL_STD make better use of the CortexA9 than HCS_2, RCM, Rosenfeld, and Suzuki. There are two explanations, conditional instructions and memory latency, which call into question the trade-offs made by the algorithms. RCM performs fewer tests for each pixel and fewer loads for foreground pixels, but more loads for background ones. These choices might be valuable on HaswellM but are less efficient on CortexA9. Grana and HCS_2 execute more tests than the others and fewer memory accesses, due to their block-based construction. The gap between LSL_RLE and LSL_STD is bigger on CortexA9 than on HaswellM, because LSL_RLE performs fewer memory accesses (only for the start and the end of runs). As HCS is a run-based algorithm, it performs fewer memory accesses than pixel-based ones.
(table IV) with min, average and max values for processing time and cpp, to allow direct comparison with other articles' results.
f) Architecture influence - A15/A9: The relative order of the algorithms is maintained, except for LSL_RLE, which is less accelerated than the others. This is due to its already optimized memory management, which benefits less from the A15 optimizations.
g) Energy consumption: Table V presents I_E, an energy index that is proportional to the average energy consumption (I_E = t × TDP of the whole dual-core processor). As the TDP reflects the power consumption of the two cores, I_E is higher than the real energy consumption. But as all the benchmarked processors have two cores, I_E preserves the order relation between processors. On HaswellM, PenrynM, and CortexA9, LSL_RLE is the best. On A15, it is LSL_STD. The CortexA15 is, right now, the most energy-efficient architecture.
IV. CONCLUSION AND FUTURE WORKS
In this paper, we have proposed a new detailed benchmark procedure for connected component labeling with granularity steps, complemented by the use of a standard database.
The benchmark procedure confirms that for real applications (that is, with features computation), the LSL_RLE algorithm outperforms all state-of-the-art algorithms on both Intel and ARM processors. For time predictability and standard deviation, LSL_STD is the best choice.
Future works will consider parallelization of CCL on GPP.
