Abstract
To confirm an individual's identity accurately and reliably, iris recognition systems analyse the texture that is visible in the iris of the eye. The rich random pattern of the iris constitutes a powerful biometric characteristic suitable for biometric identification in large-scale deployments. Identification attempts or de-duplication checks require an exhaustive one-to-many comparison. Hence, for large-scale biometric databases with millions of enrollees, the time required for a biometric identification is expected to increase significantly. In this work we analyse techniques that accelerate Hamming distance-based comparisons of binary biometric reference data, i.e. iris-codes, in large-scale iris recognition systems while preserving the biometric performance. Focus is put on software-based optimizations, an efficient two-step iris-code alignment process referred to as TripleA, and a combination thereof. Benchmarking the throughput and identifying potential bottlenecks of a portable iris recognition system based on commodity hardware is of particular interest. Based on the conducted experiments we point out practical boundaries of large-scale comparisons in CPU-based iris recognition systems, bridging the gap between the fields of iris recognition and software design.
Introduction
The rich random structure of the iris, and hence its resistance to false matches, constitutes one of the most powerful biometric characteristics [1]. Following Daugman's approach [1], which represents the core of most public operational deployments, four processing components form an iris recognition system: (1) acquisition, where most current deployments require subjects to fully cooperate with the system in order to capture images of sufficient quality; (2) pre-processing, which includes the detection of the pupil and the outer iris boundary. Subsequently, the iris (approximated in the form of a ring) is normalized to a rectangular texture. To complete the pre-processing, parts of the iris texture which are occluded by eyelids, eyelashes or reflections are detected and stored in a corresponding noise mask; (3) feature extraction, in which an iris-code is generated by convolving local regions of the pre-processed iris texture with filters and encoding the responses into bits. This binary data representation enables compact storage and rapid (4) comparison, which is based on the estimation of Hamming distance (HD) scores between pairs of iris-codes and corresponding masks. In the comparison stage circular bit shifts are applied to iris-codes and HD scores are estimated at different shifting positions, i.e. relative tilt angles. The minimal obtained HD, which corresponds to an optimal alignment, represents the final score. It is important to note that the number of shifting positions K employed to determine an appropriate alignment between pairs of iris-codes may vary depending on the application scenario. Some public deployments of iris recognition go as far as K = 21 shifting positions when handheld cameras are used, for which it is more difficult to ensure an upright capture orientation [2]. Since the minimum over all shifting positions is taken, score distributions are skewed towards lower HD scores, which (for a given threshold) increases the probability of a false match by the factor K [2].
Nowadays iris recognition technologies are already deployed in numerous nation-wide projects. Simplicity in design and development as well as the usage of commodity hardware are driving factors behind the deployment of large-scale biometric systems, e.g. the Indian Aadhaar project [3], in which thousands of CPU cores process millions of transactions on a daily basis. In such systems identification attempts or de-duplication checks might represent a bottleneck, since these require an exhaustive 1:N comparison, where N represents the number of subjects registered with the system. In particular, comparison time represents a crucial factor, which dominates the overall computational workload in any large-scale biometric identification system, especially if large values of K, the number of shifting positions, are unavoidable.
Contribution of Work
In this work focus is put on an iris recognition system which performs a CPU-based exhaustive search for each authentication attempt. The presented study represents a more common scenario, in contrast to existing studies, which analyse hardware-specific acceleration of iris recognition systems. Our analyses include a comparative study of the most efficient ways to count disagreeing bits between iris-codes. The potential of manual loop-unrolling as well as different extensions to the x86 instruction set architecture for microprocessors are analysed. In addition, multithreading techniques and statistical optimization of micro-operations are considered. Furthermore, we estimate the inter-relation between throughput and rotation compensation provided by an iris recognition system. In order to further accelerate a single pair-wise comparison of iris-codes, we build upon the work of [4], where we proposed a novel technique for comparing pairs of iris-codes, which we refer to as Accelerated Accuracy-preserving Alignment (TripleA). This method focuses on the alignment process, in which an adjustable two-step search procedure is employed in order to efficiently determine alignments between iris-codes. Within this procedure only a fraction of shifting positions has to be considered during a single pair-wise comparison, while covering the same range of possible tilt angles. In this work, we enhance the TripleA scheme by applying it to an optimized CPU-based iris recognition scheme. We show that the TripleA method can be seamlessly integrated, such that the resulting system takes full advantage of TripleA on top of software-based optimisations. In summary, this work provides detailed guidance on how to substantially accelerate large-scale iris biometric systems on commodity hardware in an accuracy-preserving manner, by combining software-based optimizations with a technique for efficient iris-code alignment.
Moreover, summarized key observations might as well provide explanations for anomalies reported in existing studies.
Organisation of Article
This article is organized as follows: related works are discussed in Sect. 2. In Sect. 3 the employed iris recognition system is summarized. A detailed analysis of software-based acceleration techniques is given in Sect. 4 and the TripleA method is described in Sect. 5. Experimental results are presented in Sect. 6. Finally, conclusions are drawn in Sect. 7.
Related Work
To circumvent the bottleneck of an exhaustive 1:N comparison, different concepts have been proposed in order to reduce the workload in an iris biometric (identification) system. We might differentiate between four key concepts: (1) coarse classification or "binning", (2) a serial combination of a computationally efficient and a conventional system, (3) indexing schemes, and (4) hardware-based acceleration.
By binning an iris biometric database into several classes, the workload can be divided by the number of classes, given that irises of registered subjects are equally distributed among them. Natural features to be utilized include eye position (left or right) [5] or eye colour [6, 7]. Recent advances in the field of soft biometrics suggest further possible classification based on gender [8], age groups [9], or ethnicity [10, 11] (for further details on soft biometrics the reader is referred to [12]). Instead of creating tangible, human-understandable classes, it is also possible to rely on distinct iris texture features [13] [14] [15]. Binning is equivalent to the combination of biometric systems. Hence, classification errors might significantly increase the false non-match rate (FNMR) of the overall system. Moreover, the potential benefit of binning is limited by the number of bins, which determines the factor by which the database size can be reduced.
Within serial combinations, computationally efficient biometric systems are used to extract a short-list, i.e. a small fraction, of most likely candidates. This procedure might be referred to as pre-screening. While generic iris recognition systems already provide a rapid comparison, more efficient biometric comparators can be obtained by employing compressed versions of original iris-codes during pre-screening [16, 17]. Further, a rotation-invariant iris recognition scheme can be applied in the pre-screening step [18]. Similar to binning approaches, a serial combination of a computationally efficient and an accurate (but more complex) scheme might increase the FNMR of the overall system. However, a serial combination enables finer control of the resulting trade-off between computational effort and accuracy by choosing an adequate size for the short-list.
Indexing schemes aim at constructing hierarchical search structures for iris biometric data, which tolerate a certain amount of biometric variance. Such schemes substantially reduce the overall workload of a biometric identification, e.g. to log N in the case of a binary search tree, where N is the number of registered subjects. Such search structures might be designed for iris-codes [19, 20] as well as iris images [21] [22] [23]. While the majority of works report hit/penetration rates on distinct datasets, required computational efforts are frequently omitted. The application of complex search structures on rather small datasets may as well cloud the picture about actual gains in terms of speed and leaves the scalability of some approaches questionable.
Adapting comparison procedures to adequate hardware, e.g. multiple cores within a CPU, allows for parallelization [24]. By simultaneously executing a number of threads the runtime can be significantly reduced, since a 1:N comparison can be performed in parallel on various subsets of equal size. The estimation of HD scores at various shifting positions during alignment can also be parallelized. Moreover, iris-code comparisons can be efficiently performed on the GPU using GPGPU or CUDA [25], on FPGAs [24, 26], or on other specialized hardware like CELL processors [27]. Apart from hardware-based acceleration, most of the presented schemes either fail to provide a significant acceleration or suffer from a significant decrease in recognition accuracy. Hence, existing approaches often obtain a trade-off between biometric performance (recognition accuracy) and speed-up, compared to a traditional iris recognition system. In practice most concepts do not allow for a seamless integration into a conventional identification system. The majority of hardware-specific acceleration techniques for iris recognition systems are custom-built, which makes it difficult to derive generally applicable methodologies or concepts. Moreover, anomalies in runtime tests are frequently left uncommented.
Iris Recognition System
The following subsections summarize the key components of the employed iris recognition system.
Preprocessing and Feature Extraction
In the employed iris recognition system, which builds upon common processing components, the iris of a given sample image is detected by applying a contrast-adjusted Hough transform and transformed to a rectangular texture of 512×64 pixels. The enhanced texture is obtained by applying contrast limited adaptive histogram equalization (CLAHE). In the feature extraction stage the enhanced texture is divided into stripes, resulting in 10 one-dimensional signals, each one averaged from the pixels of 5 adjacent rows (the upper 512×50 pixels are analysed). The first feature extraction method follows the Daugman-like 1D-LogGabor feature extraction algorithm of Masek [28] (LG) and the second follows the algorithm proposed by Ma et al. [29] (QSW), which is based on a quadratic spline wavelet transform. Both feature extraction techniques generate an iris-code IC, which consists of 512×10 = 5,120 bits. Fig. 1 illustrates the described processing chain for a sample iris image. Custom implementations of the employed segmentation and feature extractors are freely available in the University of Salzburg Iris Toolkit (USIT) [30]. For further details on the employed feature extraction algorithms the reader is referred to [31]. Note that a compression of iris-codes, e.g. to 2,048 bits as suggested in [1], might cause a decrease in biometric performance [16], especially in challenging unconstrained scenarios.
Iris-Code Comparison
In the comparison stage circular bit shifts are applied to iris-codes and HD scores are estimated at different shifting positions, i.e. relative tilt angles. In the used scheme a 1-bit shift equals 0.7° of rotation. Let shift(IC, k) denote an iris-code IC shifted by k bits. Assuming that blocks of B bits are processed at a time, the final comparison score between a query and a reference iris-code, IC Q and IC R, and their corresponding noise masks, M Q and M R, is estimated as the minimum, over all considered shifting positions, of the fraction of disagreeing bits among the bits that are valid in both noise masks.
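The score described above is commonly written as a masked fractional Hamming distance. The following is a sketch under stated assumptions rather than the authors' exact formula: the shift variable k, the maximum shift n, the block index i, and the operator name shift are introduced here for illustration, mask bits set to 1 are assumed to mark valid positions, and the same shift is assumed to be applied to the query code and its mask.

```latex
\[
\mathrm{HD}(IC_Q, IC_R) \;=\; \min_{k \in [-n,\,n]}
\frac{\sum_{i} \bigl\lVert \bigl(\mathrm{shift}(IC_Q,k)_i \oplus IC_{R,i}\bigr)
      \cap \mathrm{shift}(M_Q,k)_i \cap M_{R,i} \bigr\rVert}
     {\sum_{i} \bigl\lVert \mathrm{shift}(M_Q,k)_i \cap M_{R,i} \bigr\rVert}
\]
```

Here the sums run over all B-bit blocks of the iris-code, and the single division is carried out once per shifting position after all blocks have been accumulated.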
Since iris-codes can be shifted prior to comparison and only a single division is required, the workload for calculating scores between iris-codes is dominated by the following three (per-block) processing steps:
1. XOR: the exclusive or (⊕) detects disagreeing bits between two blocks of B bits each (B being the block size), resulting in a bit block of the same size where 1s indicate differing bits.
2. POPCNT: the population count (‖⋅‖), or Hamming weight, counts the number of 1s in the vector extracted in the first step, i.e. the number of detected differences.
3. ADD: the number of disagreeing bits is added up (∑) over all B-bit blocks.
Of these processing steps, POPCNT represents the most complex one, and most of the presented software-based optimisations focus on speeding up its calculation (see Sect. 4). Nevertheless, the other two steps are also analysed where appropriate.
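The three per-block steps can be sketched as follows for a single, already applied alignment. This is a minimal illustration and not the authors' implementation: the names `Bits` and `masked_hd`, the 64-bit block layout, and the convention that mask bits set to 1 mark valid positions are assumptions, and `__builtin_popcountll` is a GCC/Clang builtin used here in place of the intrinsics or inline assembler discussed later.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed layout: 5,120 bits stored as 80 blocks of 64 bits.
constexpr std::size_t kBlocks = 5120 / 64;
using Bits = std::array<std::uint64_t, kBlocks>;

// Fractional Hamming distance at one alignment.
// Mask bits set to 1 mark valid (non-occluded) positions (assumption).
double masked_hd(const Bits& icq, const Bits& mq,
                 const Bits& icr, const Bits& mr) {
    std::uint64_t disagree = 0;
    std::uint64_t valid = 0;
    for (std::size_t i = 0; i < kBlocks; ++i) {
        const std::uint64_t m = mq[i] & mr[i];          // valid in both masks
        const std::uint64_t x = (icq[i] ^ icr[i]) & m;  // XOR: disagreeing bits
        disagree += __builtin_popcountll(x);            // POPCNT, then ADD
        valid += __builtin_popcountll(m);
    }
    // Single division at the end; 1.0 is returned if no bit is valid.
    return valid ? static_cast<double>(disagree) / static_cast<double>(valid)
                 : 1.0;
}
```

Calling `masked_hd` once per shifting position and keeping the minimum yields the final score.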
Software-based Optimizations
From a practical point of view, we identified seven settings as most relevant, S-1 to S-7, which are described in the following subsections.
Look-up Tables, Intrinsics and Loop-Unrolling
Assembler POPCNT (S-3): instead of high-level intrinsics, the POPCNT instruction is invoked directly via inline assembler code in a C++ function.
Manual loop-unrolling (S-4): even though loop-unrolling is activated for the compiler, this experiment measures the impact on the overall duration regarding the (manually adjusted) number of bit blocks processed per loop iteration.
SSE2 and AVX (S-5): we also consider calculating XOR for 128-bit blocks with the Streaming SIMD Extensions 2 (SSE2) instruction PXOR, the Advanced Vector Extensions (AVX) 256-bit equivalent VXORPD and the AVX2 256-bit version VPXOR, and we measure the impact of addition trees using the AVX2 8-bit and 16-bit vector instructions VPADDB and VPADDW. The latter operations can add 32 8-bit packed integers and 16 16-bit packed integers with one operation, respectively.
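The manual loop-unrolling of S-4 can be sketched as below, with 8 blocks of 64 bits handled per loop iteration. The function name `xor_popcount_unrolled8` and the requirement that the block count is a multiple of 8 are assumptions made for this illustration, and `__builtin_popcountll` (a GCC/Clang builtin) stands in for the POPCNT invocation of the actual settings.

```cpp
#include <cstddef>
#include <cstdint>

// Counts disagreeing bits between two buffers of 64-bit blocks.
// `blocks` is assumed to be a multiple of 8, so that every iteration
// consumes 8 x 64 bit = 64 bytes from each buffer.
std::uint64_t xor_popcount_unrolled8(const std::uint64_t* a,
                                     const std::uint64_t* b,
                                     std::size_t blocks) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < blocks; i += 8) {
        sum += __builtin_popcountll(a[i + 0] ^ b[i + 0]);
        sum += __builtin_popcountll(a[i + 1] ^ b[i + 1]);
        sum += __builtin_popcountll(a[i + 2] ^ b[i + 2]);
        sum += __builtin_popcountll(a[i + 3] ^ b[i + 3]);
        sum += __builtin_popcountll(a[i + 4] ^ b[i + 4]);
        sum += __builtin_popcountll(a[i + 5] ^ b[i + 5]);
        sum += __builtin_popcountll(a[i + 6] ^ b[i + 6]);
        sum += __builtin_popcountll(a[i + 7] ^ b[i + 7]);
    }
    return sum;
}
```

Whether the compiler keeps this exact instruction order depends on the optimisation level, which is one reason to measure the unrolling factor empirically.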
Multithreading and Statistical MicroOps Optimisation
Multithreading (S-6): iris-code comparisons are split across multiple threads. As in the previous settings, S-2 to S-5, POPCNT and ADD operations are performed alternately (PAPA). In a first implementation a given query iris-code is compared to all pre-shifted versions of the stored reference iris-codes. Hence, no shifting operations have to be performed at the time of comparison, while the storage requirement, which is usually not a crucial factor, increases. In an alternative implementation the query iris-code is shifted prior to comparison against all stored non-shifted reference iris-codes. Both settings, the two PAPA variants evaluated as S-6a (pre-shifted references) and S-6b (shifted query), describe the same transposed algorithm and result in the same number of bit comparisons.
Statistical micro-ops optimisation (S-7): static data dependency, latency and throughput analyses are utilized to minimise the latencies of micro-operations. The resulting strategies, the PPAA counterparts of the two PAPA variants, perform all POPCNT operations first and add up all intermediate results afterwards.
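The difference between the PAPA and PPAA orderings can be sketched as follows for 8 already XOR-ed blocks. The function names and the pairwise order of the final additions are assumptions of this sketch; an optimising compiler is free to reorder either version, so pinning the instruction order in practice requires inline assembler.

```cpp
#include <cstdint>

// PAPA: POPCNT and ADD alternate, so every ADD waits for the previous sum.
std::uint64_t count_papa(const std::uint64_t* x) {  // x: 8 XOR-ed blocks
    std::uint64_t sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += __builtin_popcountll(x[i]);
    }
    return sum;
}

// PPAA: all POPCNTs are issued first into independent temporaries,
// and the intermediate results are added up afterwards.
std::uint64_t count_ppaa(const std::uint64_t* x) {
    std::uint64_t c[8];
    for (int i = 0; i < 8; ++i) {
        c[i] = __builtin_popcountll(x[i]);
    }
    return ((c[0] + c[1]) + (c[2] + c[3])) + ((c[4] + c[5]) + (c[6] + c[7]));
}
```

Both functions return the same count; only the dependency structure between the operations differs.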
Accelerated Accuracy-preserving Alignment
The following subsections present an analysis of HD scores estimated from genuine iris-code comparisons across various shifting positions, which motivates the adjustable two-step search procedure referred to as TripleA [4].
Iris-Code Analysis
For both feature extractors Fig. 2 shows the HD scores across different shifting positions for three genuine comparisons of iris-codes. It can be seen that for each feature extraction algorithm the HD score curves of the three genuine comparisons are almost identical. Within a certain range HD scores constantly decrease towards the minimum (best) score. This range is enclosed by local maxima resulting in HD scores significantly beyond 0.5. For the sample HD scores in Fig. 2 these local maxima can be detected at shifting positions of ±8 bits for LG and ±6 bits for QSW. A detailed analysis of this phenomenon is provided in [4]. Intuitively, the distance between the shifting position resulting in a minimum HD score and those of the surrounding local HD score maxima might be approximated by the average length L of 1-bit and 0-bit sequences, as ±L bit shifts are expected to cause the most drastic misalignment. The sequence of HD scores between genuine iris-codes across various shifting positions might be interpreted as an oscillation whose amplitude decreases with the distance to the minimum score. For such a signal it can be empirically verified that distances between consecutive vertices are virtually the same for a constant value of L, even in case of large standard deviations.
TripleA
The TripleA approach comprises the following two key steps: (1) estimation of near-optimal alignment and (2) estimation of subset-minimum. An example of the approach is illustrated in Fig. 3 .
In the first step the range of K = 2n + 1 shifting positions [−n; n], where n denotes the maximum shift, is divided into 2⌈n/s⌉ intervals, where s denotes the employed step-size. Then HD scores are estimated at interval boundaries, i.e. for a subset of 2⌈n/s⌉ + 1 shifting positions. In other words, the sequence of scores, interpreted as a signal, is sampled every s bits. For a genuine comparison a sampling with at most the average length L of 1-bit and 0-bit sequences, s < L, is expected to detect a minimum score which represents a near-optimal alignment. We consider an alignment as near-optimal if the corresponding shifting position is close enough to the optimal alignment to reveal an HD score which is significantly smaller than those at the remaining sampling positions. For the sample comparisons of Fig. 2 near-optimal alignments would be found in the range of approximately ±2 bit shifts.
After detecting a near-optimal alignment at shifting position p, the interval [p − s + 1; p + s − 1] is considered for the second step, where s denotes the step-size. Note that the scores for positions p ± s have already been estimated in the first step. Based on a linear search, the second step detects a minimum HD score for a subset of 2(s − 1) shifting positions. That is, with n denoting the maximum shift, the number of shifting positions to be considered is reduced to K′ = 2⌈n/s⌉ + 1 + 2(s − 1). To further accelerate the TripleA alignment procedure, it is suggested to process only half of the subset detected in the first step during the second step. This bisected interval is defined by p and the minimum of the surrounding HD scores at p ± s. Hence, the number of shifting positions is further reduced to K″ = 2⌈n/s⌉ + s. In the example of Fig. 3 the interval [p − s + 1, p − 1] would be chosen for the linear search of the second step, since the HD score at shifting position p − s is smaller than that at p + s. This variant is referred to as TripleA-Single-Sided (TripleA-SS). Fig. 4 shows the number of shifting positions to be considered. To obtain a maximum speed-up this number has to be minimized, such that s = √(2n)/√2 = √n and s = √(2n) represent the theoretical optimal step-sizes in terms of speed-up for TripleA and TripleA-SS, respectively. In [4] we showed that the step-size can be dynamically estimated from a single reference iris-code during enrolment; however, this dynamic estimation was not found to yield any significant gains in terms of performance. Hence, we restrict ourselves to applying static values of s for each comparison performed by the system. In this case a suitable value can be averaged from a training set of extracted iris-codes.
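The two-step search described above can be sketched as follows. This is an illustrative sketch under assumptions and not the reference implementation of [4]: the names `Alignment`, `triple_a` and `hd_at` are hypothetical, `hd_at(k)` is assumed to return the HD score at shifting position k, sampling positions are taken as multiples of s clamped to [−n, n], and n ≥ 0 and s ≥ 1 are assumed.

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <limits>
#include <map>

struct Alignment {
    int shift;        // shifting position with the lowest score found
    double score;     // HD score at that position
    int evaluations;  // number of HD computations spent
};

Alignment triple_a(const std::function<double(int)>& hd_at,
                   int n, int s, bool single_sided) {
    // Step 1: sample the score signal every s bits within [-n, n].
    std::map<int, double> sampled;
    const int m = (n + s - 1) / s;  // ceil(n / s)
    for (int j = -m; j <= m; ++j) {
        const int k = std::max(-n, std::min(n, j * s));
        if (sampled.find(k) == sampled.end()) sampled[k] = hd_at(k);
    }
    auto best = sampled.begin();
    for (auto it = sampled.begin(); it != sampled.end(); ++it) {
        if (it->second < best->second) best = it;
    }

    // Step 2: linear search between the neighbouring sampled positions.
    const double inf = std::numeric_limits<double>::infinity();
    const bool has_l = best != sampled.begin();
    const bool has_r = std::next(best) != sampled.end();
    const int p = best->first;
    int lo = has_l ? std::prev(best)->first + 1 : p;
    int hi = has_r ? std::next(best)->first - 1 : p;
    if (single_sided) {
        // TripleA-SS: keep only the side whose sampled neighbour is lower.
        const double l = has_l ? std::prev(best)->second : inf;
        const double r = has_r ? std::next(best)->second : inf;
        if (l < r) hi = p; else lo = p;
    }
    Alignment out{p, best->second, static_cast<int>(sampled.size())};
    for (int k = lo; k <= hi; ++k) {
        if (k == p) continue;  // already evaluated in step 1
        const double d = hd_at(k);
        ++out.evaluations;
        if (d < out.score) { out.score = d; out.shift = k; }
    }
    return out;
}
```

With n = 16 and s = 4 and a near-optimal position away from the range borders, this sketch spends 15 evaluations for TripleA and 12 for the single-sided variant, in line with the counts given by the formulas above.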
Experiments
The following subsections describe the experimental setup and summarize results obtained by the presented approaches.
Experimental Setup and Methodology
Experimental evaluations are carried out on the CASIAv4-Interval iris database [32]. The database consists of N = 2,639 good-quality 320×280 pixel NIR iris images of 249 subjects. We consider two types of experiments, where in both experiments an iris-code is compared against shifted versions of another one:
Experiment 1 (E-1): the maximum number of N(N−1)/2 = 3,480,841 iris-code cross-comparisons is performed. Based on the obtained scores we identify an adequate trade-off between biometric performance and provided rotation compensation. Subsequently, diverse settings with the aim of accelerating these iris-code cross-comparisons are compared and the best setting is identified. For time measurements we execute a total number of 40 iterations and report the median elapsed time. The considered number of iterations minimizes the influence of outliers on the time measurements, which assures the significance of relative improvements or degradations in comparison speed. This experiment might reflect a de-duplication check on an iris-code database with N registered subjects.
Experiment 2 (E-2): the dataset is partitioned into a reference set of 2,500 iris-codes and a query set of 139 iris-codes. To simulate identification attempts on a large-scale database, the reference set is extended by replicating it 20,000 times, resulting in a set of N = 2,500×20,000 = 50,000,000 iris-codes. Note that the obtained set is used for runtime experiments only. For the best setting of E-1, in terms of throughput, all 139 identification attempts (1:N) are performed and the median elapsed time is reported for various degrees of rotation compensation. Subsequently, the TripleA method is applied with different parameter configurations on top of the best setting of E-1 in order to obtain further speed-ups.
The main difference between these experiments is that in E-1, the de-duplication experiment, a total number of N query iris-codes are successively compared against the database, while in E-2, the identification experiment, a single query iris-code is compared against a huge database.
Biometric performance is estimated in terms of FNMR at a target false match rate (FMR) and equal error rate (EER) obtained from E-1. The test system for measuring the duration of E-1 and E-2 with different settings uses an x86_64 Linux operating system with kernel version 4.4 and GCC 5.3.0 as C++ compiler. While other CPU types, e.g. ARM-based, have been analysed with respect to the required operations [33], x86_64 hardware is considered most relevant with a focus on large-scale biometric systems. The utilised CPU is an Intel Core i7-6700 with sufficient DDR4-SDRAM 2133.
In order to identify an appropriate degree of rotation compensation in E-1, we first calculate EERs and FNMRs at an FMR of 0.01%, denoted as FNMR 0.01, considering ±n shifting positions during alignment. The progress in terms of EER and FNMR 0.01 with respect to rotation compensation is shown in Table 1. As can be seen, the majority of misalignments is compensated by ±8 bit shifts (∼6°), while biometric performance converges at approximately ±16 bit shifts (∼11°). Weighing recognition accuracy against the required bit-shifting, we consider n = 16, resulting in 2n + 1 = 33 shifting positions, a reasonable trade-off for the used iris recognition systems, which yields an EER of 0.80% and an FNMR 0.01 of 1.75% for LG and an EER of 0.74% and an FNMR 0.01 of 1.06% for QSW.
Software-based Optimizations
Without any further optimisation the 32-bit population count implementation in S-2, which uses intrinsics to invoke the SSE4 POPCNT instruction, provides a tenfold speed-up compared to S-1. The 64-bit version doubles the amount of data processed per instruction and is therefore even faster. It is not twice as fast as the 32-bit implementation due to the overhead of the larger 64-bit address handling for data access and pointer dereferencing. Based on this observation subsequent settings process blocks of B = 64 bits.
The inline assembler of S-3 also provides a clear speed-up over the high-level POPCNT intrinsic calls used in S-2.
Focusing on S-4, Fig. 5(a) shows that the preferred number of 64-bit blocks processed per loop iteration is 8. We identify two reasons for this behaviour: on the one hand, 8 64-bit blocks fit very well in the general purpose registers of the x86_64 processor and no memory access is needed for the XOR, POPCNT and ADD operations, see Fig. 6(b) lines 20-36; on the other hand, 8 × 64 bits are exactly 64 bytes, which is the size of one CPU cache line. Since a cache line copied from memory is exactly 64 bytes, it is preferable to process the complete cache line, resulting in a favourable cache hit/miss ratio. We therefore recommend processing data in 64-byte blocks and storing it as a contiguous array for an optimal exploitation of the CPU caches. Hence, in settings S-6 and S-7 a total number of 8 64-bit blocks are processed per loop iteration.
Settings S-5a, S-5b and S-5c make use of SSE2, AVX and AVX2 instructions to process bigger data chunks with the XOR operation. However, no significant speed-up over the common x86 64-bit XOR instruction is obtained. The reason is straightforward: SSE works on specific registers, the so-called 128-bit XMM registers, and AVX on the 256-bit YMM registers. Data has to be loaded to and retrieved from these registers before it can be used with SSE/AVX instructions. In contrast, the SSE4 POPCNT instruction operates on the 64-bit general purpose registers of a CPU. Therefore, a transfer between these registers is necessary, and the overhead of these transfers is higher than that of a straightforward processing by the common XOR instruction, which operates on the same registers as the POPCNT instruction. SSE and AVX are optimised for algorithms which perform many operations on a comparably small amount of data. Calculating a large number of iris-code comparisons, which requires only very few operations on extreme amounts of data, is not such a problem. Settings S-5d and S-5e, which implement the AVX2 vector addition, are slower for the same reasons. Note that the SSSE3 implementation tested in S-5e is considered the fastest POPCNT implementation by experts in the field [34]. In contrast, we observe that the hardware POPCNT instruction used in S-2 to S-4 is clearly superior to the SSSE3 implementation. For older CPUs where no POPCNT instruction is available, the SSSE3 implementation could still be of interest, since it is faster than an 8-bit look-up table.
The common idea of comparing a freshly extracted query iris-code to a large pre-shifted database of reference iris-codes is represented in S-6a. As shown in Fig. 5(b), for 1 to 3 threads this setting behaves as expected, but starting from 4 threads the runtime stagnates at roughly 4 seconds, i.e. dividing the workload among more threads provides no further speed-up. As one stored iris-code consists of 512×20 bits (1,280 bytes), we have 3,480,841 comparisons, and for each comparison a new pre-shifted iris-code has to be loaded from memory for each of the 33 shifting positions, roughly 137 GB of data are transferred from memory to the CPU. Our test system uses DDR4-2133 RAM with a speed of 17.0 GB/s per channel according to the specification [35]. We use a common dual-channel setup and hence have a maximum RAM bandwidth of 34 GB/s, so transferring 137 GB from memory to the CPU takes at least 4 seconds. In this setup the execution speed of the implemented algorithm is limited by the relatively slow RAM-to-CPU interface. RAM as a bottleneck is a common problem for highly multithreaded tasks performing a few operations on a large amount of data [36]. The bottleneck is aggravated by the fact that this biometric scenario floods the CPU caches with new data and practically does not reuse them at all, resulting in a very poor cache hit/miss ratio.
[Fig. 6, panel (a): C++ / Inline ASM]
In S-6b shifted versions of the given query iris-code are computed and compared to non-shifted reference iris-codes of the database. From a computational perspective this setup seems less intuitive, because the shifted versions have to be computed before the actual comparison can start. However, the shifted query iris-codes can stay in the CPU caches across all comparisons and only one 1,280 byte block has to be loaded for each comparison, resulting in far fewer memory accesses, since the CPU caches have a high hit count for the shifted iris-codes [37]. Therefore, S-6b scales much better with multiple threads, as highlighted in Fig. 5(b), and the subsequent setting is based on this strategy. Moreover, in S-6b we observe that 5 threads are actually slower than 4 threads. The used Intel Core i7-6700 processor has 4 physical cores, each of which can process 2 threads at once due to Hyper-Threading [38]. As depicted in Fig. 7, if 5 threads are used, 2 threads have to share the L1 and L2 cache of one core. Therefore, the iris-code prefetching, see Fig. 6(b) lines 1-2, is not as effective as when one thread uses the complete cache. This effect occurs since both threads are working on completely independent parts of the iris-code database. For the same reason 8 threads are only negligibly faster than 4 threads.
Setting S-7 implements the results obtained by the Intel Architecture Code Analyzer [39], which suggests the PPAA strategy instead of the PAPA strategy of previous settings (see Sect. 4), as shown in Fig. 6. As can be seen in Table 2 and Fig. 5(b), this results in a minor speed-up, which would be more significant for larger databases.
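The S-6b idea of shifting the query once and reusing the shifted versions for every reference can be sketched as follows. The memory layout (10 rows of 512 bits, each row stored as 8 words of 64 bits, with bit j of a row at word j/64 and bit j%64), the shift direction, and the names `Row`, `Code`, `rotate_row` and `preshift_query` are assumptions of this sketch and not taken from the evaluated implementation.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int kRowBits = 512;
constexpr int kWords = kRowBits / 64;
constexpr int kRows = 10;
using Row = std::array<std::uint64_t, kWords>;
using Code = std::array<Row, kRows>;

// Circularly shifts one 512-bit row by k bit positions (k may be negative).
Row rotate_row(const Row& in, int k) {
    Row out{};
    const int r = ((k % kRowBits) + kRowBits) % kRowBits;
    const int w = r / 64;  // whole-word part of the shift
    const int b = r % 64;  // remaining bit offset
    for (int i = 0; i < kWords; ++i) {
        const std::uint64_t lo = in[i] << b;
        const std::uint64_t hi = b ? in[i] >> (64 - b) : 0;
        out[(i + w) % kWords] |= lo;      // bits staying within the word
        out[(i + w + 1) % kWords] |= hi;  // bits spilling into the next word
    }
    return out;
}

// Computes all 2n + 1 shifted versions of the query once, before comparing.
std::vector<Code> preshift_query(const Code& q, int n) {
    std::vector<Code> out;
    out.reserve(2 * n + 1);
    for (int k = -n; k <= n; ++k) {
        Code c;
        for (int row = 0; row < kRows; ++row) {
            c[row] = rotate_row(q[row], k);
        }
        out.push_back(c);
    }
    return out;
}
```

For n = 16 the 33 shifted versions occupy 33 × 1,280 bytes, about 42 KB, if the noise mask is shifted and stored alongside the iris-code, which is small compared to the reference database streamed from RAM.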
That is, optimising the order of the instruction sequence for the used microarchitecture with a static code analyser can still improve the overall performance, even though modern CPUs support out-of-order execution, which should (in theory) do this automatically.
The presented results are obtained using a Linux operating system. It is important to note that identical performance rates are achieved on other types of operating systems (OSs), since basic memory operations, in particular cache management, are independent of the used OS.
Accelerated Accuracy-preserving Alignment
For different configurations of TripleA using static step-sizes, [...] drastic decrease in accuracy while providing further speed-up, as will be shown in the following subsection.
Table 4: Overview of time measurements (in seconds) for different settings in experiment E-2, performing an identification with N = 50,000,000 at 33 shifting positions using s = 4.
Simulation of Large Scale Identification
For E-2, a large-scale identification scenario, the best setting PPAA resulting from E-1 is selected as baseline. Fig. 8 presents the absolute number of iris-code comparisons per second. Again, emphasis should be placed on the relative difference in throughput rates of different configurations. Due to the efficient CPU caches, the number of comparisons per second depends on how well the shifted iris-codes fit into the caches, and the break-even point from 4 to 5 threads can similarly be observed in the 1:N identification scenario, due to 2 threads sharing one cache. Therefore, using 8 threads reveals no significant speed-up over 4 threads. Both setups compare roughly 4.6 million iris-codes per second using ±8 bit shifts (≃ 80 million comparisons per second without shifting).
Based on the findings depicted in Table 1 and Table 3, further scenarios in E-2 utilizing TripleA and TripleA-SS are performed with a maximum shift of n = 16, a step-size of s = 4, and PPAA as core HD score comparator. The results of these experiments are summarized in Table 4 and depicted in Fig. 9.
From a theoretical standpoint the expected speed-up can be approximated by comparing the number of shifted iris-code comparisons to that of the baseline algorithm PPAA.
The baseline algorithm has to process all shifting positions, resulting in 33 comparisons. TripleA with the selected parameters performs 9 comparisons in Step 1 and in general 6 more in Step 2. In the special case of Step 1 yielding −n or n as result, only 3 comparisons are performed in Step 2. This is considered negligible for an approximation, and TripleA is considered to perform 15 comparisons per iris-code. The special case of TripleA is the regular case of TripleA-SS, since only a single side is considered during Step 2. Therefore, the baseline performs 33 comparisons, TripleA 15 comparisons and TripleA-SS 12 comparisons, which results in an approximation of TripleA taking 45% and TripleA-SS only 36% of the time compared to the baseline. These theoretical considerations match the observed results in Table 4, taking measuring tolerance into account. In effect, TripleA and TripleA-SS scale linearly with the number of comparisons relative to the baseline algorithm PPAA, and all further combinations of maximum shift and step-size can be effectively approximated using the results from E-2. Fig. 9 further shows that TripleA and TripleA-SS yield no further anomalies that were not present in the PPAA baseline algorithm.
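The approximation above can be written out as a short worked calculation. The symbols n (maximum shift), s (step-size), K (baseline positions), K' (TripleA) and K'' (TripleA-SS) are labels assumed here for illustration; the numbers are those stated in the text for a maximum shift of 16 and a step-size of 4.

```latex
\begin{align*}
K   &= 2n + 1 = 2\cdot 16 + 1 = 33,\\
K'  &= 2\lceil n/s\rceil + 1 + 2(s-1) = 2\cdot 4 + 1 + 2\cdot 3 = 15,\\
K'' &= 2\lceil n/s\rceil + s = 2\cdot 4 + 4 = 12,\\
K'/K &= 15/33 \approx 0.45, \qquad K''/K = 12/33 \approx 0.36 .
\end{align*}
```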
Conclusions
In this work we analysed commodity hardware-based iris recognition systems, which perform a CPU-based exhaustive comparison on a large-scale database. We showed that utilising the POPCNT hardware instruction can significantly speed up biometric comparisons based on the Hamming distance. We identified that taking the CPU caches into consideration during the algorithm design is the most efficient way to circumvent potential RAM bottlenecks. Especially when making use of multithreading, ignoring these caches will lead to bottlenecks and even make the actual comparison algorithm secondary, since the greatest share of time is claimed by the RAM-to-CPU data transfer and not by the actual execution of the algorithm. This observation also affects the assessment of iris-code comparisons based on GPGPU/CUDA, since their speed-up is explained not only by the high number of cores (hardware shaders), but also by the higher memory bandwidth of video RAM (GDDR) compared to common RAM (DDR). Therefore, GPGPU/CUDA implementations have to deal with memory bottlenecks to a lesser extent. Awareness of cache line sizes on the target system can also greatly improve the data throughput, since it maximises cache hits, particularly in hotspot loops. Taking into account the aforementioned issues, it is shown that an optimized conventional CPU-based iris-biometric comparator can achieve a hundredfold speed-up compared to a naïve baseline comparator. As our 1:N results with different shift sizes show, the number of comparisons alone is not a sufficient indicator, since fitting all shifted iris-code versions into the CPU cache is a major performance factor, independent of the actual algorithm or the achieved comparisons per second. Further, our results show that by combining the TripleA algorithm with a fast multithreaded POPCNT implementation, response times of large-scale biometric systems can be further decreased, achieving a more than two-hundredfold overall speed-up.
Finally, it is important to point out that these findings may also be exploited in other software-based acceleration techniques, e.g. [17] .
