Efficient Nearest Neighbor (NN) search in high-dimensional spaces is a foundation of many multimedia retrieval systems. A common approach is to rely on Product Quantization (PQ), which allows storing large vector databases in memory and computing distances efficiently. Yet, implementations of nearest neighbor search with Product Quantization have their performance limited by the many memory accesses they perform [1]. Following this observation, André et al. [2] proposed more efficient implementations of m×4 product quantizers leveraging specific SIMD instructions.
INTRODUCTION
The Nearest Neighbor (NN) search problem consists in finding the closest vector x to a query vector y among a database of N d-dimensional vectors. Efficient NN search in high-dimensional spaces is a requirement in many multimedia retrieval applications, such as image similarity search, image classification, or object recognition. These problems typically involve extracting high-dimensional feature vectors, or descriptors, and finding the NN of the extracted descriptors among a database of descriptors. For images, SIFT [25], GIST [29] or deep-learning-based [7] descriptors are often used.
Although efficient NN search solutions have been proposed for low-dimensional spaces, exact NN search remains challenging in high-dimensional spaces due to the notorious curse of dimensionality. As a consequence, much research work has been devoted to Approximate Nearest Neighbor (ANN) search. ANN search returns sufficiently close neighbors instead of the exact NN. Product Quantization (PQ) [19] is a widely used [24, 31] ANN search approach. PQ compresses high-dimensional vectors into short codes of a few bytes, enabling in-memory storage of large databases.
Fast response time is a key feature of PQ. It is enabled by Asymmetric Distance Computation (ADC), which efficiently computes distances between uncompressed query vectors and compressed database vectors using in-memory lookup tables. Yet, despite being faster than regular distance computation, ADC remains bottlenecked by the many memory accesses it performs [1]. To date, much of the research work has been devoted to the development of efficient inverted indexes [5, 30], which reduce the number of ADCs required to answer NN queries. Recently, there has also been interest in increasing the performance of the ADC procedure itself, with the introduction of PQ Fast Scan [1], Quick ADC [2], and Polysemous Codes [11]. Quick ADC leverages SIMD shuffle instructions to avoid memory accesses and implement very fast ADC; yet it is restricted to 4-bit sub-quantizers and has only been evaluated on simple inverted indexes, thus lacking results for more advanced indexes. Polysemous Codes leverage the freedom to choose code indices to embed a binary code, which is used to efficiently compute a Hamming distance and prune ADC computations; yet, the use of a Hamming distance over binary codes affects precision.

* Fabien André was with Technicolor when he contributed to this work.
In this paper, we present Quicker ADC, which is a generalization addressing the limits of Quick ADC [2] .
First, Quicker ADC introduces two new features to improve performance and accuracy through the use of the latest revision of SIMD instructions, namely AVX512: (i) irregular product quantizers, which combine sub-quantizers of different sizes to allow using 5-bit or 6-bit sub-quantizers, and (ii) split tables for lookup tables larger than registers, which allow an efficient implementation of 8-bit sub-quantizers from 6-bit or 7-bit shuffles.
Second, Quicker ADC is implemented in the reference library FAISS [18] to allow comparison with reference optimized implementations of state-of-the-art schemes. It is open-sourced at http://github.com/technicolor-research/faiss-quickeradc/, easing comparisons with our schemes or their adoption.
Last, we compare the performance of Quicker ADC to PQ codes and Polysemous Codes with multiple index types (i.e., simple inverted indexes, inverted multi-indexes, and inverted indexes based on HNSW) for both the SIFT1000M dataset and the Deep1B dataset. Quicker ADC consistently outperforms Polysemous Codes [11], which is the state-of-the-art solution for fast response time. For example, on SIFT1000M, for a budget of 0.25 ms per query and 128-bit codes, Quicker ADC (variant 24×{6, 6, 4}) achieves R@1 of 0.23 and R@100 of 0.60 with an IMI ($K = 4096^2$) and R@1 of 0.24 and R@100 of 0.68 with IVF HNSW ($K = 2^{18}$), whereas Polysemous Codes achieve R@1 of 0.23 and R@100 of 0.47 with an IMI and R@1 of 0.18 and R@100 of 0.36 with IVF HNSW.
BACKGROUND
In this section, we first review the data-structures and algorithms commonly used for nearest neighbor search with product quantization. We then analyze the impact of product-quantization parameters on search speed and recall. Finally, we introduce the capabilities of the latest processors supporting AVX-512 and their potential for supporting product quantization.
Nearest neighbor search with PQ
We describe the different steps for building an indexed database of vectors, and how product quantization is used to store compressed representations of these vectors. This database can then be used to efficiently search for the nearest neighbor of a query vector: a subset of database vectors for which the distance to the query vector needs to be computed is first identified, and these distances are then computed using an efficient procedure called ADC, which operates on the compressed representations.
Index.
Computing distances for all vectors of a 1-billion vector database would be prohibitive. To tackle this issue in high-dimensional spaces, the common approach is to split the database into partitions using a coarse indexing structure [5, 19]. The input vector space is partitioned into $K$ Voronoi cells using a coarse quantizer $q_i$. Vectors lying in each cell are stored in an inverted list. At query time, the inverted index is used to find the ma closest cells to the query vector, and distances for vectors in the inverted lists of these cells are computed. The two most popular approaches are inverted indexes (IVF) [19, 20] and multi-indexes [5]. The number of cells is limited in IVF (e.g., $K = 65536$) due to the cost of training the IVF and computing distances to each cell. Multi-indexes [5] lift this limit and allow a more fine-grained index (e.g., $K = 2^{24}$) at the price of imbalance in inverted list sizes. Recently, alternative approaches have leveraged nearest neighbor graphs (e.g., HNSW) [9, 26] to allow faster navigation in the index while avoiding the imbalance in inverted list sizes, for improved performance and accuracy. Our work is orthogonal to these approaches and compatible with any index.
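The role of the coarse index can be illustrated with a minimal sketch. It assumes a brute-force coarse quantizer over a handful of centroids; the helper names (`build_inverted_lists`, `closest_cells`) are hypothetical and do not correspond to the FAISS API.

```python
import heapq

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_inverted_lists(centroids, database):
    """Store each database vector id in the inverted list of its closest cell."""
    lists = [[] for _ in centroids]
    for vec_id, x in enumerate(database):
        cell = min(range(len(centroids)), key=lambda c: sq_dist(x, centroids[c]))
        lists[cell].append(vec_id)
    return lists

def closest_cells(centroids, query, ma):
    """Return the indexes of the ma cells whose centroids are closest to the query."""
    return heapq.nsmallest(ma, range(len(centroids)),
                           key=lambda c: sq_dist(query, centroids[c]))

# Toy example: K = 3 cells, 4 database vectors, ma = 2 cells visited per query.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
database = [(1.0, 1.0), (9.0, 1.0), (0.5, 9.0), (8.0, -1.0)]
lists = build_inverted_lists(centroids, database)
cells = closest_cells(centroids, (7.0, 0.0), ma=2)
# Only vectors stored in the selected cells are candidates for distance computation.
candidates = [vec_id for c in cells for vec_id in lists[c]]
print(lists, cells, candidates)
```

In this toy setting, the vector stored in the third cell is never scanned, which is the source of both the speedup and the approximation.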
In order to fit the complete database into memory, short codes, which are much more compact, are stored instead of full vectors. To obtain the short code, the residual $r(x) = x - q_i(x)$ is encoded using a product quantizer, described in the next section. Indexed databases therefore use two quantizers: a quantizer for the index ($q_i$) and a product quantizer to encode residuals into short codes. The energy of residuals $r(x)$ is smaller than the energy of input vectors $x$, thus the quantization error is lower when encoding residuals rather than input vectors $x$ into short codes. In the rest of the paper, we denote $y = r(x)$.
As a special case, when product quantization is used without an index, all vectors are stored in a single list. The short codes represent the input vectors rather than the residuals (i.e., y = x).
Short codes with product quantization.
Vector quantizers. To encode residual vectors as short codes, PQ builds on vector quantizers. A vector quantizer $q$ maps a vector $y \in \mathbb{R}^d$ to a vector $c_i \in \mathbb{R}^d$ belonging to a predefined set of vectors $C$. Vectors $c_i$ are called centroids, and the set of centroids $C$, of cardinality $k$, is the codebook. Generally, the quantizer is chosen so that it maps the vector $y$ to its closest centroid $c_i$:

$$q(y) = \arg\min_{c_i \in C} \|y - c_i\|$$

The quantizer extends as an encoder enc which encodes $y$ into the index $i \in \{0, \dots, k-1\}$ of the vector $c_i$ it is mapped onto (i.e., $\mathrm{enc}(y) = i$, such that $q(y) = c_i$). The short code $i$ only occupies $b = \lceil \log_2(k) \rceil$ bits, which is typically much lower than the $d \cdot 32$ bits occupied by a vector $y \in \mathbb{R}^d$ stored as an array of $d$ single-precision floats (32 bits each).
To keep the quantization error low enough for ANN search, a very large codebook (e.g., $k = 2^{64}$ or $k = 2^{128}$) is required. However, training such codebooks is not tractable, both in terms of processing and memory requirements.
Product quantizers. Product quantizers overcome this issue by dividing a vector $y \in \mathbb{R}^d$ into $m$ sub-vectors, $y = (y^0, \dots, y^{m-1})$. Each sub-vector $y^j \in \mathbb{R}^{d/m}$, $j \in \{0, \dots, m-1\}$, is quantized independently using a sub-quantizer $q^j$. Each sub-quantizer $q^j$ has a distinct codebook $C^j = (c^j_i)_{i=0}^{k-1}$ of cardinality $k$. The cardinality of the product quantizer codebook $C = C^0 \times \dots \times C^{m-1}$ is $k^m$. Thus, a product quantizer has many centroids ($k^m$) while only requiring storing and training $m$ codebooks of cardinality $k$. A product quantizer can encode a vector $y$ into a short code by concatenating the codes produced by its sub-quantizers:

$$\mathrm{enc}(y) = \left(\mathrm{enc}^0(y^0), \dots, \mathrm{enc}^{m-1}(y^{m-1})\right)$$

The short code $\mathrm{enc}(y)$ thus occupies $m \cdot b$ bits.
Interestingly, the order of vectors in the codebook $C^j$ is not constrained. Thus, one can choose how vectors of the codebook are mapped to indexes. This freedom allows storing additional information and has been used in [1] to nest a 4-bit product quantizer into the 8-bit product quantizer and in [11] to encode a binary code onto the index. In both cases, this allows pruning the computation thanks to the approximation of the distance provided by the 4-bit product quantizer or the binary code. With binary codes, the resulting combination is called polysemous codes [11] and achieves state-of-the-art performance for approximate nearest neighbor search with product quantization (and inverted multi-indexes).
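The encoding performed by a product quantizer, as described in this section, can be sketched in a few lines. This is a toy illustration with hand-written codebooks (real codebooks are trained, typically with k-means); the function name `pq_encode` is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_encode(y, codebooks):
    """Encode y as the list of closest-centroid indexes, one per sub-vector."""
    m = len(codebooks)
    sub_dim = len(y) // m  # assumes d is divisible by m
    code = []
    for j, codebook in enumerate(codebooks):
        sub = y[j * sub_dim:(j + 1) * sub_dim]
        code.append(min(range(len(codebook)),
                        key=lambda i: sq_dist(sub, codebook[i])))
    return code

# Toy quantizer: d = 4, m = 2 sub-quantizers, k = 4 centroids each (b = 2 bits).
codebooks = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)],
]
code = pq_encode([0.9, 0.1, 0.2, 1.7], codebooks)
print(code)  # a 2 x 2-bit code, i.e., 4 bits instead of 4 x 32 bits
```

With these two codebooks of 4 centroids each, the product quantizer implicitly represents 16 centroids while storing only 8.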
2.1.3 Search in the compressed domain. PQ-based search identifies the nearest neighbor by computing the distance between the query vector z and a subset of database vectors.
As a first step, the ma closest cells of the index quantizer q i are determined (typically, ma = 8 to 64 for IVFs). The inverted lists of these cells correspond to all the candidate vectors for which the distances must be computed. For efficiency reasons, the distance is computed directly on the compressed representation using a procedure called ADC.
ADC works the following way. First, for each cell, the residual $z' = r(z)$ of the query vector is computed ($z' = z$ if no index is used). From this residual $z'$, a set of $m$ lookup tables $\{D^j\}_{j=0}^{m-1}$ is computed, where $m$ is the number of sub-quantizers of the product quantizer. The $j$th lookup table comprises the distances between the $j$th sub-vector of $z'$ and all centroids of the $j$th sub-quantizer:

$$D^j = \left( \|z'^j - C^j[0]\|^2, \dots, \|z'^j - C^j[k-1]\|^2 \right)$$

Second, all candidates are scanned and the lookup tables are used to compute the distance between the query vector $z$ and each short code $c$ as follows:

$$\mathrm{adc}(z, c) = \sum_{j=0}^{m-1} D^j[c[j]]$$

Thus, ADC computes the distance between a query vector $z$ and each code $c$ by summing the distances between the sub-vectors of $z'$ and the centroids associated with code $c$ in the $m$ sub-spaces of the product quantizer. As the number of codes in inverted lists is large compared to $k$, the number of centroids of sub-quantizers, using lookup tables avoids computing $\|z'^j - C^j[i]\|^2$ for the same $i$ multiple times. Also, lookup tables provide a significant speedup by performing the computation directly in the compressed domain rather than reconstructing (i.e., decompressing) database vectors.
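The two steps of ADC described above (table computation, then scan) can be sketched as follows. This is a scalar illustration with toy codebooks, not the optimized implementation; the names `build_tables` and `adc` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_tables(z_res, codebooks):
    """Step 1: one table per sub-quantizer, holding the squared distance
    between the query sub-vector and every centroid of that sub-quantizer."""
    m = len(codebooks)
    sub_dim = len(z_res) // m  # assumes d is divisible by m
    return [[sq_dist(z_res[j * sub_dim:(j + 1) * sub_dim], c) for c in codebooks[j]]
            for j in range(m)]

def adc(tables, code):
    """Step 2: sum one table lookup per sub-code."""
    return sum(tables[j][code[j]] for j in range(len(code)))

codebooks = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)],
]
tables = build_tables([1.0, 0.0, 0.0, 2.0], codebooks)  # computed once per query
codes = [[2, 1], [0, 0], [3, 3]]                        # scanned short codes
dists = [adc(tables, c) for c in codes]
print(dists)
```

The tables cost m·k distance computations per query, whereas each scanned code only costs m lookups and additions; these lookups are the memory accesses that the rest of the paper seeks to avoid.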
Impact of PQ parameters.
The two parameters of a product quantizer, $m$, the number of sub-quantizers, and $k$, the number of centroids of each sub-quantizer, impact (1) the memory usage of codes, (2) the recall of ANN search and (3) search speed. The first tradeoff is between memory usage and accuracy. Both the memory usage of codes ($\lceil \log_2(k^m) \rceil = m \cdot b$ bits, where $b = \lceil \log_2(k) \rceil$) and accuracy increase with the total number of centroids of the product quantizer ($k^m$). In practice, 64-bit codes ($2^{64}$ centroids) or 128-bit codes ($2^{128}$ centroids) are used in most cases.
The second tradeoff is between ANN accuracy and search speed. For a constant memory budget of $m \cdot b$ bits per code, the respective values of $m$ and $b$ impact accuracy and speed. Decreasing $m$, which implies increasing $b$, increases accuracy [19]. A detailed analysis of the impact of $m$ and $b$ on performance is given in [1]. In a nutshell, $b$ impacts the time for each lookup in the table: if $b$ is too large, the lookup table does not fit into the fastest memory (i.e., the processor cache, which is limited in capacity) and lookup time increases significantly. $m$ impacts the number of lookups: as long as the tables stay in the fastest memory, the lower $m$, the better the performance. In addition, the cost of computing the lookup tables grows exponentially with $b$, thus a smaller $b$ also improves performance by reducing the cost of computing the tables; this last effect is best seen on fine-grained indexes, where the table computation cost becomes significant.
The standard notation for Product Quantization codes is m×b, which specifies both parameters. The most common Product Quantization codes are m×8 (e.g., PQ 8×8 or PQ 16×8): their tables fit in the processor cache and are not too costly to compute, and their sub-codes align with bytes, so that they can be accessed without shifting or masking.
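The arithmetic behind these tradeoffs can be made concrete with a small helper. It is only a sketch assuming 32-bit float table entries, as in standard ADC; the function name `pq_footprint` is hypothetical.

```python
import math

def pq_footprint(m, k, float_bits=32):
    """Sizes implied by an m x b product quantizer with k centroids per sub-quantizer."""
    b = math.ceil(math.log2(k))
    return {
        "bits_per_subcode": b,
        "code_bits": m * b,                           # memory per database vector
        "table_bits": k * float_bits,                 # one lookup table
        "all_tables_bytes": m * k * float_bits // 8,  # tables computed per query/cell
    }

# Three parametrizations that all yield 64-bit codes.
for name, (m, k) in {"PQ 16x4": (16, 16), "PQ 8x8": (8, 256), "PQ 4x16": (4, 65536)}.items():
    print(name, pq_footprint(m, k))
```

For the same 64-bit budget, the tables occupy 1 KiB (16×4), 8 KiB (8×8) and 1 MiB (4×16): the last one no longer fits in a first-level cache of a few tens of KiB, which illustrates why m×8 is the usual choice.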
ADC computation using SIMD
The common parametrizations of PQ (i.e., m×8 PQ codes) already exploit the fastest memory available on processors (i.e., the L1 cache), leaving no room for easy improvement. A common technique to improve performance is to use instructions that process a vector of values rather than a single value at each CPU cycle: this principle, called SIMD (Single Instruction, Multiple Data), allows a significant performance boost for signal processing and matrix operations. Yet, for PQ, moving to SIMD improves performance for additions but the implementation remains bottlenecked by memory accesses [1]. Indeed, SIMD does not allow an efficient implementation of in-memory table lookups, even using the gather instructions introduced in recent processors [1, 15]. While SIMD can add up to 16 floating-point numbers (512 bits) at once, only 2 concurrent memory accesses can be performed per cycle in each CPU core. Those memory accesses are the bottleneck in Product Quantization.
Thus, previous work [1, 2, 10] moved lookup tables from memory to SIMD registers and leveraged in-register shuffles to implement lookups¹. Yet, the width of SIMD registers (128-512 bits) challenges this approach. Indeed, for common PQ codes (i.e., m×8), each lookup table occupies 8192 bits ($k = 2^8 = 256$ floats). Previous work worked around this limit by (i) using 4-bit subquantizers and (ii) quantizing floats to 8-bit integers, using only the 7-bit positive range. The resulting lookup tables are thus small enough (128 bits) to fit SSE or AVX2 registers. This has allowed Quick ADC [2] to achieve a significant performance improvement with a moderate loss of accuracy. This loss of accuracy comes mainly from the reduced precision of 4-bit subquantizers when compared to 8-bit subquantizers and, to a smaller extent, from the use of quantized distances.
All of these works [2, 10] are limited to m×4 PQ codes, as they follow the initial proposal of [1] to use pshufb.
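To make the idea of an in-register lookup concrete, the following sketch emulates the behavior of pshufb on one 128-bit lane in plain Python: a 16-entry table of 8-bit values is indexed by 16 sub-codes at once. It is a scalar model meant for illustration only; the table values are arbitrary quantized distances.

```python
def pshufb(table, indexes):
    """Emulate pshufb on one 128-bit lane: 16 byte lookups in a 16-byte table.
    As in the hardware instruction, an index with its high bit set yields 0;
    otherwise only the low 4 bits of the index are used."""
    assert len(table) == 16 and len(indexes) == 16
    return [0 if idx & 0x80 else table[idx & 0x0F] for idx in indexes]

# One quantized distance table (16 centroids of a 4-bit sub-quantizer) ...
table = [3, 7, 1, 9, 4, 4, 8, 2, 6, 5, 0, 11, 13, 12, 10, 15]
# ... and the same 4-bit sub-code position taken from 16 different database codes.
subcodes = [0, 15, 3, 3, 7, 1, 2, 10, 4, 5, 6, 8, 9, 11, 12, 13]
partial = pshufb(table, subcodes)
print(partial)
```

A single instruction thus replaces 16 memory lookups, but only because the table holds 16 one-byte entries, which is why this approach is tied to 4-bit sub-quantizers and 8-bit distances.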
SIMD capabilities and alternatives
AVX512, introduced in 2017 with Xeon Scalable processors, is a significant redesign of Intel's SIMD instruction set. It provides numerous additional shuffle instructions that can support larger tables as described in Table 1 . Currently available processors allow lookup tables of 32 or 64 16-bit values, thus 5-bit or 6-bit indexed lookup tables (i.e., m×5 or m×6 PQ codes). In addition, the wider registers (4x when compared to SSE and 2x when compared to AVX2) allow either an improved parallelism or more precise distances (16-bit instead of 8-bit). In the next section, we will explore how these new capabilities can be exploited for improving PQ efficiency.
Interestingly, the shuffle instruction pshufb (used in, e.g., m×4 PQ codes) can process between 128 (SSE) and 512 (AVX-512) bits of data per cycle, while the popcount instruction (used for binary codes) can process only 64 bits of data per cycle. Hamming distance (used in, e.g., Polysemous codes) is often considered much faster than product quantization's ADC (i.e., with in-memory lookup tables) thanks to this fast popcount instruction. Yet, pshufb is even faster², allowing product quantization to rival binary codes regarding distance computation speed. As a side note, our paper focuses on ADC, yet Hamming distance is conceptually closer to SDC [19]; SDC would trade accuracy for speed by avoiding the costs associated with distance table computation. This could be used to provide small product quantization codes competing directly with binary codes.
While AVX512 has improved capabilities (larger tables) and increased parallelism (i.e., number of lookups per cycle) thanks to wider registers, the gain obtained may be partially cancelled by the fact that processor cores running AVX2 or AVX512 code are down-clocked and thus run at a lower frequency than processor cores running sequential or SSE code [16]. Thus, an experimental evaluation is needed to assess the potential of these instructions for PQ-based applications. This is even more salient for non-exhaustive search, as the overall query time results from the index search, the lookup table computation (if any), and the ADC-based scanning of all candidates, whose performance changes differently as the code or index is altered.
QUICKER ADC
In this section, we describe Quicker ADC, a generalization of Quick ADC [2] that aims at improving accuracy through the use of the latest AVX-512 SIMD instructions. Interestingly, AVX-512 provides shuffles indexed by 5 or 6 bits, thus allowing a more precise quantization of vectors. Yet, their use for ADC is not straightforward, as practical constraints prevent the use of m×5 or m×6 PQ codes.
The commonly used m×8 or m×16 product quantizers are composed of 8-bit or 16-bit sub-quantizers. This choice stems from the fact that manipulating byte-aligned or word-aligned values is both simpler and faster. Earlier work on using SIMD for PQ departs from these choices by relying on m×4 PQ codes, yet benefits from the fact that 4-bit values naturally align to bytes, as 2×4 = 8 bits.
Unfortunately, m×5 or m×6 PQ codes rely on 5-bit or 6-bit values, neither of which aligns well to computer words (16 bits) or bytes (8 bits). This prevents a computationally efficient implementation of such PQ codes, which is the purpose of using SIMD.
A naive approach could be to add padding by storing 3 × 5 bits of PQ codes in each 16-bit word and leaving one unused bit in each word. Our initial experiments with this approach on an exhaustive search in the SIFT1M dataset show that a 16×4 PQ code (R@1³ of 0.159) outperforms a 12×5 PQ code (R@1 of 0.153). Our hypothesis is that the cost of using only 60 bits out of 64 outweighs the benefits of using 5-bit subquantizers in place of 4-bit subquantizers.
We therefore explore two solutions to avoid padding. The first solution, relying on new Irregular PQ codes, is presented in Section 3.1. The second solution, presented in Section 3.2, combines multiple large shuffles to implement each lookup for 8-bit subquantizers; it anticipates the availability of the AVX512 VBMI 7-bit shuffle in the near future. Both solutions require a SIMD-compatible memory layout similar to the one of Quick ADC [2], described in Section 3.3. Finally, Quicker ADC improves the distance quantization of Quick ADC [2] in order to allow using all 8 bits, and not just 7 bits, for improved accuracy, as explained in Section 3.4.
Irregular PQ
As a first solution to alleviate the alignment issue, we propose Irregular Product Quantization, which combines subquantizers of different sizes (4, 5 and 6 bits) that can all be implemented in SIMD, such that their combination aligns well to words (16 bits). A first variant groups one 6-bit subquantizer with two 5-bit subquantizers for a total of 16 bits, and a second variant groups two 6-bit subquantizers with one 4-bit subquantizer, also for a total of 16 bits. Multiple such groups are combined to form the complete Product Quantizer. We use the notation m×{a, b, c} for an Irregular Product Quantizer formed of m sub-quantizers grouped by three sub-quantizers of a bits, b bits and c bits. For example, the rest of the paper will often consider the 64-bit product quantizer 12×{6, 6, 4}, which has the following subquantizers: [6, 6, 4, 6, 6, 4, 6, 6, 4, 6, 6, 4]. We denote by g the number of sub-quantizers in a group (e.g., g = 3 for m×{6, 6, 4} and g = 2 for m×{4, 4}), so that m/g is the number of groups.
With subquantizers of different precisions, the allocation of input dimensions to subquantizers cannot be uniform. With product quantizers [19], the input vector is split into sub-vectors of equal dimension, which are then quantized independently. With irregular product quantizers, however, the input vector is not split into sub-vectors of equal dimension, so as to leverage the improved representation capabilities of finer subquantizers. We map input dimensions proportionally to the bit-width of the subquantizer. Let us consider a 128-dimension SIFT vector quantized by an irregular product quantizer 12×{6, 6, 4}. Each of the 4 groups of sub-quantizers is associated with 32 dimensions. Thus, within each group, 6-bit sub-quantizers quantize sub-vectors of 12 dimensions and 4-bit subquantizers quantize sub-vectors of 8 dimensions. Note that as long as m/g is a multiple of 2, Irregular Product Quantizers remain compatible with multi-indexes, which require the two subquantizers of the index to be aligned with the subquantizers of the product quantizer. If the number of input dimensions cannot be divided exactly, the remaining dimensions are added one by one to the subquantizers. Yet, this alters performance and should be avoided: one should always ensure that the allocation can be proportional. More specifically, when vectors are pre-processed by a PCA or a rotation (as in OPQ), the pre-processed vectors should be of a dimension that can be divided exactly. This implies that for an m×{6, 6, 4}, the number of dimensions of the pre-processed vectors must be divisible by (3 + 3 + 2)m/3 = 8m/3, and for an m×{6, 5, 5}, the number of dimensions of the pre-processed vectors must be divisible by (6 + 5 + 5)m/3 = 16m/3. For example, 128-dimension SIFT vectors can be encoded optimally by both 12×{6, 6, 4} and 12×{6, 5, 5} irregular PQ codes.

Table 2: SIMD operations required to perform 8-bit lookups
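The proportional allocation of dimensions described above can be sketched as follows. The function name `allocate_dimensions` is hypothetical, and this sketch simply rejects dimensions that cannot be split exactly instead of distributing the remainder one by one.

```python
def allocate_dimensions(d, m, group_bits):
    """Number of input dimensions assigned to each of the m sub-quantizers of an
    irregular product quantizer m x {group_bits}, proportionally to bit-width."""
    g = len(group_bits)
    if m % g:
        raise ValueError("m must be a multiple of the group size")
    total_bits = (m // g) * sum(group_bits)
    dims = []
    for _ in range(m // g):
        for bits in group_bits:
            if (d * bits) % total_bits:
                raise ValueError("d cannot be split proportionally")
            dims.append(d * bits // total_bits)
    return dims

print(allocate_dimensions(128, 12, (6, 6, 4)))  # 12, 12, 8 dimensions per group
print(allocate_dimensions(128, 12, (6, 5, 5)))  # 12, 10, 10 dimensions per group
print(allocate_dimensions(96, 12, (6, 6, 4)))   # 9, 9, 6 dimensions per group
```

Under this rule, a 96-dimension vector can be split exactly by 12×{6, 6, 4} (96 is divisible by 8m/3 = 32) but not by 12×{6, 5, 5} (96 is not divisible by 16m/3 = 64), in which case the sketch raises an error.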
To validate this first solution, we compare the different 64-bit codes using an exhaustive search in the SIFT1M dataset. Both 12×{6, 6, 4} (R@1 of 0.179) and 12×{6, 5, 5} (R@1 of 0.174) outperform a 16×4 PQ code (R@1 of 0.158). Additional results are given in the evaluation in Table 3 .
Split tables
AVX-512 brings 6-bit (vpermi2w) and 7-bit (vpermi2b) shuffles. These are only 2 bits and 1 bit away from the very common 8-bit sub-quantizers. Rather than relying on highly imbalanced irregular product quantizers, in particular an m×{7, 1} that would perform poorly, it becomes interesting to consider that each 8-bit lookup table can be split into four 6-bit lookup tables or two 7-bit lookup tables. As an example, we show an 8-bit lookup built from two 7-bit shuffles (vpermi2b) in Figure 1. The distance table $D$ is split into two halves ($D_0$ and $D_1$, each of which occupies 2 registers). Two shuffles (each having 3 registers as inputs) are performed to get the values indexed by the low 7 bits (e.g., $a_0^{1..7}$). The final values are then selected through a blend indexed by the 8th bit (e.g., $a_0^{8}$). This approach cannot be built on the 4-bit shuffle (pshufb from SSE4/AVX2), as it would require too many shuffles and blends. As shown in Table 2, performing an 8-bit lookup for 16 values using SSE4 requires 16 shuffles and 15 blends, an average of 1.94 operations/value. In comparison, when using AVX512 VBMI, only 0.05 operations/value are necessary. Due to the lack of available processors with AVX512 VBMI, we evaluate performance only for AVX512 BW (0.21 operations/value); yet, it is clear that vpermi2b from AVX512 VBMI will provide significant gains⁴.
Quicker ADC codes implemented with this approach require more instructions per lookup than irregular product quantizers; yet, they come with almost no compromise on accuracy when compared to m×8 PQ codes. In the evaluation, we will see that in some contexts they already outperform the alternative approach, and they will become particularly interesting as AVX512 VBMI-capable processors become available in the near future.
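The split-table principle can be checked with a scalar model: an 8-bit lookup is emulated with two 7-bit lookups (one per half-table) followed by a blend on the 8th bit. The functions below are hypothetical stand-ins for the shuffle and blend instructions, not bindings to actual intrinsics.

```python
def lookup7(half_table, indexes):
    """Model of a 7-bit shuffle: a 128-entry table indexed by the low 7 bits."""
    assert len(half_table) == 128
    return [half_table[i & 0x7F] for i in indexes]

def lookup8_split(table, indexes):
    """8-bit lookup built from two 7-bit lookups and a blend on the 8th bit."""
    d0, d1 = table[:128], table[128:]  # the two halves of the distance table
    low = lookup7(d0, indexes)         # candidate values if the 8th bit is 0
    high = lookup7(d1, indexes)        # candidate values if the 8th bit is 1
    return [h if i & 0x80 else l for i, l, h in zip(indexes, low, high)]

table = [(3 * i) % 251 for i in range(256)]  # arbitrary 8-bit "distances"
indexes = [0, 127, 128, 255, 42, 200]
print(lookup8_split(table, indexes))
```

The result matches a direct lookup `table[i]` for every index, at the cost of two shuffles and one blend per register of indexes instead of a single shuffle.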
Memory layout
Similarly to Quick ADC, Quicker ADC requires a transposed memory layout. Indeed, a SIMD in-register shuffle performs multiple lookups at once, but in a single lookup table (e.g., $D^0$). Therefore, shuffles must operate on a single component of multiple codes (e.g., $a_0, \dots, p_0$) at once, and not on the multiple components of a single code (e.g., $a_0, \dots, a_{15}$). Hence, to allow efficient loads from memory, all values of the SIMD register $a_0, \dots, p_0$ must be contiguous in memory, which is not the case with the memory layout of inverted lists (Figure 2a) used in common PQ implementations [11, 19].
The size of the transposed blocks depends on the shuffle used. In addition, values have a fixed width that depends on the shuffle instruction used: pshufb or vpermb operate on bytes (8 bits), while vpermw or vpermi2w operate on words (16 bits). Hence, multiple subcodes must be packed together to form bytes or words. We use the notation of irregular product quantizers to specify the packing applied (e.g., m×{6, 6, 4} packs 3 subcodes to form a 16-bit word (6 + 6 + 4 = 16), as shown in Figure 2c). This notation extends to regular product quantizers (e.g., Quick ADC [2]'s m×{4, 4} packs two 4-bit subcodes to form an 8-bit byte (4 + 4 = 8), as shown in Figure 2b). Split-table-based m×{8, 8} packs two 8-bit subcodes to form a word for use with vpermi2w, and split-table-based m×{8}

⁴ The software we release already includes an implementation supporting AVX512 VBMI, even though it could not be evaluated yet.
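Packing and transposition can be illustrated with a minimal sketch for m×{6, 6, 4}. The bit order within the word (first subcode in the low bits) is an assumption made for this example and may differ from the released implementation; all function names are hypothetical.

```python
def pack_664(a, b, c):
    """Pack two 6-bit subcodes and one 4-bit subcode into one 16-bit word.
    Assumed bit order: a in bits 0-5, b in bits 6-11, c in bits 12-15."""
    assert 0 <= a < 64 and 0 <= b < 64 and 0 <= c < 16
    return a | (b << 6) | (c << 12)

def unpack_664(word):
    """Recover the three subcodes using shifts and masks."""
    return word & 0x3F, (word >> 6) & 0x3F, (word >> 12) & 0xF

def transpose_block(codes):
    """codes[v][grp] is the packed word of group grp for vector v.
    The transposed block stores, for each group, the words of all vectors
    contiguously, so that one SIMD load fetches the same group of many codes."""
    return [list(row) for row in zip(*codes)]

word = pack_664(63, 1, 9)
block = transpose_block([[1, 2], [3, 4], [5, 6]])  # 3 vectors, 2 groups each
print(word, unpack_664(word), block)
```

After transposition, the first row holds the first group of every vector of the block, which is the operand a shuffle needs.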
Quantization of distances
In standard ADC, lookup tables store partial distances as 32-bit floats. As subquantizer precision is key to accuracy, we seek to store partial distances as 8-bit or 16-bit integers, so as to allow lookup tables of the same size to store four times or twice as many values. The representation (8-bit or 16-bit integers) depends on whether the type of shuffle we use operates on bytes (e.g., pshufb, vpermi2b) or words (e.g., vpermw, vpermi2w).
As we are interested only in the top-k nearest neighbors, our distance quantization scheme must represent the smallest distances as precisely as possible, but can ignore (i.e., quantize to ∞) large distances. Thus, to perform distance computations (additions, etc.), we rely on saturated integer arithmetic, which handles ∞ through saturation. The approach is thus similar to that of Quick ADC but provides a tighter distance quantization, as explained hereafter.
First, we use 8-bit and 16-bit unsigned integers, whereas Quick ADC uses only the 7-bit positive range of signed integers: this doubles the precision. However, unsigned saturated arithmetic is more complex to use: values 0 and $2^b - 1$ (i.e., 255 for 8-bit and 65535 for 16-bit) are sticky values, corresponding to $-\infty$ and $\infty$. As a consequence, $x + 0 = 0$ and $x + (2^b - 1) = 2^b - 1$. The lack of a neutral value (i.e., $y$ such that $x + y = x$) makes the implementation more complex. Each partial distance must be encoded between 1 and $q_{max} = 2^b - m - 2$, and the summed distances are thus between $m + 1$ (resulting from the $m + 1$ additions of 1) and $2^b - 2$. In addition, we must use a combination of intrinsics to perform the comparisons [23].
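The upper saturation used to represent ∞ can be modelled in a few lines. This sketch only emulates standard unsigned saturating addition (in the spirit of the paddusb and paddusw instructions) and the bound $q_{max} = 2^b - m - 2$ stated above; it does not model the special treatment of the value 0 described in the text. Function names are hypothetical.

```python
def adds_u(a, b, bits=8):
    """Unsigned saturating addition: the sum is clamped to 2**bits - 1."""
    return min(a + b, (1 << bits) - 1)

def q_max(m, bits=8):
    """Largest quantized partial distance, following the bound 2**bits - m - 2."""
    return (1 << bits) - m - 2

print(adds_u(200, 100))           # saturates at 255
print(adds_u(255, 1))             # 255 is sticky: it stays at 255
print(adds_u(65000, 1000, 16))    # saturates at 65535 with 16-bit distances
print(q_max(16), q_max(16, 16))   # bounds for m = 16 with 8-bit and 16-bit distances
```

Once a running sum reaches the maximum value, it stays there whatever is added next, so a candidate that is too far away never re-enters the comparison.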
Second, we perform a tighter evaluation of the minimum and maximum values than in Quick ADC [2] in order to allow a more precise quantization. For each of the $m$ lookup tables, we evaluate the smallest partial distance $p_{min}(i)$ to represent, which is the smallest partial distance in the $i$-th table. This also gives us the smallest distance to represent, $d_{min} = \sum_{i=0}^{m-1} p_{min}(i)$. Then, we scan init vectors to find a candidate set of $R$ nearest neighbor candidates, where $R$ is the number of nearest neighbors requested by the user (e.g., $R = 100$ when R@100 is evaluated). We use the distance of the query vector to the $R$th nearest neighbor candidate, i.e., the farthest nearest neighbor candidate, as the $d_{max}$ bound. All subsequent candidates will need to be closer to the query vector, thus $d_{max}$ is the maximum distance we need to represent.
We determine the size of each quantization bin:

$$\Delta = \frac{d_{max} - d_{min}}{q_{max}}$$

Each partial distance $p$ in the $i$-th distance table can thus be quantized as

$$q = 1 + \left\lfloor \frac{p - p_{min}(i)}{\Delta} \right\rfloor$$

To unquantize the sum $\Sigma q$, one can use the following equation:

$$d = \left(\Sigma q - m\right) \cdot \Delta + d_{min}$$
Note that similarly to [1, 2] , we learn our distance quantizer at query time rather than at training time [10] : the required evaluation of a few distances has a negligible impact on performance yet allows consistently increased accuracy.
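The quantization scheme of this section can be sketched as follows. The sketch assumes a bin size of $(d_{max} - d_{min})/q_{max}$ and a floor-based rounding, which are our reading of the description above rather than a transcription of the released code; the name `make_quantizer` is hypothetical.

```python
import math

def make_quantizer(p_mins, d_max, bits=8):
    """Build quantize/unquantize functions from the per-table minima and d_max."""
    m = len(p_mins)
    q_max = (1 << bits) - m - 2          # largest quantized partial distance
    d_min = sum(p_mins)                  # smallest distance to represent
    delta = (d_max - d_min) / q_max      # assumed size of a quantization bin

    def quantize(p, i):
        """Quantize partial distance p from the i-th table (clamped to 2**bits - 1)."""
        q = 1 + math.floor((p - p_mins[i]) / delta)
        return min(q, (1 << bits) - 1)

    def unquantize(q_sum):
        """Approximate the original distance from a sum of m quantized values."""
        return (q_sum - m) * delta + d_min

    return quantize, unquantize, delta

# Toy setting: m = 2 tables, 8-bit distances, bounds chosen so that delta = 1.
quantize, unquantize, delta = make_quantizer([1.0, 2.0], d_max=255.0)
q_sum = quantize(11.5, 0) + quantize(7.0, 1)   # true distance: 18.5
print(delta, q_sum, unquantize(q_sum))
```

Because each partial distance is rounded down by less than one bin, the unquantized value underestimates the true distance by less than $m \cdot \Delta$.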
SIMD distance computation
Quicker ADC supports several combinations of sub-quantizers: (i) those operating on 128-bit lanes for 4-bit subquantizers with 8-bit distances (SSE, AVX2 and AVX512), (ii) those operating on 512-bit lanes for 4-, 5- and 6-bit subquantizers with 16-bit distances, and (iii) those operating on 512-bit lanes for 8-bit subquantizers with 8-bit or 16-bit distances, implemented using multiple shuffles and blends as described in Section 3.2 and Figure 1. While implementation details vary, the overall principle is the same.
Once cells are selected and distance tables are computed, as explained in the background, each inverted list is scanned block by block. The distances for the vectors of a block are computed in the following way. We depict the processing applied to each group of components (i.e., each row of Figures 2b and 2c) in Figure 3. First, the subcodes are unpacked using shifts and masks. For each set of subcodes, the partial distances are looked up in the distance table using either a native shuffle, as explained in Section 3.1, or a lookup implemented through a combination of shuffles and blends, as explained in Section 3.2 and depicted in Figure 1. The distances are summed using saturated arithmetic. Note that distance tables may occupy half, one or multiple registers depending on the type of lookup. Figure 3 presents a hypothetical 12×{6, 5, 4} code in which distance tables for 6-bit, 5-bit and 4-bit subquantizers fit in respectively 2, 1 and half a register. This process is repeated for all m/g groups that form the complete code, and partial distances are summed to obtain the distances. The distances are then compared to the worst vector already selected in a binary heap, and vectors for which distances are smaller are extracted and added to the binary heap.

Table 3: Exhaustive search (without index) on SIFT1M. Timings and recalls in bold are plotted graphically. (Columns: PQ m×8, Polysemous, m×{4,4}, m×{6,6,4}, m×{6,6,5}, m×{8,8}.)

Note that the 8-bit in-register shuffle (pshufb) in AVX2 or AVX-512 BW operates concurrently on two or four independent 128-bit lanes. Hence, it cannot be used to implement 32-element or 64-element lookup tables. Yet, it increases the throughput by processing more elements per cycle (see Table 1). We use this property in Quicker ADC (m×{4, 4}) by processing groups (i.e., rows) 2 by 2 (AVX2) or 4 by 4 (AVX512) rather than 1 by 1 (SSE). The number of iterations in the computation is thus reduced from m/g to m/g/2 (AVX2) or m/g/4 (AVX-512).
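The final selection step, in which computed distances are compared to the worst candidate kept in a binary heap, can be sketched with Python's `heapq` module. As `heapq` implements a min-heap, distances are negated to keep the worst candidate at the root; the function name `scan_top_r` is hypothetical and the sketch processes distances one by one rather than by SIMD blocks.

```python
import heapq

def scan_top_r(distances, r):
    """Return the r smallest (distance, vector id) pairs, sorted by distance."""
    heap = []  # max-heap on distance, stored as (-distance, vector id)
    for vec_id, d in enumerate(distances):
        if len(heap) < r:
            heapq.heappush(heap, (-d, vec_id))
        elif d < -heap[0][0]:                      # closer than the current worst
            heapq.heapreplace(heap, (-d, vec_id))  # evict the worst, insert the new one
    return sorted((-neg_d, vec_id) for neg_d, vec_id in heap)

print(scan_top_r([5, 1, 4, 2, 3], r=2))
```

Most candidates fail the comparison against the current worst distance, so the heap is updated rarely and the cost of the scan is dominated by the distance computation itself.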
EVALUATION
We implemented Quicker ADC in C++ (4K lines of code) and release it as open source 5 . The implementation contains numerous variants: m×{4, 4}, m×{6, 6, 4}, m×{6, 5, 5}, m×{5, 5, 5}, m×{8, 8}, m×{8} 6 . Note that, to allow further experimentation, the released code is highly generic thanks to templates, so that adding an m×{6, 6, 2} variant operating on signed arithmetic requires a single line of code. Training, exhaustive search and non-exhaustive search (IVF, multi-indexes) rely on the implementation of FAISS.
We carry out our experiments on Skylake-based servers (AWS m5 instances) built around the Intel Xeon Platinum 8175 (2.5 GHz, supporting AVX512), with default settings from Amazon. We use the g++ compiler version 7.3 with option -O3 and enable SSE, AVX, AVX2 and AVX512. For BLAS, we use Intel MKL 2018.
Our evaluation relies on the publicly available datasets SIFT1M [19] and SIFT1000M [20] of 128-dimensional SIFT vectors, and Deep1B [7] of 96-dimensional deep features. The first is a dataset of 1 million vectors that we use for the evaluation of exhaustive search; the latter two are datasets of 1 billion vectors that we use for the evaluation of non-exhaustive search (i.e., with an index). Our metrics are Recall@1 (R@1), which is the fraction of queries for which the true nearest neighbor is the one returned by the search, and Recall@100 (R@100), which is the fraction of queries for which the true nearest neighbor is among the top 100 results returned by the search. R@100 reflects the performance for visual search applications where the user is presented with a collection of images rather than a single image (e.g., Google Image). To evaluate the computational efficiency, we report the average time per query. Note that a query time of 0.5 ms translates to a throughput of 2 000 queries/second when queries are processed sequentially.
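As an illustration of the metrics defined above, the sketch below computes Recall@R from ranked result lists. It is not taken from the released code: the function name `recall_at` and the input layout (one ranked list of vector ids per query, plus the id of the true nearest neighbor of each query) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of queries whose true nearest neighbor appears among the
// first R results. results[q] is the ranked list returned for query q,
// true_neighbor[q] is the id of the exact nearest neighbor of query q.
double recall_at(const std::vector<std::vector<std::size_t>>& results,
                 const std::vector<std::size_t>& true_neighbor,
                 std::size_t R) {
    assert(results.size() == true_neighbor.size());
    assert(!results.empty());
    std::size_t hits = 0;
    for (std::size_t q = 0; q < results.size(); ++q) {
        const auto& ranked = results[q];
        // Only the first R entries count (or fewer if the list is shorter).
        const auto last = ranked.begin() +
            static_cast<std::ptrdiff_t>(std::min(R, ranked.size()));
        if (std::find(ranked.begin(), last, true_neighbor[q]) != last) {
            ++hits;
        }
    }
    return static_cast<double>(hits) / static_cast<double>(results.size());
}
```

With R = 1 this yields R@1 and with R = 100 it yields R@100; by construction, R@100 is never lower than R@1 on the same result lists.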
Exhaustive search
We first focus on evaluating the performance of Quicker ADC in isolation. We thus consider exhaustive search in the SIFT1M dataset. We do not use an inverted index and we encode the original vectors, not residuals, into short codes.
We scan init = 400 vectors to set the qmax bound for the quantization of lookup tables (Section 3.4). Going beyond 400 vectors can marginally improve accuracy by reducing the distance quantization error but also increases the proportion of codes scanned slowly. Table 3 gives results for the baseline product quantization implementation from FAISS [21], for polysemous codes [11] and for Quicker ADC. We also include a specific operating point where polysemous codes degenerate into binary codes (τ = 0).
We compare the SIMD implementations with quantized distances to sequential implementations with floating-point distances. This allows us to quantify the loss of recall resulting from distance quantization. Recall is not impacted by the quantization of distances; thus, scanning 400 vectors is enough for estimating the distance quantization bounds. Our sequential implementation tends to be slower than the original PQ ADC because it is not as specialized and it systematically uses shifting and masking for accessing subcodes.
Quicker ADC m×{6, 6, 4} is as fast as m×{4, 4} or m×{5, 5, 5} yet improves recall for both 64-bit and 128-bit codes. Its recall R@1 (0.179 and 0.313) is lower than that of PQ and polysemous codes (0.225 and 0.447), but the recall R@100 of Quicker ADC is equivalent to, if not better than, that of polysemous codes. Yet, this slight recall decrease allows Quicker ADC m×{6, 6, 4} to be 10 times faster than PQ ADC and 2-3 times faster than polysemous codes.
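To make the role of the qmax bound more tangible, here is a minimal sketch of one plausible reading of it. The exact scheme is the one given in Section 3.4 and may differ. This sketch assumes that qmax is the k-th smallest floating-point distance seen during the warm-up scan of init vectors, and that values are mapped uniformly from [qmin, qmax] onto 8 bits. The names `qmax_from_warmup` and `quantize_distance` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: qmax is the k-th smallest distance among the warm-up vectors,
// i.e., the distance of the worst candidate kept after the warm-up scan.
float qmax_from_warmup(std::vector<float> warmup_distances, std::size_t k) {
    assert(k >= 1 && k <= warmup_distances.size());
    const auto kth = warmup_distances.begin() + static_cast<std::ptrdiff_t>(k - 1);
    std::nth_element(warmup_distances.begin(), kth, warmup_distances.end());
    return *kth;
}

// Assumption: uniform mapping of [qmin, qmax] onto 0..255, with truncation.
// Values at or beyond qmax saturate at 255: such vectors are farther than
// the current worst candidate, so their exact distance is not needed.
std::uint8_t quantize_distance(float d, float qmin, float qmax) {
    assert(qmax > qmin);
    const float t = (d - qmin) / (qmax - qmin);
    const float clamped = std::min(1.0f, std::max(0.0f, t));
    return static_cast<std::uint8_t>(clamped * 255.0f);
}
```

Under these assumptions, a tighter qmax spends the 256 available levels on a narrower range of distances, which is consistent with the remark that scanning more warm-up vectors reduces the quantization error.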
Quicker ADC m×{8, 8} codes are 5x faster than regular PQ codes, and slightly faster than polysemous codes while achieving similar recall R@1 and improved recall R@100. Note that timings reported here are using vpermi2w from AVX512 BW and are likely to be up to 4x better with the availability of AVX512 VBMI without significant degradation of accuracy. Thus, m×{8} codes with Quicker ADC could prove extremely interesting due to their recall being very close to the original PQ with significantly improved performance.
Quicker ADC is faster than a binary code (polysemous with τ = 0) as the shuffle instructions used in Quicker ADC are faster than popcount. This opens perspectives for using PQ with Symmetric Distance Computation with a SIMD implementation similar to that of Quicker ADC in order to build a very fast code that could be used in pruning applications (e.g., in place of the Hamming code of polysemous codes). Bounding a PQ m×8 with a PQ m×4 code is one of the two mechanisms used in [1]; it could be used in isolation as explained in Derived Quantizers [13], in a way similar to the Hamming code of polysemous codes [11].
Non-exhaustive search
Every single index and product quantizer combination can operate in numerous configurations by varying the parameters (i.e., How many cells to explore? How many distances to evaluate? What Hamming threshold to use? How to estimate distance quantizers?). As a result, the performance of each combination is not a single point but a curve of the optimal tradeoffs between query time (in ms) and recall (R@1 or R@100). Hence, to compare the various combinations of indexes and product quantizers, we plot the best recall achieved for a given query time budget. We are particularly interested in very short query times (less than 0.5 ms).
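The curve of optimal tradeoffs can be derived from the measured configurations as sketched below. This is an illustration under assumptions, not the plotting code used for the figures: `OperatingPoint`, `tradeoff_curve` and `best_recall_within` are hypothetical names, and each measured configuration is assumed to be summarized by its average query time and its recall.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One measured configuration of an index / product quantizer combination.
struct OperatingPoint {
    double time_ms;  // average query time
    double recall;   // R@1 or R@100
};

// Keeps only the points that improve recall over every faster point,
// i.e., the curve of best recall as a function of query time.
std::vector<OperatingPoint> tradeoff_curve(std::vector<OperatingPoint> pts) {
    std::sort(pts.begin(), pts.end(),
              [](const OperatingPoint& a, const OperatingPoint& b) {
                  if (a.time_ms != b.time_ms) return a.time_ms < b.time_ms;
                  return a.recall > b.recall;  // best recall first on ties
              });
    std::vector<OperatingPoint> curve;
    double best = -1.0;
    for (const auto& p : pts) {
        if (p.recall > best) {
            curve.push_back(p);
            best = p.recall;
        }
    }
    return curve;
}

// Best recall achievable within a query time budget (0 if no point fits).
double best_recall_within(const std::vector<OperatingPoint>& pts, double budget_ms) {
    assert(budget_ms >= 0.0);
    double best = 0.0;
    for (const auto& p : pts) {
        if (p.time_ms <= budget_ms) best = std::max(best, p.recall);
    }
    return best;
}
```

Because the set of configurations is discrete, the resulting curve is a step function of the time budget.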
We consider three types of indexes combined with either 64-bit codes or 128-bit codes. The first index considered is a relatively coarse index (IVF, K = 65536) [19], the second is a very fine-grained index (inverted multi-index, IMI, K = 4096^2) [3], and the last one is a fine-grained index leveraging a neighborhood graph (IVF HNSW, K = 2^18) [26]. The latter two are considered state-of-the-art and are widely used; they achieve similar performance, and the more recent IVF HNSW produces fewer inverted lists, which are also more balanced. We report results for the SIFT1000M dataset [20] in Figure 4 and for the Deep1B dataset [7] in Figure 5. Note that gaps in the curves (quite visible for 128-bit PQ and polysemous codes with IVF HNSW) are related to the fact that parametrizations are discrete (i.e., exploring 1, 2, 3, ... inverted lists).
For all indexes, both code sizes (64-bit and 128-bit) and both datasets (SIFT1000M and Deep1B), the top performer is one of Quicker ADC's implementations, with a single exception (SIFT1000M, IMI, 128-bit codes for the metric R@1, with a query time budget > 0.2 ms). Indeed, as observed in Section 4.2, polysemous codes tend to perform better on R@1; they also avoid distance table computations, which are numerous in IMI. The domination of Quicker ADC is particularly salient for low query time budgets, where the recall gain of Quicker ADC over polysemous codes can be 50% (SIFT1000M, IMI, 128-bit, 0.2 ms) or 100% (SIFT1000M, IVF HNSW, 128-bit, 0.2 ms). Quicker ADC is also particularly efficient for the metric R@100, or for Deep1B vectors. For example, on Deep1B with an IMI 2x12 and a query time budget of 0.25 ms, Quicker ADC 32×{4, 4} achieves an R@100 of 0.55, while polysemous codes achieve an R@100 of 0.33 or would require a query time of 0.5 ms to achieve the same R@100, and PQ 16×8 requires more than 1 ms to achieve an R@100 of 0.55. Quicker ADC becomes particularly interesting when combined with IVF HNSW: for example, 32×{4, 4} achieves an R@100 of 0.55 in less than 0.16 ms, faster than alternatives on IMI. Indeed, IMI is a worst case for Quicker ADC as these indexes tend to have numerous short lists, and Quicker ADC requires (i) pre-computing distance tables, and (ii) processing batches of vectors. The benefits are higher for indexes that have longer inverted lists (IVF, IVF HNSW).
Note that for index types other than IMI, polysemous codes allow additional operating points over PQ (lower query time at the expense of recall) but do not improve performance for the larger query times. This means that while the reduced cost per code of polysemous codes allows computing distances for more codes or more inverted lists, an equally effective alternative is to not use polysemous codes and to scan fewer codes or fewer inverted lists. The fact that they are more beneficial in the context of IMI can be explained by the fact that they allow avoiding distance table computations, which are more numerous for such fine-grained indexes. Interestingly, their benefits are limited on IVF HNSW.
Summary of experimental study
Our experiments show that Quicker ADC offers interesting operating points for both exhaustive search and non-exhaustive search, even with fine-grained indexes. For exhaustive search, new variants relying on AVX512 are the fastest, with m×{8, 8} offering good recall for both R@1 and R@100 with significantly reduced query times. For non-exhaustive search, Quicker ADC m×{4, 4} or m×{6, 6, 4} outperforms other solutions, including polysemous codes, when fast queries are desired. As the time budget for queries increases, or when R@1 is the metric of interest, Quicker ADC m×{8, 8} is among the top performers. Hence, Quicker ADC m×{8}, whose ADC procedure could be up to 4x faster (see Table 2), is likely to be well positioned when processors supporting the required instructions become available.
RELATED WORK
SIMD evaluation of distances. PQ Fast Scan [1] pioneered the use of SIMD for ADC distance evaluation. It inspired later work [2, 10] that is more suitable for indexed databases. These later propositions have been shown to be particularly efficient, yet their evaluation remains limited to no index or coarse indexes and thus lacks results for fine-grained indexes (e.g., Inverted Multi-Index [5]). By implementing Quicker ADC into FAISS, we have been able to evaluate Quicker ADC with both IMI and IVF HNSW. Also, as we release the implementation, it will be possible to evaluate Quicker ADC for future index designs. In addition, these works were limited to PQ m×4 codes, while we have proposed numerous other variants for a better usage of AVX512-compatible processors.
Indexes. Inverted Multi-Indexes [5] or graph-based inverted indexes [9, 26] provide a finer partition than vanilla inverted indexes. The design of indexes is rather independent from the design of the product quantizer, and thus Quicker ADC can be combined with any inverted index. We evaluate Quicker ADC with several types of indexes and show that it works well with IMI, which is at the extreme of the spectrum in terms of cell sizes, and that it works equally well, contrary to polysemous codes, with more recent propositions such as IVF HNSW. We expect Quicker ADC to be relatively independent of index choices and to remain highly efficient with newer indexes.
Optimized and Compositional Quantization Models. Cartesian k-means (CKM) [28] and Optimized Product Quantizers (OPQ) [14] optimize the sub-space decomposition by performing an arbitrary rotation and permutation of vector components. This allows for improved accuracy at a moderate cost (i.e., a matrix multiplication to perform the rotation). These are compatible with Quicker ADC as the ADC procedure remains unchanged. The source code we release includes FAISS's implementation of OPQ. We focused our evaluation on the combination of Quicker ADC with regular PQ with various indexes. In addition, compositional vector quantization models inspired by PQ have been proposed. These models offer a lower quantization error than PQ or OPQ. Among these models are Additive Quantization (AQ) [4], Tree Quantization (TQ) [6] and Composite Quantization (CQ) [33]. These models also use cache-resident lookup tables to compute distances; therefore, Quicker ADC may be combined with them. However, this may require additional work as some of these models use more lookup tables than the ADC procedure of PQ or OPQ.
Deep-Learning-based quantizers. Subic [17], DPQ [22] and [32] use deep neural networks to compute a compact vector representing images. Similarly to product quantization, the compact vector has a product structure, which is exploited to compute distances by summing the contributions of sub-vectors. The distances to each sub-vector are thus stored in lookup tables and an ADC-like procedure is used. Hence, Quicker ADC naturally extends to these quantizers, and can bring similar benefits. Quicker ADC could be particularly interesting if these quantizers can be adapted to accommodate small quantizers (4 bits) or irregular ones (4, 5 and 6 bits combined).
Encodings based on neighborhood graphs. Some propositions [8, 12] leverage the nearest neighbor graph to lower the encoding error. This improves recall but tends to operate with a higher memory budget [12] and thus does not compare directly. Our work targets operating points strictly identical to those of IVF or inverted multi-indexes combined with Product Quantization [5, 19] and Polysemous Codes [11]. Yet, as these propositions [8, 12] rely on lookup tables for distance computation, they could leverage some principles from Quicker ADC to speed up distance computation.
CONCLUSION
In this paper, we presented Quicker ADC, a novel distance computation method for product-quantization-based ANN search. Quicker ADC improves over the previous proposition [2] by (i) supporting additional quantizers (e.g., m×{6, 6, 4}, m×{8, 8}, m×{8}) and (ii) having an improved implementation integrated into FAISS and compatible with various indexes (IMI, IVF HNSW). Through an extensive evaluation, we have shown that Quicker ADC outperforms schemes based on PQ or polysemous codes for both exhaustive and non-exhaustive (i.e., index-based) search, and that it combines well with the latest indexes such as IVF based on HNSW [26]. Finally, we release the implementation as open source to allow a wider adoption and evaluation of this approach.
Techniques presented in this paper focus on the efficient evaluation of distances in the compressed domain. This problem is present in all quantization-based approaches, which rely on lookup tables to speed up computation. Thus, the principles behind our work (i.e., replacing memory accesses by shuffles and quantizing distance tables) can be the basis for an improved implementation of other approaches [8, 12, 17]. This is of particular interest if those new approaches behave better with coarser (i.e., 4-bit) or irregular quantizers (i.e., combined 6-bit and 5-bit quantizers). Also, Quicker ADC brings product quantizer codes on par with binary codes regarding distance evaluation speed. Thus, Quicker ADC could inspire new designs where filtering with a binary code (e.g., Polysemous codes) is replaced by filtering with a lower precision product quantizer (e.g., m×{4, 4} with symmetric distance computation used to filter vectors before ADC computation on PQ m×8, as mentioned in [1, 13]).
Finally, we would like to stress that upcoming processors will have improved SIMD capabilities, allowing for a bright future for Quicker ADC. The Cannonlake processors expected in 2019 will have support for 7-bit shuffles, thus quadrupling shuffle throughput as m×{8} codes replace m×{8, 8}, and Sunny Cove processors expected in 2020 will, in addition, double shuffle throughput by having two shuffle units per core instead of one. Hence, Quicker ADC's performance will significantly improve in the near future just from hardware upgrades, without any algorithm adaptation.
