Due to the emergence of new network applications, current IP lookup engines must support high bandwidth, low lookup latency, and the ongoing growth of IPv6 networks. However, the existing solutions are not designed to jointly address these three requirements. This paper introduces SHIP, an IPv6 lookup algorithm that exploits prefix characteristics to build a data structure designed to meet future application requirements. Based on the prefix length distribution and prefix density, prefixes are first clustered into groups sharing similar characteristics and then encoded in hybrid trie-trees. The resulting memory-efficient and scalable data structure can be stored in low-latency memories and allows the traversal process to be parallelized and pipelined in order to support high packet bandwidth in hardware. In addition, SHIP supports incremental updates. Evaluated on real and synthetic IPv6 prefix tables, SHIP has a logarithmic scaling factor in terms of the number of memory accesses and a linear memory consumption scaling. Compared with other well-known approaches, SHIP reduces the required amount of memory per prefix by 87%. When implemented on a state-of-the-art field-programmable gate array (FPGA), the proposed architecture can support processing 588 million packets per second.
which represents the number of valid prefix bits. While a destination IP address may match multiple entries in the FIB, only the NHI associated with the longest prefix matched is returned [3] .
IP lookup algorithms and architectures tailored for IPv4 technology do not perform well with IPv6 [2], [4], due to the fourfold increase in the number of bits in IPv6 addresses over IPv4. Thus, dedicated IPv6 lookup methods are needed to support upcoming IPv6 traffic.
IP lookup engines must be optimized for high bandwidth, low latency, and scalability for two reasons. First, due to the convergence of wired and mobile networks, many future applications require high bandwidth and low latency at the same time, such as virtual reality, remote object manipulation, eHealth, and autonomous driving [5]. Second, IPv6 FIBs are expected to grow as IPv6 technology is still being deployed [6], [7]. However, current solutions presented in the literature do not jointly address these performance requirements.
In this paper, we introduce SHIP: a Scalable and High Performance IPv6 lookup algorithm designed to meet current and future performance requirements. SHIP is built around the analysis of prefix characteristics. Three main contributions are presented:
• Two-level prefix grouping that clusters IPv6 prefixes in groups sharing common characteristics. Prefixes are divided into bins based on their 23 most significant bits, while the bins are stored in a hash table. Within each bin, prefixes are sorted in groups, based on the FIB prefix length distribution.
• A hybrid trie-tree (HTT), a shallow and memory-efficient data structure. Each group of prefixes is encoded in an HTT. The HTT revisits the concept of the multi-bit trie, but adapts the number of nodes built at each level to the prefix density. At the bottom of the trie, multiple leaves are encoded into a leaf bucket.
• A high-throughput implementation of SHIP targeting Field Programmable Gate Arrays (FPGAs). The implementation is based on an architecture that performs a pipelined traversal of the SHIP data structure, and leverages on-chip memory to support a high lookup rate, while balancing the lookup latency.
In reported characterization results, SHIP stores 580 k prefixes and the associated NHI using less than 5.2 MB of memory. Evaluated on multiple benchmarks, SHIP achieves a linear memory consumption scaling and a logarithmic latency scaling. SHIP also supports incremental updates. In addition, when implemented on FPGA, the architecture can support a high throughput of 588 million packets per second.
The remainder of this paper is organized as follows. Section II introduces common approaches used for IP lookup. Section III gives an overview of SHIP. Then, two-level prefix grouping is presented in Section IV. The proposed hybrid trie-tree is covered in Section V. Section VI presents a method to update the SHIP data structure and its cost. Section VII introduces the method and metrics used for performance evaluation and Section VIII presents simulation results, while Section IX presents FPGA implementation results. Section X compares SHIP performance with other methods. Lastly, we conclude the work by summarizing our main findings and results in Section XI.
II. RELATED WORK
Traditionally, IP lookup is implemented using TCAMs, specialized hardware with O(1) lookup time complexity. A TCAM is an associative memory that matches a key simultaneously against all prefixes. However, TCAMs combine high power consumption with high cost, making them unattractive for routers holding a large number of prefixes [8], [9]. TCAM implementations on FPGAs also yield relatively poor performance [10].
As a result, algorithmic solutions have been explored as an alternative to TCAMs. Algorithmic solutions emulate the TCAM functionality using data structures. Four main types of data structures are used by algorithmic solutions: hash tables, Bloom filters, tries and trees. The challenge is to efficiently encode prefixes, which are loosely structured and have a highly nonuniform prefix length distribution. In addition, for any given prefix length, prefix density ranges from sparse to very dense, where the prefix density represents the sparseness of non-empty nodes within a level of a trie.
The interest in using a hash table is twofold. First, a hash function aims at uniformly distributing a large number of keys over bins, independently of the key structure. Second, a hash table provides O(1) lookup time on average and O(N) space complexity, where N is the number of prefixes. However, a pure hash-based LPM solution has a lookup time complexity of O(W), where W is the number of distinct prefix lengths, as one hash table is probed per prefix length. Bando et al. [11] proposed to expand prefixes to a few prefix lengths, at the cost of increased memory consumption. Waldvogel et al. [12] proposed a binary search on prefix lengths to reduce the lookup time complexity to O(log(W)) with a space complexity of O(N · log(W)). Still, a hash function can generate collisions that degrade performance. To reduce the number of collisions, a method that exploits multiple hash tables [11], [13] was proposed. This method divides the prefix table into groups of prefixes, and selects a hash function to minimize the number of collisions within each prefix group [11], [13]. The solution presented by Zhou and Prasanna [14] leverages perfect hash functions, which are guaranteed to be collision-free. Although their solution achieves a very high throughput when implemented on GPU, a large memory space is required and no incremental update support is provided.
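The O(W) cost of a pure hash-based LPM can be made concrete with a minimal sketch. It is not taken from any of the cited works: Python dictionaries stand in for the per-length hash tables, and the names build_tables and lookup are hypothetical. One table is probed per distinct prefix length, from the longest to the shortest.

```python
import ipaddress

def build_tables(routes):
    """Build one dictionary per prefix length: {length: {prefix_bits: nhi}}."""
    tables = {}
    for prefix, nhi in routes:
        net = ipaddress.IPv6Network(prefix)
        # Keep only the valid (most significant) prefix bits as the key.
        key = int(net.network_address) >> (128 - net.prefixlen)
        tables.setdefault(net.prefixlen, {})[key] = nhi
    return tables

def lookup(tables, address):
    """Probe one table per distinct prefix length, longest first: O(W) probes."""
    value = int(ipaddress.IPv6Address(address))
    for length in sorted(tables, reverse=True):
        nhi = tables[length].get(value >> (128 - length))
        if nhi is not None:
            return nhi  # first hit is the longest matching prefix
    return None

# Documentation prefixes used as illustrative data only.
tables = build_tables([("2001:db8::/32", "A"), ("2001:db8:1::/48", "B")])
```

With these two routes, an address under 2001:db8:1::/48 is resolved by the /48 table, while any other address under 2001:db8::/32 needs a second probe.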
Bloom filters, space-efficient probabilistic data structures, have also been used in the literature to select a set of prefixes that may match an IP address [15]-[18]. Recent studies have shown that Bloom filters can significantly reduce the average lookup time [15], [17], [18]. However, by design, this data structure generates false positives independent of the configuration parameters used. Thus, a Bloom filter can lead to poor performance in the worst case, when many prefix sets are selected.
B-trees, generalized self-balancing binary search trees, have also been explored [2], [4], [19]-[21]. Such data structures are tailored to store loosely structured data such as prefixes, as their time complexity of O(log(N)) and storage complexity of O(N) are independent from the prefix distribution characteristics. However, B-trees can only be used with non-overlapped prefixes, i.e., prefixes that are pairwise disjoint. Because converting prefixes into non-overlapped prefixes increases the number of prefixes, previous works focused on minimizing the prefix growth.
Hoang et al. divide a FIB into disjoint prefix groups using dynamic programming [2], and build a B-tree per group. The FPGA implementation requires several external memories due to a low memory efficiency, which increases lookup latency. Chang et al. [21] proposed LayeredTrees, where the FIB is peeled iteratively into layers of disjoint prefixes. However, the throughput supported by their architecture is similar to previous works [2]. A variation of a balanced binary search tree was introduced by Zec et al. [19] and Zec and Mikuc [20]. Their solution is tailored for IPv4, and leads to performance degradation when adapted to IPv6 [22].
k-bit trie data structures are attractive because k bits of the IP address are compared at a time, and thus, they have an O(W/k) time complexity, where W is the IP address size. However, the low time complexity of k-bit tries comes at a large storage complexity of O(2^k · N · W/k), which leads to very low memory efficiency when the trie is built with unevenly distributed prefixes [23], [24]. As a result, significant efforts were dedicated to reducing the memory footprint.
The level compression trie (LC-trie) [25] technique was proposed to combine multi-bit trie for dense regions with a level compressed binary trie for sparser regions [26] . Eatherton et al. proposed the tree bitmap structure, where k levels of a 1-bit trie are encoded into a bitmap [27] . The PC-trie, an improved tree bitmap, was proposed in the FlashTrie architecture [9] . However, the FlashTrie architecture requires multiple external memories, leading to a high lookup latency.
Bitmap solutions were also explored by Gaogang et al., who proposed the splitting approach to IP lookup (SAIL) [28]. SAIL decomposes the lookup into the identification of the prefix length, and then, the identification of the NHI. However, the SAIL performance was only evaluated with the data structure used for the prefix length identification. Asai and Ohara presented Poptrie [22], a CPU-optimized implementation of a k-bit trie. Poptrie encodes each node of a k-bit trie using a bitmap, and merges leaves covering the same prefix. Because Poptrie is tailored for IPv4 prefixes, the performance is reduced when using IPv6 prefixes.
Another solution proposed by Luo et al. [29] combines TCAM and k-bit trie to reduce the number of entries stored in the TCAM. Still, this solution is unattractive as most of the prefixes are stored in the TCAM.
GAMT [30] leverages the large amount of memory available on GPUs to implement a k-bit trie. This GPU-based solution supports a very high lookup rate, but suffers from a very high latency. Unoptimized k-bit tries were also evaluated on CPU, but were shown to provide a limited lookup rate [28], [31] even when low-level CPU optimizations were used [32].
Recently, new approaches exploiting information-theoretic and compressed data structures were proposed to compress a trie, yielding very compact data structures [33] , but only achieving a low lookup rate.
Multiple LPM solutions were implemented in software, as the high operating frequencies of CPUs or GPUs can be leveraged to support very high lookup rates. However, from a system point of view, the lookup rate of software LPM solutions is limited by the rate supported by the packet I/O framework, such as netmap [34] or DPDK [35]. Indeed, these I/O frameworks cannot forward packets received from the network interface card (NIC) to the main memory at very high rates [31], [32], [34], [36]. In addition, most solutions were evaluated with packets already stored in the memory [14], [19], [20], [22], [28], [30]. Lastly, software LPM solutions typically use batch processing techniques to increase the bandwidth at the expense of latency [14], [30], [32], [34], [36].
In summary, the solutions optimized for software implementation were not shown to support high lookup rates in complete systems. In addition, the solutions targeting hardware implementation generally suffer from a low throughput, or from low memory efficiency.
III. SHIP OVERVIEW
SHIP comprises a procedure to build an efficient data structure, and a procedure to traverse it, namely the lookup algorithm.
SHIP combines prefix clustering methods with memory-efficient data structures. A clustering method, two-level prefix grouping, divides prefixes into address block bins (ABB) based on their 23 most significant bits (MSBs), and further divides prefixes into prefix length sorted (PLS) groups based on the FIB prefix length distribution. The address block bins are recorded in a hash table, while the prefixes in each prefix length sorted group are encoded in an HTT.
The SHIP data structure is illustrated in Fig. 1. An N-entry hash table holds M valid address block bins (ABBs) obtained after dividing the prefixes on their MSBs. Each valid ABB holds a pointer to a set of K prefix length sorted (PLS) groups, encoded in HTTs.
The lookup algorithm identifies the NHI associated with the longest prefix matched. First, the MSBs of the destination IP address are hashed to select an ABB pointer. In this example, the m-th ABB is selected in the hash table. The selected ABB points to a set of K HTTs represented by the dashed rectangle. Second, using the least significant bits of the destination IP address, the selected HTTs are traversed in parallel. Because each HTT can hold a prefix matching the IP address, a priority resolution module is used to select the NHI associated with the longest prefix.
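The two lookup steps can be sketched in software as follows. This is only an illustration under simplifying assumptions: a Python dictionary stands in for the hash table, each HTT is replaced by a plain list of (prefix value, prefix length, NHI) tuples scanned linearly, and the K groups are visited sequentially rather than in parallel. The names abb_key, match_group and ship_lookup are hypothetical.

```python
import ipaddress

ABB_BITS = 23  # number of MSBs used to select an address block bin

def abb_key(value):
    """Extract the 23 MSBs of a 128-bit address value."""
    return value >> (128 - ABB_BITS)

def match_group(group, value):
    """Stand-in for one HTT traversal: longest match inside one PLS group."""
    best = None
    for prefix_value, length, nhi in group:
        shift = 128 - length
        if value >> shift == prefix_value >> shift:
            if best is None or length > best[0]:
                best = (length, nhi)
    return best

def ship_lookup(abb_table, address):
    """Select an ABB from the MSBs, query its K groups, keep the longest match."""
    value = int(ipaddress.IPv6Address(address))
    groups = abb_table.get(abb_key(value))
    if groups is None:
        return None
    candidates = [m for m in (match_group(g, value) for g in groups) if m]
    if not candidates:
        return None
    # Priority resolution: the longest matching prefix wins.
    return max(candidates, key=lambda c: c[0])[1]

# Illustrative table with K = 3 groups under a single ABB.
p32 = int(ipaddress.IPv6Address("2001:db8::"))
p48 = int(ipaddress.IPv6Address("2001:db8:1::"))
abb_table = {abb_key(p32): [[(p32, 32, "A")], [(p48, 48, "B")], []]}
```

In this toy table, the /32 and /48 prefixes fall in different groups of the same bin, so both groups report a match for an address under the /48 and the priority resolution keeps the longer one.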
IV. TWO-LEVEL PREFIX GROUPING
Two-level prefix grouping is a clustering method that divides prefixes in ABBs, and further sorts prefixes into prefix length groups. This clustering method is proposed to divide a prefix table in groups, each holding a fraction of the prefixes. The prefix distribution within groups is then leveraged by the HTT, as will be shown in Section VIII.
A. Address Block Binning
Prefixes are binned in ABBs based on their 23 most significant bits (MSBs), and ABBs are recorded in a hash table.
The motivation to bin prefixes on 23 MSBs relates to the known IPv6 address space allocation. IPv6 prefixes are allocated from a pool of prefix blocks managed by the Internet Assigned Numbers Authority (IANA) ranging from /12 to /23 [37] . Because few prefix blocks are assigned [37] , when prefixes are binned on their 23 MSBs, a small number of bins are created. As a consequence, the number of prefixes in bins is reduced by up to two orders of magnitude over the original prefix table size (see Section VIII-A).
The ABB method leverages a perfect hash table [38] to store the bin values for two reasons. First, the bin values are almost static because they represent address spaces allocated to regional internet registries that are unlikely to be updated on a short time scale. Second, perfect hash functions guarantee an O(1) time complexity as no collisions are generated.
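As an illustration of address block binning, the sketch below groups IPv6 prefixes by their 23 MSBs. It relies on assumptions not made in the text: an ordinary dictionary replaces the perfect hash table, prefixes shorter than /23 are simply skipped, and the function name bin_prefixes is hypothetical.

```python
import ipaddress
from collections import defaultdict

ABB_BITS = 23  # prefixes are binned on their 23 most significant bits

def bin_prefixes(prefixes):
    """Group IPv6 prefixes into bins keyed by their 23 MSBs."""
    bins = defaultdict(list)
    for text in prefixes:
        net = ipaddress.IPv6Network(text)
        if net.prefixlen < ABB_BITS:
            # Assumption of this sketch: shorter prefixes are left out.
            continue
        key = int(net.network_address) >> (128 - ABB_BITS)
        bins[key].append(net)
    return dict(bins)

# Illustrative prefixes only; the first two share their 23 MSBs.
bins = bin_prefixes(["2001:db8::/32", "2001:db8:1::/48", "2a00:1450::/32"])
```

Here the three prefixes end up in two bins, which mirrors how binning reduces the number of prefixes that each downstream structure has to encode.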
While the idea of using a hash table or direct index table to do a lookup on the MSB is not new for IPv4 [12] , [19] , [22] , [24] , the ABB method is optimized for IPv6 and differs from previous works by leveraging the known allocation of the IPv6 address space.
Because prefixes associated to an ABB can overlap, the prefix length sorting method is introduced to reduce the number of overlapping prefixes.
B. Prefix Length Sorting
PLS divides the prefixes associated with an ABB into groups based on the IPv6 prefix length distribution. This method sorts prefixes with length /24 to /64 into K groups. Each group covers a contiguous range of prefix lengths.
The prefix length range covered by each group is selected based on two principles. First, when a prefix length accounts for a large percentage of the total number of prefixes, the prefix length is used as an upper bound of the considered group. Second, prefix length ranges must be chosen such that the K groups are as balanced as possible in terms of the number of prefixes.
To illustrate those two principles, an analysis of prefix length distribution using a real prefix table [7] is presented in Fig. 2 . The first 23 prefix lengths are omitted in Fig. 2 , as the ABB method already bins prefixes based on their 23 MSBs. It can be observed in Fig. 2 that the prefix lengths with the largest cardinality are /32 and /48 for this example. Applying the two principles of prefix length sorting to this example, the first group covers prefix lengths from /24 to /32, and the second group covers the second peak, from /33 to /48. Finally, all remaining prefix lengths, from /49 to /64 are left in the third prefix length sorting group.
The first principle aims at minimizing the number of prefix overlaps inside a prefix group, by isolating a large number of prefixes from longer prefixes that can overlap. The second principle aims at balancing as much as possible the number of prefix overlaps between PLS groups, in order to obtain HTTs with relatively similar characteristics.
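For the three-group example above, mapping a prefix length to its PLS group reduces to a search over the group upper bounds. The sketch below hard-codes the bounds /32, /48 and /64 from that example; the function name pls_group is hypothetical, and other prefix tables would call for other bounds.

```python
from bisect import bisect_left

def pls_group(prefix_length, upper_bounds=(32, 48, 64)):
    """Return the index of the PLS group covering a prefix length.

    Group i covers lengths above upper_bounds[i - 1] (or from /24 for the
    first group) up to and including upper_bounds[i].
    """
    if not 24 <= prefix_length <= upper_bounds[-1]:
        raise ValueError("prefix length outside the range handled by PLS")
    return bisect_left(upper_bounds, prefix_length)
```

With these bounds, /32 is the last length of the first group and /33 is the first length of the second group, so the large /32 population is kept apart from the longer prefixes that could overlap it.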
The prefix distribution within groups after applying the two-level prefix grouping method is presented in Section VIII-A.
V. HYBRID TRIE-TREE DATA STRUCTURE
The proposed HTT encodes the prefixes held in each nonempty PLS group. The HTT data structure is tailored to adapt its shape to the characteristics of the prefixes used. An HTT combines a density-adaptive trie (DAT), a memory efficient multi-bit trie, and a leaf bucket (LB), which derives from a tree leaf.
A. Density-Adaptive Trie
The proposed density-adaptive trie revisits the concept of the multi-bit trie. It does so by adapting the number of nodes created based on the prefix density, which refers to the sparseness of non-empty nodes within a trie level, and to the prefix replication factor on contiguous nodes within a trie level.
Similarly to a k-bit trie, a DAT is applied iteratively on sub-tries, extracted from a binary trie holding a prefix set. A DAT encodes the leaves of a sub-trie after pushing prefixes to the leaves, as shown in Fig. 4 with the dashed arrows. The method to extract sub-tries from a binary trie is presented in Section V-C.2.
The main idea behind a DAT is to merge contiguous empty leaves or leaves recording the same prefix, as shown in Fig. 4a, and encode them in a single merged node, as shown in Fig. 4b. By contrast, a k-bit trie encodes each sub-trie leaf into a node, as illustrated in Fig. 4a.
To determine if two contiguous leaves can be merged, the number of prefixes covered by each leaf is evaluated, which includes the prefix held in the leaf and the number of prefixes held in the branch down the leaf. Two leaves are merged when the number of prefixes covered by the merged node is either (1) smaller than a fixed threshold, namely the size of a leaf bucket (see V-B), or (2) is not higher than the largest number of prefixes covered by each of the two leaves.
The merging method is applied from the leaves at both edges of a sub-trie, toward the center. For instance, in Fig. 4a , the merging method starts with the leaves associated to nodes N 0 and N 7 . For both directions, this method evaluates whether a leaf can be merged with the next contiguous leaf. The merging process is repeated until a leaf can no longer be merged with its next contiguous leaf. Otherwise, the method is repeated from the last leaf that was left unmerged.
The merging method has two constraints with respect to the number of merged leaves. First, a merged node encodes only a number of contiguous leaves that is a power of two, as the space covered by the merged nodes is represented using the prefix notation. Second, the total number of merged nodes is bounded by the adaptive trie node size.
The leaf indices evaluated (merged or not) by the merging method are encoded in the LtoH and HtoL arrays. The LtoH and HtoL arrays hold the index of leaves traversed from low to high indices, and high to low indices, respectively. All leaves not evaluated by the merging method are left unmerged, and are said to be encoded in an unmerged zone.
The benefits of the merging method used in a DAT are illustrated in Fig. 4, where the number of nodes stored in memory after encoding a sub-trie is reduced from 8 using a 3-bit trie in Fig. 4a down to 2 using a DAT, as illustrated in Fig. 4b. With a DAT, both contiguous leaves storing prefix P1 are merged, but also contiguous empty leaves.
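The effect of merging can be illustrated with a deliberately simplified sketch. It does not implement the DAT merging method described above: it only merges contiguous leaves that hold exactly the same value (or are all empty), scans from low to high indices only, and ignores the prefix-count criterion and the node size bound. It does keep the constraint that a merged node covers a power-of-two, prefix-aligned block of leaves. The function name merge_leaves and the leaf contents are hypothetical.

```python
def merge_leaves(leaves):
    """Merge identical contiguous leaves into aligned power-of-two blocks.

    Returns a list of (first_leaf_index, block_size, value) tuples.
    """
    n = len(leaves)
    merged = []
    i = 0
    while i < n:
        size = 1
        # Double the block while it stays aligned, in range and uniform.
        while (i % (size * 2) == 0 and i + size * 2 <= n
               and all(leaf == leaves[i] for leaf in leaves[i:i + size * 2])):
            size *= 2
        merged.append((i, size, leaves[i]))
        i += size
    return merged

# Eight leaves of a 3-bit sub-trie after leaf pushing (illustrative content).
nodes = merge_leaves(["P1", "P1", None, None, None, None, None, None])
```

On this input, the eight leaves collapse into three blocks; the alignment constraint is what prevents the six empty leaves from forming a single block.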
A DAT is tailored for memory efficiency by merging contiguous nodes recording the same information. However, in Fig. 4b , a DAT requires another level to separate the prefix P 2 from prefix P 1 . Hence, to reduce the depth of a DAT, when the number of prefixes covered by a node is below a threshold value b, prefixes are encoded in a leaf bucket.
B. Leaf Bucket
A leaf bucket (LB) stores a set of up to b distinct prefixes and their NHI, covered by a DAT node and in the branch below it. The proposed LB derives from a tree leaf, but only the prefix bits left unmatched are stored. The number of unmatched bits for each prefix is encoded in the LB, as well as the number of prefixes stored. By reducing the amount of information stored, the proposed LB improves the memory efficiency over a tree leaf and requires fewer memory accesses to be read. The interest of using an LB to encode prefixes is twofold. When a sub-trie holds up to b sparsely distributed prefixes at the leaves, an LB requires fewer nodes to store the prefixes compared to a DAT. In addition, because most PLS groups hold very few prefixes (see Section VIII-A), an LB can encode a PLS group in a single node.
In Fig. 4c, DAT nodes covering two or fewer prefixes are encoded in LBs. Hence, using an HTT, which combines a DAT with LBs, the prefix set shown in Table 3 is encoded in a DAT root node with two child nodes LB 0 and LB 1, as shown in Fig. 5. The content of each LB is also presented in Fig. 5.
C. HTT Build Procedure
The HTT build procedure is split into two parts: the main procedure and the selection of a sub-trie.
1) Main Procedure: The main procedure is initiated at the root node of a binary trie. If the number of prefixes stored in the binary trie is below a fixed threshold b, prefixes are encoded in an LB. Otherwise, an algorithm (see V-C.2) iteratively selects from the binary trie sub-tries, i.e. binary tries carved out from the main binary trie, that are encoded into HTT nodes.
For a selected sub-trie, the DAT merging method is applied on the leaves. Then, when a DAT node covers up to b prefixes, the prefixes are encoded in an LB. Otherwise, a DAT node is created and the process is repeated on each non-empty branch of the sub-trie.
An illustration is given in Fig. 4c and 5 for the prefix set shown in Table 3. Prefixes are inserted in a binary trie, shown in Fig. 4a. Then, an algorithm, presented in Section V-C.2, selects from the binary trie a sub-trie to be encoded in HTT nodes. In Fig. 4a and 4c, the selected sub-trie is above the dashed red line. The sub-trie leaves are encoded in HTT nodes. The process is then repeated for each sub-trie below the red dashed line.
2) Selection of a Sub-Trie: A greedy algorithm [39] , presented as Algorithm 1, is used to extract at each iteration a subtrie from which a memory-efficient and shallow HTT is built.
The greedy algorithm extracts from a binary trie the deepest sub-trie respecting a memory consumption constraint. Because a deep sub-trie is selected at each iteration, a data structure with few levels is built, which in turn requires few memory accesses to traverse.
Starting from a binary trie node used as the root of the selected sub-trie, the algorithm increases iteratively the subtrie depth until a memory consumption constraint is violated.
The space measurement factor (Smpf ), used as a memory consumption constraint, is evaluated as the number of prefixes covered in the selected sub-trie and its branches multiplied by a constant, set here to 8 (line 1). Using a higher constant value favors the selection of a deeper sub-trie, at the cost of higher memory consumption. The constant used here was selected experimentally to provide a good trade-off between the subtrie depth and the memory consumption. 
Algorithm 2: child node address computation (only fragments of the listing are recoverable)
... Matched node in a non-merged zone? ...
13: return Child node address = base address + offset
At each iteration, a space measurement (Sm) function estimates the memory consumption of the selected sub-trie by counting the number of prefixes held at the leaves and held in the branches (NumPrefixes(leaf_j)), to which is added a penalty term discussed below. The number of prefixes held at the leaves is evaluated after pushing the prefixes to the leaves of the sub-trie.
The heuristic selects a sub-trie enclosing the average prefix length of the branch. If the average prefix length in the selected branch is higher than the prefix lengths covered by the subtrie, Sm will grow at a slow rate, and a deeper sub-trie will be selected. To avoid the selection of a very deep sub-trie when Sm remains unchanged for many iterations, a penalty term is added to Sm, and this penalty is defined as the sum of Sm(i − 1) and the number of leaves of the selected sub-tries.
D. HTT Lookup Procedure
The HTT lookup algorithm starts with a traversal of the density-adaptive trie until a leaf bucket is reached. Then, the prefixes held in a leaf bucket are matched against a destination IP address, and the NHI of the longest matching prefix is returned.
1) Density-Adaptive Trie Traversal: When a DAT node is read from memory, the sub-trie encoded in the DAT node must be decoded to compute the address of the child node to read next. Using the LtoH and HtoL arrays stored in a DAT node, the sub-trie merged leaves can be identified.
In a DAT, the one-to-one mapping between a sub-trie leaf and a node illustrated in Fig. 4a for a k-bit trie does not hold because of the merging method, as shown in Fig. 4b and 4c. As a result, the memory location of a child node is a function of the number of merged leaves preceding the selected child node. The procedure to compute the child node address is presented in Algorithm 2. The address of the child node to visit next is computed in two steps. First, the zone holding the child node (LtoH array, HtoL array, or the unmerged zone) is identified as well as the child node index in the selected zone. Second, the number of merged leaves prior to the matched node index is evaluated to derive its address.
First, a segment of the IP address is extracted to select a sub-trie leaf whose index is equal to the IP address segment. Both LtoH and HtoL arrays are used to identify whether the selected leaf index is covered by a merged node in the LtoH zone, the HtoL zone or the unmerged zone (lines 2, 6 and 9). In addition, if the LtoH zone or HtoL zone is selected, the array index of the merged node that contains the selected leaf index is identified (lines 3 and 7). The array index in the LtoH or HtoL array is noted p_LtoH and p_HtoL, respectively. If the matched node is held in the unmerged zone, the array index is the IP address segment.
Second, using the number of merged leaves preceding the child node index, the child node offset is derived. If the child node is in the LtoH zone, the offset is given directly by p_LtoH (line 4). In an unmerged zone, the number of merged leaves, computed using the LtoH array, is subtracted from the IP address segment to obtain the child node offset (line 10). In the HtoL zone, the number of merged leaves, evaluated using both the LtoH and HtoL arrays, is subtracted from the child node index p_HtoL (line 8). Lastly, the offset is added to the children base address (line 13).
Algorithm 2 is illustrated in Fig. 6 for the case L = 3 and IP_seg = 10. The IP address segment matches a node held in the HtoL zone, as IP_seg ≥ HtoL[0] (line 6). The node matched within the HtoL zone is stored at the array index p_HtoL = 1 (line 7), as 9 ≤ IP_seg ≤ 10. The number of merged leaves up to the matched node is computed using both the HtoL and LtoH arrays (line 8). Based on the LtoH array, the number of merged leaves is LtoH[L − 1] − (L − 1) = 1 (line 9). As p_HtoL + HtoL[0] = IP_seg, no leaves are merged in the HtoL zone preceding the matched node. Thus, the offset of the matched child node is IP_seg − 1 = 9.
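The principle behind the offset computation, namely subtracting the leaves absorbed by merged nodes that precede the selected leaf, can be sketched as follows. This is not Algorithm 2: instead of the LtoH and HtoL arrays, the sketch assumes a hypothetical list of merged runs given as (first leaf index, number of leaves) pairs, and the names child_offset and child_address are illustrative.

```python
def child_offset(leaf_index, merged_runs):
    """Offset of the child node selected by leaf_index.

    merged_runs lists (first_leaf_index, size) for every merged node; each
    run of `size` leaves occupies a single node in memory.
    """
    absorbed = 0
    for start, size in merged_runs:
        if start + size <= leaf_index:
            absorbed += size - 1            # run entirely before the leaf
        elif start <= leaf_index:
            absorbed += leaf_index - start  # leaf lies inside this run
    return leaf_index - absorbed

def child_address(base_address, leaf_index, merged_runs):
    """Child node address = base address + offset."""
    return base_address + child_offset(leaf_index, merged_runs)
```

For instance, with a single two-leaf run located before leaf index 10, one leaf is absorbed ahead of the selected leaf and the offset is 9, which matches the outcome of the worked example above.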
2) Leaf Bucket Matching: The DAT is traversed until a leaf bucket is reached. The leaf is first parsed, and then prefixes are read. Next, all prefixes are matched against the destination IP address, and their prefix length is recorded if matches are positive. When all the prefixes are matched, only the longest prefix matched is returned with its NHI.
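Leaf bucket matching can be sketched as a scan over the stored suffixes. The layout is an assumption of this sketch: each bucket entry is a tuple (unmatched prefix bits as an integer, number of unmatched bits, NHI), and the function name match_leaf_bucket is hypothetical. The address bits not yet consumed by the DAT traversal are passed as an integer together with their count.

```python
def match_leaf_bucket(bucket, remaining_bits, remaining_len):
    """Return the NHI of the longest prefix in the bucket matching the address.

    bucket: list of (suffix_bits, suffix_len, nhi) entries.
    remaining_bits: address bits left after the trie traversal, as an integer
    of remaining_len bits.
    """
    best = None
    for suffix_bits, suffix_len, nhi in bucket:
        if suffix_len > remaining_len:
            continue  # prefix longer than the bits left to match
        # Compare the top suffix_len bits of the remaining address bits.
        if remaining_bits >> (remaining_len - suffix_len) == suffix_bits:
            if best is None or suffix_len > best[0]:
                best = (suffix_len, nhi)
    return None if best is None else best[1]

# Illustrative bucket holding two prefixes with 1 and 3 unmatched bits.
bucket = [(0b1, 1, "A"), (0b101, 3, "B")]
```

All entries are compared before a result is returned, so the order in which prefixes are stored in the bucket does not affect the outcome.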
VI. UPDATE SUPPORT
First, an analysis of the update characteristics is presented, then the cost of updates using the SHIP algorithm is evaluated.
Similar to other works, an offline procedure identifies the nodes of the SHIP data structure to modify [9] , [22] , [28] , [40] upon receiving a prefix update.
A. Analysis of the Prefix Updates
Three types of prefix updates can be applied on a FIB: NHI modification of an existing prefix, prefix insertion, and prefix deletion. The updates received by the RIS remote route collector rrc00 [7], from 2017/12/1 to 2017/12/2 inclusive, are shown in Fig. 7. Based on Fig. 7, peaks of 2000 NHI modifications per second can be observed, whereas the peak prefix insertion and deletion rates are relatively similar at around 550 updates per second.
However, the effective deletion and insertion rates can be significantly reduced. Indeed, as a router can exchange network state information with multiple routers, the same update information can be received by a single router multiple times. Moreover, previous works have shown that the mean time observed to recover after a network link failure is in the order of minutes [40]. As the information is received multiple times from neighboring routers, a prefix can be withdrawn and inserted many times within minutes following a network link failure. In summary, by delaying deletions and pushing a single update per prefix per timestamp, the effective prefix addition and deletion rates drop to less than 20 updates per second.
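The idea of pushing a single update per prefix per timestamp can be sketched as a last-writer-wins filter. The text does not detail the mechanism, so the policy below is an assumption: for each (timestamp, prefix) pair, only the most recent operation is kept. The function name coalesce and the update tuples are hypothetical.

```python
def coalesce(updates):
    """Keep a single update per prefix per timestamp (the last one received).

    updates: iterable of (timestamp, prefix, operation) tuples in arrival order.
    """
    latest = {}
    for timestamp, prefix, operation in updates:
        latest[(timestamp, prefix)] = operation  # later updates overwrite earlier ones
    return [(t, p, op) for (t, p), op in latest.items()]

# A prefix withdrawn and re-announced within the same timestamp.
updates = [
    (1, "2001:db8::/32", "delete"),
    (1, "2001:db8::/32", "insert"),
    (1, "2001:db8:1::/48", "insert"),
    (2, "2001:db8::/32", "delete"),
]
effective = coalesce(updates)
```

In this toy trace, four received updates reduce to three effective ones; redundant copies received from several neighbors would be absorbed in the same way.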
B. Update Cost
We now evaluate the cost of updates by counting the number of memory accesses to modify the SHIP data structure. Let C_HTT and C_hash_table be the update cost of the HTTs and the hash table, respectively. The insertion or deletion of a prefix can trigger a hash table update and an update of an HTT, independently of the prefix length. If the 23 MSBs of the updated prefix are not already associated with an ABB held in the hash table, both the hash table and the HTT must be updated. Otherwise, only an HTT is updated. C_hash_table is first evaluated. When a prefix update requires deletion of an ABB in the hash table, a single memory access is required to invalidate the associated entry. However, the entire hash table must be rebuilt to add an ABB.
To reduce the update complexity, we propose to build a hash table holding all the ABBs associated with the current IPv6 unicast address space [37]. Thus, when a prefix is added, the associated ABB entry is enabled with a single memory access. In addition, less than a single update per second was observed to apply to the hash table. Hence, C_hash_table = insertion rate × memory accesses per insertion = 1. The hash table holding all the ABBs uses 54 kB, which is negligible compared to the SHIP data structure size. In addition, the IPv6 unicast address space has not been modified since 2006, and its usage is still extremely low, making the proposed method valid for future IPv6 network growth.
The cost of updates is now evaluated for the HTTs. When a prefix NHI is updated, a single memory access is required to update an HTT leaf. However, prefix insertions or deletions require rebuilding a portion of an HTT, or a complete HTT. On all the benchmarks used, the number of nodes in an HTT is smaller than the number of prefixes held. As a result, the number of nodes to modify after a prefix update is at most equal to n, the number of prefixes held in an HTT. Because most of the prefix updates observed are applied to the HTTs, C_HTT = insertion rate × n + deletion rate × n + NHI update rate.
The cost of updates on the SHIP data structure is C_HTT + C_hash_table = 20 × n + 20 × n + 2000 + 1. Hence, the update complexity of the SHIP architecture is O(n) ≈ O(N), where N is the number of prefixes in the FIB. Using our largest prefix table, the largest number of prefixes encoded in an HTT is n = 7,000. Thus, in the worst case, up to 282,001 memory accesses are required each second to update the SHIP data structure.
Handling up to 282,001 memory accesses per second in the worst case has a very limited impact on the lookup rate, which is on the order of several hundred million lookups per second.
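The arithmetic above can be reproduced with a short sketch. The function and parameter names are illustrative; the rates (20 insertions, 20 deletions, and 2,000 NHI updates per second) are those appearing in the cost expression. The share of lookup slots is only indicative: it assumes that each update memory access displaces exactly one lookup, which the text does not state.

```python
def ship_update_cost(n, insertion_rate=20, deletion_rate=20,
                     nhi_update_rate=2000, hash_table_cost=1):
    """Worst-case memory accesses per second needed to update SHIP.

    n is the largest number of prefixes held in a single HTT.
    """
    # Each insertion or deletion may rebuild up to n HTT nodes;
    # each NHI update touches a single HTT leaf.
    c_htt = insertion_rate * n + deletion_rate * n + nhi_update_rate
    return c_htt + hash_table_cost


worst_case = ship_update_cost(n=7000)
print(worst_case)  # 282001

# Assumption: one update access costs one lookup slot.
lookup_rate = 588e6  # lookups per second reported for the UltraScale+
print(f"{worst_case / lookup_rate:.4%} of lookup slots")
```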
To update the SHIP data structure in the proposed pipelined hardware architecture, write bubbles are used. Write bubble insertion [40] is a technique widely adopted to push updates to a pipelined hardware architecture. A write bubble holds a triplet (pipeline stage, memory address, value) that is inserted in the pipeline. When a write bubble reaches the pipeline stage specified in its triplet, control circuitry updates the memory address with the new value held in the triplet.
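A minimal software model of this mechanism, not the hardware implementation, might look as follows. The class names, the zero-based stage numbering, and the choice to discard a bubble once it has been applied are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WriteBubble:
    stage: int    # pipeline stage whose memory must be written (0-based)
    address: int  # memory address within that stage
    value: int    # new value to store


class Pipeline:
    def __init__(self, num_stages, mem_size):
        # One private memory per pipeline stage.
        self.memories = [[0] * mem_size for _ in range(num_stages)]
        # slots[i] holds the item currently in stage i (None if empty).
        self.slots = [None] * num_stages

    def tick(self, incoming=None):
        """Advance every item by one stage and apply due write bubbles."""
        self.slots = [incoming] + self.slots[:-1]
        for stage, item in enumerate(self.slots):
            if isinstance(item, WriteBubble) and item.stage == stage:
                self.memories[stage][item.address] = item.value
                self.slots[stage] = None  # bubble consumed (assumption)


pipe = Pipeline(num_stages=4, mem_size=8)
pipe.tick(WriteBubble(stage=2, address=5, value=42))  # bubble enters stage 0
pipe.tick()                                           # bubble in stage 1
before = pipe.memories[2][5]                          # still the old value
pipe.tick()                                           # bubble reaches stage 2
print(before, pipe.memories[2][5])  # 0 42
```

In this model a bubble occupies one pipeline slot per cycle, so lookups issued before and after it proceed undisturbed.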
VII. PERFORMANCE MEASUREMENT METHODOLOGY
SHIP's performance is evaluated using eleven real prefix tables, holding approximately 25 k prefixes, extracted from the RIS remote route collectors [7] . Each scenario, noted rrc followed by a two-digit number, characterizes the location in the network of the remote route collector used. In addition, synthetic prefix tables were also used. One synthetic 580 k prefix table was generated using a non-random method [41] . Four smaller prefix tables were created from the 580 k prefix table, with a similar prefix length distribution, holding respectively 290 k, 116 k, 58 k and 29 k prefixes.
Because SHIP uses two-level prefix grouping to cluster prefixes, the prefix distribution after clustering is evaluated in Section VIII-A. The prefix distribution is evaluated using a small real prefix table, rrc00, and the largest prefix table with 580 k prefixes. SHIP performance is characterized by the number of memory accesses to traverse the data structure and its memory footprint. Two cases were considered: one without clustering, i.e., where a single HTT encodes all prefixes, and one where prefixes are clustered using two-level prefix grouping and encoded in multiple HTTs. For the second case, the number of PLS groups K ranges from 1 to 6. The results are reported in Sections VIII-B and VIII-C.
The memory bus of the presented architectures allows one HTT node to be read per clock cycle. In addition, the selected K HTTs within an ABB are traversed in parallel. The reported number of memory accesses is the largest number of memory accesses over all the HTTs of all the ABBs.
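As an illustration of how the reported figure is obtained, the sketch below takes the depth of each of the K HTTs selected within every ABB and returns the largest one. The dictionary layout and the depth values are hypothetical, and the sketch assumes that traversing an HTT of depth d costs d memory accesses (one node read per level).

```python
def reported_memory_accesses(htt_depths_per_abb):
    """Largest number of memory accesses over all HTTs of all ABBs."""
    # The K HTTs of an ABB are traversed in parallel, so only the
    # deepest HTT of the deepest ABB determines the reported value.
    return max(max(depths) for depths in htt_depths_per_abb.values())


# Hypothetical depths for K = 2 HTTs in three ABBs.
depths = {"abb_0": [1, 2], "abb_1": [3, 1], "abb_2": [1, 1]}
print(reported_memory_accesses(depths))  # 3
```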
The HTT memory consumption is given in bytes per byte of prefixes to capture the data structure overhead. This metric is evaluated as the size of the data structure divided by the sum of the size of each prefix held in the prefix table.
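The metric can be written as a one-line ratio. The sketch below is illustrative only: it assumes that the size of a prefix is its length in bits divided by eight, which the text does not specify, and the data structure size and prefix lengths are hypothetical.

```python
def bytes_per_prefix_byte(structure_size_bytes, prefix_lengths_bits):
    """Data structure size divided by the total size of the prefixes."""
    # Assumption: a /L prefix occupies L / 8 bytes.
    total_prefix_bytes = sum(length / 8 for length in prefix_lengths_bits)
    return structure_size_bytes / total_prefix_bytes


# Hypothetical example: three prefixes (/32, /48, /64) total 18 bytes,
# stored in a 27-byte data structure.
print(bytes_per_prefix_byte(27, [32, 48, 64]))  # 1.5
```

A value of 1.0 would mean the data structure adds no overhead over the raw prefix bits.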
To characterize the HTT efficiency per level, the HTT node distribution, the HTT depth distribution, and the prefix distribution are evaluated. Only the maximal depth is considered for the last two metrics. Combined, these metrics make it possible to evaluate the average number of prefixes encoded per HTT node, which directly reflects the HTT efficiency. These metrics are evaluated using K = 2 PLS groups, similarly to the parameters used for the clustering analysis and for the FPGA implementation, presented in Section IX.
VIII. RESULTS

A. Prefix Distribution Within Clusters
As presented in Section IV, the motivation to bin prefixes on their MSBs lies in the IPv6 address structure [37].
In this section, we demonstrate experimentally that prefixes can be binned using an ABB width set to 23, i.e. the 23 MSBs.
In Fig. 8, the impact of the ABB width is evaluated on the prefix distribution, the number of bins, and the number of prefixes. In this figure, the two clustering methods, ABB and PLS, are applied to prefixes. The number of ABB bins and the number of prefixes are normalized for an ABB width set to 23 bits, because the proposed clustering method bins prefixes on their 23 MSBs.
Using real prefixes, an ABB width equal to or greater than 23 bits distributes the prefixes more evenly among the PLS groups, as shown in Fig. 8a. Using ABB widths greater than 23 bits does not help to reduce the maximum number of prefixes held within a PLS group, as illustrated in the box plot with outliers. In addition, using ABB widths greater than 23 bits has little to no impact on the total number of prefixes (up to 26 bits), but the number of bins grows superlinearly.
Using synthetic prefixes, a more even distribution is observed for ABB widths greater than or equal to 24 bits. Although using an ABB width greater than 23 bits has very little impact on the total number of prefixes, the number of bins has a superlinear growth with the ABB width.
In conclusion, our experiments with real and synthetic prefixes have shown that using 23 bits is a good trade-off.
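The binning step evaluated in this subsection can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the example prefixes are arbitrary, and prefixes shorter than the ABB width are simply rejected because their handling is not described here.

```python
import ipaddress
from collections import defaultdict

ABB_WIDTH = 23  # number of MSBs used to select a bin


def abb_key(prefix, width=ABB_WIDTH):
    """Return the `width` most significant bits of an IPv6 prefix."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen < width:
        # Assumption of this sketch: shorter prefixes are not handled.
        raise ValueError(f"{prefix} is shorter than {width} bits")
    return int(net.network_address) >> (128 - width)


def bin_prefixes(prefixes, width=ABB_WIDTH):
    """Group prefixes that share the same `width` MSBs."""
    bins = defaultdict(list)
    for prefix in prefixes:
        bins[abb_key(prefix, width)].append(prefix)
    return dict(bins)


table = ["2001:db8::/32", "2001:db8:1::/48", "2a00:1450::/32"]
bins = bin_prefixes(table)
print(len(bins))  # 2: the two 2001:db8 prefixes share their 23 MSBs
```

Calling `bin_prefixes` with a larger `width` on a real table would show the trade-off discussed above: the number of bins grows while the largest bin barely shrinks.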
B. ABB Hash Table
The performance of the hash table recording the ABB pointers is reported in Table I. For real prefixes, the ABB method uses between 19 kB and 24 kB. The memory consumption is similar across all the scenarios tested because prefixes share most of the 23 MSBs. Using synthetic prefixes, on average, 2.7 bytes of memory per prefix byte are used for the 5 scenarios evaluated. The hash table shows a linear memory consumption scaling. The number of memory accesses is constant at 2 by construction for all scenarios, because a perfect hash function is used.
C. HTTs
Real Prefixes: The performance of HTTs is presented in Fig. 9a and 9b. Two-level prefix grouping reduces the memory consumption and smooths its variability. The memory consumption of the HTTs ranges from 1.36 to 1.60 bytes per prefix byte for all scenarios with two-level prefix grouping. By contrast, without clustering, the memory consumption ranges from 1.22 to 3.15 bytes per prefix byte for a single HTT encoding the whole prefix set. Fig. 9a shows that increasing K up to 3 reduces the memory consumption, whereas using more groups worsens it. Indeed, when K is increased, most groups hold very few prefixes, which leads to a large portion of the memory allocated to the HTT being unused.
The number of memory accesses is reduced on average by 2× using the clustering method. While the number of memory accesses ranges between 9 and 18 without clustering for a single HTT, it ranges between 6 and 9 using two-level prefix grouping, as observed in Fig. 9b. However, increasing K from 1 to 6 yields little gain on the number of memory accesses. Indeed, we observed that the number of memory accesses is limited by a few PLS groups holding only /48 prefixes. Hence, increasing K cannot reduce the number of prefixes held in these PLS groups, which results in little improvement in the number of memory accesses.
Table II shows that the average HTT depth is 1.4 and 2.1 for PLS groups 1 and 2, respectively. Around 75% of HTTs have a depth equal to one because 75% of the PLS groups hold fewer than 2 prefixes, as presented in Fig. 8. Indeed, in an HTT, two prefixes can be stored in a single node. Experimentally, an average of 1.3 and 1.4 prefixes are encoded per HTT node at level one for PLS groups 1 and 2, respectively. For levels greater than one, the average number of prefixes encoded per HTT node is 1.5 and 1.0 for PLS groups 1 and 2, respectively. Based on Fig. 8, HTTs with a depth greater than one encode more than 75% of the prefixes.
Synthetic Prefixes: The performance of the HTTs with synthetic prefixes is presented in Fig. 10a, 10b, and 10c. Two behaviors can be observed for the memory consumption in Fig. 10b. For prefix tables holding 290 k prefixes or more, using two-level prefix grouping with K = 2 groups slightly decreases the memory consumption over the case of a single HTT (i.e., without clustering). In addition, using K > 2 does not improve memory efficiency. For smaller prefix tables with up to 116 k prefixes, a lower memory consumption is achieved using only a single HTT (i.e., without clustering). Indeed, using synthetic prefix tables holding up to 116 k prefixes, most of the PLS groups hold a single prefix. Thus, for each PLS group, a large portion of the memory allocated to the HTT is unused, which reduces the memory efficiency.
The memory consumption scaling of the HTTs is presented in Fig. 10c . This memory consumption, with and without two-level prefix grouping, grows linearly with the number of prefixes. Note that the abscissa uses a logarithmic scale. Thus, the memory consumption scaling of the HTT is linear with and without two-level prefix grouping.
Based on Fig. 10a , using two-level prefix grouping reduces the number of memory accesses over a single HTT. On average, the number of memory accesses is reduced by 40% over a single HTT using 2 groups or more. However, using K > 3 does not further reduce the number of memory accesses. Indeed, the PLS groups causing the largest number of memory accesses cannot be reduced in size by increasing the number of groups. Finally, Fig. 10a shows that the increase in the number of memory accesses for a search is at most logarithmic with the number of prefixes, since each curve is approximately linear and the x-axis is logarithmic.
In terms of HTT efficiency, the same conclusions can be drawn with synthetic prefixes. Based on Table II, the average HTT depth is 1.5 for both PLS groups 1 and 2. 80% of HTTs have a depth equal to one, which is consistent with the prefix distribution presented in Fig. 2. Experimentally, on average, 1.5 and 1.3 prefixes are respectively encoded per HTT node at level one for PLS groups 1 and 2. For levels greater than one, the average number of prefixes encoded per HTT node is 2.0 and 2.1 for PLS groups 1 and 2, respectively.
Lastly, the HTTs used with two-level prefix grouping have a linear memory consumption scaling, and a logarithmic scaling for the number of memory accesses. The hash table used in the ABB method has been shown to offer a linear memory consumption scaling and a fixed number of memory accesses. Thus, SHIP has a linear memory consumption scaling and a logarithmic scaling for the number of memory accesses using all the considered benchmarks.
IX. FPGA IMPLEMENTATION
Two FPGA hardware implementations of the SHIP architecture were developed and characterized. One was optimized to reduce latency while the other was optimized to offer high throughput.
A. Architectures Overview
The two proposed architectures implement a pipelined traversal of the SHIP data structure. The HTT traversal module of both architectures is divided into K = 2 parallel pipelines. Each pipeline i ∈ [1, ..., K] performs the traversal of the HTT selected by the root pointer address held in the i-th PLS group.
For the low-latency architecture, the HTT traversal module pipeline is divided into d stages, such that each pipeline stage is dedicated to the traversal of a single HTT level. A pipeline stage j ∈ [1, . . . , d] is composed of a Traversal Engine (TE) and a memory, holding the HTT nodes of level j of all the ABBs (Bin 1 to Bin M ). The traversal engines of the first d − 1 levels are used to process only DAT nodes, whereas LB nodes processing is only done at the level d. The low-latency lookup architecture was presented in a previous paper [42] .
To increase the lookup rate over the low-latency architecture, the high-throughput architecture divides each pipeline stage of the low-latency architecture into multiple stages. By reducing the number of logic levels in each stage, shorter clock periods can be obtained, which in turn increases the lookup rate.
The two architectures were described in C++, synthesized with Vivado High Level Synthesis (HLS) 2018.1, and implemented using Vivado 2018.1 with the Virtex7 VX690TFFG1761 and the UltraScale+ XCVU9P-FLGA2577. The method used to describe accurately the SHIP architecture using C++ is presented in [42] .
B. Results
For each design presented in Table III, four performance metrics are of interest: the lookup rate, the latency, the on-chip memory usage (BRAM), and the logic usage. The lookup latency, or wall-clock time, is evaluated as the clock period multiplied by the number of pipeline stages.
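The latency definition above, together with the usual relation between clock period and throughput, can be sketched as follows. The lookup-rate function assumes that the pipeline accepts one lookup per clock cycle, and the numbers in the example are hypothetical rather than taken from Table III.

```python
def lookup_latency_ns(clock_period_ns, num_stages):
    """Latency = clock period multiplied by the number of pipeline stages."""
    return clock_period_ns * num_stages


def lookup_rate_mpps(clock_period_ns):
    """Millions of lookups per second, assuming one lookup per cycle."""
    return 1e3 / clock_period_ns


# Hypothetical design: 4 ns clock period, 50 pipeline stages.
print(lookup_latency_ns(4.0, 50))  # 200.0
print(lookup_rate_mpps(4.0))       # 250.0
```

Under these assumptions, splitting a stage into several shorter ones raises the lookup rate but only lowers the latency if the clock period shrinks faster than the stage count grows.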
The results of the low-latency architecture are discussed in a previous paper [42] . In this section, the performance of the high-throughput architecture is first discussed. Then, the overhead of the high-throughput architecture is compared to the low-latency architecture.
1) High-Throughput Architecture: The lookup rate of the high-throughput architecture implemented on a Virtex7 degrades with the prefix table size. Using 25 k prefixes, up to 344 Mpps is supported, while the lookup rate is reduced to 263 Mpps when 290 k prefixes are used, and is further reduced to 166 Mpps with the largest prefix table.
The lookup rate degradation with the prefix table size relates to the distribution of the on-chip memories in the FPGA. On-chip memories, block RAMs, are divided in columns. When a level of the SHIP data structure is mapped to more than one block RAM column, the clock period increases due to the routing delay to access multiple columns. As a direct consequence of an increased routing delay, the lookup rate is reduced.
The latency also increases with the prefix table size as a consequence of the clock period degradation for larger scenarios. For prefix tables holding up to 290 k prefixes, the lookup latency ranges between 189.5 and 246.4 ns, while the lookup latency increases to 396.3 ns for the largest prefix table.
The block RAM usage is almost linear with the prefix table size, which is a consequence of the linear memory consumption observed in Fig. 10c .
As a consequence of the higher consumption of block RAMs for larger scenarios, the FPGA logic usage increases with the prefix table size. Between the smallest scenario and the largest scenario, the logic usage increases by up to 61%. Indeed, the circuitry needed to combine individual block RAMs into a large memory consumes FPGA logic.
As the high-throughput architecture performance on the Virtex7 is limited by routing delays, we evaluated this architecture on a state-of-the-art FPGA, an UltraScale+, which features deeper block RAM columns and reduced routing delays. On this FPGA, the clock period is reduced by 1.7× over a Virtex7 for the largest scenario. Consequently, the lookup rate is increased to 294 Mpps, and the lookup latency is reduced to 219.2 ns. In addition, the logic consumption is reduced by 1.6× over the Virtex7 due to the modified internal organization of the UltraScale+, which allows large memories to be built in the block RAM column without additional use of logic resources.
To further increase the lookup rate, the high-throughput architecture can be modified to double the lookup rate. The proposed technique consists in using two parallel lookup engines sharing dual-port block RAMs available on both the Virtex7 and the UltraScale+. A single dual-port memory block can serve simultaneously the two lookup engines. As a result, the high-throughput architecture with dual-port memory blocks can support 332 Mpps and 588 Mpps on the Virtex 7 and on the UltraScale+, respectively. However, because two parallel lookup engines are implemented, the logic consumption is almost doubled.
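The doubling can be checked against the single-engine figures quoted earlier in this section for the largest prefix table (166 Mpps on the Virtex7 and 294 Mpps on the UltraScale+). The sketch assumes ideal scaling, i.e., that the two engines never contend for a memory port; the names are illustrative.

```python
# Single-engine lookup rates (Mpps) for the largest prefix table.
SINGLE_ENGINE_MPPS = {"Virtex7": 166, "UltraScale+": 294}


def dual_port_rate_mpps(single_engine_mpps, engines=2):
    """Aggregate rate of `engines` lookup engines sharing dual-port BRAMs."""
    # Assumption: each engine owns one port, so rates simply add up.
    return single_engine_mpps * engines


dual = {fpga: dual_port_rate_mpps(rate)
        for fpga, rate in SINGLE_ENGINE_MPPS.items()}
print(dual["Virtex7"], dual["UltraScale+"])  # 332 588
```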
2) Overhead of the High-Throughput Architecture: The high-throughput architecture almost triples the lookup rate over the low-latency architecture when using a Virtex7, and increases it by more than 3.3× using an UltraScale+, as shown in Table III. However, the high-throughput architecture uses up to 1.8× more FPGA logic than the low-latency architecture because of its deeper pipeline, which requires FPGA logic to synchronize intermediate results in each pipeline stage. Still, the increased lookup rate of the high-throughput architecture outweighs the logic consumption overhead compared to the low-latency architecture. In addition, the on-chip memory usage is not impacted by the pipeline depth, and thus, the consumption of block RAMs remains similar for the two architectures.
The downside of the high-throughput architecture lies in the increased lookup latency compared to the low-latency architecture. As the number of pipeline stages is increased by 5.4×, while the clock period is reduced by less than 3×, the lookup latency increases by 3×, up to 396 ns, for the largest prefix table compared to the low-latency architecture on a Virtex7. Using an UltraScale+, the lookup latency of the high-throughput architecture is only increased by 1.7×, to 219 ns, over the low-latency architecture.
X. COMPARISON WITH PREVIOUSLY REPORTED RESULTS
A. Performance Comparison
To normalize the performance between the various works summarized in Table IV, only the performance of a single lookup engine is reported, without using dual-port memories. Table IV compares the performance of SHIP and previous work in terms of lookup latency, lookup rate, memory consumption, logic consumption, and update complexity. The memory consumption is expressed in bytes per prefix, as the size of the data structure divided by the number of prefixes used. The assumptions used to evaluate the lookup latency of solutions using external memory such as SRAM or DRAM were presented in a previous paper [42].
An FPGA-based TCAM was proposed [10], which supports 84 Mpps and uses around 777.5 equivalent Xilinx block RAMs. Thus, this solution has a 3.5× lower lookup rate than SHIP, while consuming 9× more memory space. However, the update complexity of this solution is lower compared to SHIP.
The Xilinx LPM IP used in SDNet [43] has a low FPGA resource consumption. Compared with our high-throughput architecture with 116 k prefixes, the Xilinx solution has a 1.6× lower lookup rate, but a 1.5× lower lookup latency. In addition, the Xilinx LPM solution requires a shallow data structure to support updates, doubling the reported memory consumption compared to the high-throughput architecture. Moreover, its update complexity is higher compared to SHIP, where only one HTT needs to be rebuilt after a prefix insertion or deletion.
The 2-3 tree architecture [2] has a lookup latency 1.3× higher than the high-throughput architecture implemented on an UltraScale+. In addition, the bandwidth it supports is 1.54× lower than that of our proposed architecture, and it is 87% less memory efficient than the high-throughput architecture. However, the 2-3 tree architecture has an update complexity of O(log(N)), whereas SHIP requires rebuilding an HTT that contains a subset of the prefix table.
Our high-throughput architecture exhibits a memory consumption that is 2.6× lower than the LayeredTree architecture [21] while supporting a 1.45× higher lookup rate. However, the high-throughput architecture has a higher lookup latency and uses 3.2× more FPGA logic. The lower usage of FPGA logic by the LayeredTree architecture is a consequence of the very small prefix table used, which requires very few pipeline stages and block RAMs. In addition, the LayeredTree architecture has an update complexity of O(log(N)), whereas SHIP has an update complexity of O(N).
The memory efficiency of the LC-trie is 4.4× lower than SHIP for the same prefix set. Moreover, for the same prefix set, the LC-trie algorithm requires in the best case 18 memory accesses, whereas SHIP requires 10 memory accesses. The update complexity of an LC-trie is not reported in [25].
The results presented for FlashTrie [13] were reevaluated using the node size equation presented in [13] due to inconsistencies with the equations shown in [9]. In addition, the bandwidth supported by the FlashTrie architecture drops to 88 Mpps when the DRAM timing characteristics are fully considered. Compared to the high-throughput architecture implemented on the UltraScale+, the FlashTrie architecture has a lookup rate 3.3× lower, uses roughly 3.4× less FPGA resources, but consumes 11× more memory and requires 1.34× more time to complete a lookup. The update complexity of FlashTrie is O(N), which is similar to SHIP.
PopTrie [22] uses 72 bytes per prefix while supporting 211 million lookups per second. Thus, PopTrie uses almost 7× more memory per prefix, while supporting 1.4× fewer lookups compared to SHIP implemented on an UltraScale+. The latency for IPv6 prefixes is not reported.
A new method presented by Rétvári et al. [33] yields highly memory-efficient data structures for IPv4 prefixes. Compared to the high-throughput architecture, its memory consumption is roughly 25× lower, but its lookup rate is 42× lower. Thus, the benefit of the compressed data structure does not outweigh the reduced lookup rate compared to SHIP.
The performance reported for the SAIL architecture [28] is not evaluated with a module that identifies the prefix matching for a given prefix length. However, based on an evaluation by the PopTrie authors [22], a complete lookup solution based on the SAIL framework uses 87 bytes per IPv4 prefix, for a table holding 520 k IPv4 prefixes. Thus, even for IPv4 prefixes, SAIL consumes around 8× more memory compared to SHIP on IPv6 prefixes.
Tong et al. proposed the CLIPS architecture [44], extended to IPv6 [4]. Their method uses 27.6 bytes per prefix, which is about 2.5× larger than SHIP. The estimated latency of CLIPS is 544.5 ns, which is 2.4× higher compared to the proposed architecture on the UltraScale+. In addition, CLIPS uses more than 2.4× as many bytes per prefix as the proposed architecture, with a lookup rate 1.6× lower than the high-throughput architecture implemented on an UltraScale+.
B. Impact of the FPGA Generation on the Performance
In this section, we discuss the impact of the FPGA generation on the overall performance, which is ignored in previous work.
A precise comparison would require a full implementation of each work across all FPGA generations, but a simple assumption can be made. In the best case, the performance of the architectures presented in Table IV is limited only by the speed of the on-chip memory, and not by the number of logic levels. Thus, as a first-order approximation, we assume that the architectures presented in Table IV could run at a clock period similar to the SHIP high-throughput architecture on an UltraScale+. Based on this assumption, the performance of the different works is compared against SHIP. The Xilinx LPM solution [43] is not discussed as this solution is limited to 64 k prefixes. Similarly, as the SAIL architecture is not a complete LPM solution, SAIL is discarded from the following discussion.
A lower clock period resulting from the use of a state-of-the-art FPGA would benefit pipelined architectures that are not limited by the performance of some external memory. That is, the lookup rate of most of the architectures presented in Table IV would be equal to the one supported by the high-throughput architecture. However, the lookup rate of the FlashTrie architecture would not be improved, as the performance bottleneck is the external DRAM. Similarly, the compressed data structure architecture is not pipelined, and thus would benefit only slightly from a higher clock rate. Even with a lower clock period, the CLIPS architecture would have a higher latency compared to SHIP, while the memory consumption would remain higher than SHIP. Similarly, the 2-3 tree architecture would have a slightly higher latency compared to SHIP, while the memory consumption would remain 80% higher than SHIP.
The LayeredTree architecture would benefit from a lower clock period, as the latency would be 3× smaller than the high-throughput architecture. However, the memory consumption of LayeredTree would remain 2.6× higher than SHIP.
In conclusion, even if the previously published architectures and the high-throughput architecture could support the same lookup rate, the high-throughput architecture still has a higher memory efficiency and a lower lookup latency. Only one solution would have a lower lookup latency compared to the high-throughput architecture.
XI. CONCLUSION
This paper proposed SHIP, a scalable and high-performance IPv6 lookup algorithm to address the performance requirements of current and future network applications. SHIP exploits the prefix characteristics to create a shallow and memory-efficient data structure. First, the allocated IPv6 address space is used to bin the prefixes efficiently on their MSBs. Within each bin, prefixes are sorted in groups based on the FIB prefix length distribution. Each prefix group is then encoded in a Hybrid Trie-Tree (HTT). The proposed HTT revisits the concept of multi-bit trie, but adapts the number of nodes built to the prefix density to improve the memory efficiency. The trie leaves are transformed into leaf buckets to further improve the memory efficiency.
The proposed data structure supports incremental updates and is mapped efficiently to hardware. A pipelined high-throughput hardware architecture is proposed, which leverages on-chip memories.
Evaluated with real and synthetic prefix tables, holding up to 580 k IPv6 prefixes, SHIP exhibits a logarithmic scaling factor in terms of the number of memory accesses and a linear memory consumption scaling. Compared to other well-known approaches, SHIP reduces the required amount of memory per prefix by 87%. Implemented on an UltraScale+ FPGA, using the largest considered benchmark, the proposed high-throughput architecture can support a throughput of 588 million packets per second.
