The massive growth in the use of the Internet and the development of new real-time applications have put considerable strain on the techniques currently used for the lookup and retrieval of information essential for classification, routing, Quality of Service (QoS) and Internet security. This paper investigates the design and implementation of a number of closest value lookup circuits, suitable for deployment in a range of networking applications. Detailed descriptions of a number of matching circuit architectures are given and the results of hardware implementations for the Altera Stratix II FPGA are discussed and evaluated.
Introduction
The Internet phenomenon has penetrated society to such an extent that it has gone beyond a vague commercial interest to an absolute necessity for many businesses. As fast as this evolution has been, the physical infrastructure of the network was never designed to deal with many of the issues internet service providers are now facing. Customer requirements for lower end-to-end propagation delays, bandwidth and QoS (Quality of Service) guarantees cannot be suitably achieved with the ubiquitous best effort model.
Applications such as VoIP (Voice over Internet Protocol), streaming audio and video, database enquiries and other specialised applications have specific bandwidth and propagation delay requirements. Other traffic such as FTP, HTTP and popular peer-to-peer transfer applications is of a lower priority but cannot currently be differentiated from high priority data due to the way the Internet operates.
The existing network needs to become more sophisticated to allow it to meet the changing desires of different users. Greater bandwidth requires faster transmission of packets, which in turn requires faster search, lookup and other means of information retrieval for individual packets, paths and packet flows. In fact, interactive services are usually based on small packets to reduce end-to-end delays. As a result, the packet classification, lookup and scheduling speed requirements increase even faster than the bandwidth required.
It is difficult for traditional software solutions to handle the speed of lookups required to support the next generation of terabit routers. High-speed data retrieval has become a crucial process for next generation QoS enabled networking and cannot be satisfied with traditional software based solutions. The speed advantage offered by special-purpose solutions has encouraged a steady rise in the number of proposed solutions designed specifically for hardware. The past few years have seen a number of architectures designed and specifically optimized to take full advantage of the unique properties a System on Chip (SoC) implementation can provide. Specifically, this includes greater control over memory management and over the number of accesses per lookup to slow off-chip memory.
In this paper, we investigate the design and implementation of a hardware-based sorter circuit for closest matching value lookup and propose a novel implementation based on a sorter trie architecture, composed of distributed memory blocks for parallel and pipelined sort and lookup. The latest FPGA technology has been chosen due to the embedded memory features and in particular the ability to implement the pipelined search trie.
Related work
Associative memories and associative memory architectures have been widely investigated in the context of network processing and pattern, speech and image recognition. Most of these architectures and implementations were driven by application-related constraints, such as the number of entries, cost and lookup performance.
Existing Associative Memory Implementations
Traditional Content Addressable Memories (CAMs) are "hit or miss" components. An entry is either present or not. Therefore, they have a limited suitability for any closest or non-exact match operation, although a number of such implementations have been attempted. In the approach discussed, a numerical best match is needed, although usually such applications desire a best match in terms of Hamming distance. However, in some cases the distance metric is not a decisive feature so that a change of the distance metric could be achieved with a few modifications. Three different sets of existing implementations can be distinguished, which are now discussed.
Extending Exact-Match CAMs
One widely used approach is to make use of standard CAMs (see Figure 1), which give either an exact match or no match, and to mask different bits of the requested word in subsequent requests to the CAM. At first, no masking bits are set. If there is no match in the CAM, one bit of the requested word is masked and the word is requested again. The masking pattern has to be altered, masking all combinations of one, two, three bits and so on until a match is found. Obviously this is a time-consuming process, especially in the case of a wide data word. This approach is used in a parallel form in image coding using Vector Quantization. The image data is compressed by mapping it to a number of codebook vectors. The mapping process includes a best-match operation [1].
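The masking procedure described above can be sketched in a few lines of Python. This is only a behavioural model under stated assumptions: the CAM is modelled as a plain list of integers, and the names `cam_exact_match` and `best_match_by_masking` are hypothetical, not taken from [1]. The request counter illustrates why the search becomes slow for wide words.

```python
from itertools import combinations

def cam_exact_match(entries, word, mask):
    """Model of one CAM request: compare only the bits that are not masked."""
    for entry in entries:
        if ((entry ^ word) & ~mask) == 0:
            return entry
    return None

def best_match_by_masking(entries, word, width):
    """Mask 0, 1, 2, ... bits of the request until the CAM reports a hit.

    Returns (matching entry, number of masked bits, number of CAM requests).
    """
    requests = 0
    for masked_bits in range(width + 1):
        # Try every combination of 'masked_bits' bit positions.
        for positions in combinations(range(width), masked_bits):
            mask = sum(1 << p for p in positions)
            requests += 1
            hit = cam_exact_match(entries, word, mask)
            if hit is not None:
                return hit, masked_bits, requests
    return None, None, requests
```

For a 4-bit CAM holding `0b1010` and `0b0111`, a request for `0b1011` is answered on the second request, with one masked bit. With an empty CAM, all 16 mask patterns are tried before the search gives up.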
Non-CAM Approaches
Because of the high costs and insufficient performance when retrieving inexact matches from CAMs, other implementations avoid the use of CAM cells and find different solutions. A very regular approach is described in [3]. A basic cell containing a word of memory, a comparison unit and control logic is cascaded in a long pipeline (see Figure 2). The requested word enters the pipeline as an input. Each cell compares the request with its own content and if it fits better than the fit signaled from the previous cell, the cell will now signal its own address to the next cell.
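As a rough illustration of the cascaded-cell idea, the following Python sketch passes a request through a chain of cells, each of which forwards either the best address seen so far or its own. The distance function is an assumption (absolute numerical difference); [3] may use a different fit criterion, and `pipeline_best_match` is a hypothetical name.

```python
def pipeline_best_match(cells, request, distance=lambda a, b: abs(a - b)):
    """Walk the request through a chain of compare cells.

    Each loop iteration models one pipeline stage: the cell compares the
    request with its own content and overrides the forwarded address only
    if its own fit is strictly better than the one signalled so far.
    Returns (best address, latency in pipeline stages).
    """
    best_address, best_distance = None, None
    for address, content in enumerate(cells):
        d = distance(content, request)
        if best_distance is None or d < best_distance:
            best_address, best_distance = address, d
    latency_stages = len(cells)  # one stage per memory word
    return best_address, latency_stages
```

The returned latency equals the number of cells, which mirrors the drawback of this structure for large memories.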
This approach features a predictable (fixed) response time and a high throughput rate since the structure is fully pipelined. However, the latency is proportional to the pipeline length and thus to the memory size and becomes very high for large memories. Another approach described in [2] is used for pattern recognition and is very much dependent on the Hamming metric rather than a closest numerical match as desired. For a given request, the number of matching bits shared by both the subject variable and pre-defined match templates are counted. The Hamming distance is calculated in parallel for all templates using dedicated basic template memory cells and the lowest distance is determined. The design is based on 16 bit values but uses only 16 template patterns. This approach is not scalable to the memory size which is needed in the application discussed here.
Neural Networks
An alternative solution for the best-match problem is the use of neural networks, in particular self-organising feature maps (SOFMs). VLSI implementations of neural networks are not only memories but distributed processing systems with a very large amount of connectivity. The fact that these connections have to be adaptable leads either to a reduced memory density or the use of non-standard VLSI fabrication techniques [3] [4] [5].
For example in [4] the VLSI design uses a mix of both analog and digital components to implement a neural network, which will operate as an associative memory. Large interconnectivity is cited as the reason why a fully digital approach is not pursued. Again this implementation is more focused on pattern matching or pattern recognition, rather than a closest numerical value match.
A discussion of the principles of associative memories using neural networks is beyond the scope of this work.
Architecture
The distinct feature of the proposed closest match lookup architecture is the use of a sorter tree, or so-called "trie", to implement an associative memory, which is able to return closest, not necessarily exact matches. The following architectural study is composed of two parts. The first part targets the principles of information storage and retrieval using a trie structure and the second part explores a number of possible implementations for the matching circuit in terms of area and speed.
Trie structure
The term "trie" is derived from tree and retrieval. It was originally proposed by E. Fredkin [6] as a specialised search tree which stores multiple strings and is well suited for string matching. Since binary representations of numeric values can also be seen as strings containing the literals '0' and '1', the original trie structure can be adapted to solve a large number of numeric lookup problems, e.g. finding the trie entry with the smallest Hamming distance to a given value. The implementation discussed in this paper focuses on the problem of finding the same or next smallest value in the trie compared to a given value. A trie is characterised by its number of tree levels and by its branching factor. The number of levels determines the length of the strings that can be stored in the trie. The branching factor determines the number of different literals of which the strings can consist. Data is not stored in the trie by writing a value in memory, but by setting flags to indicate the presence or absence of a specific value. The last and largest trie level consists of one flag (memory bit) for each possible value that the trie can store.
From an implementation point of view it is favourable to keep the number of levels low. Since each level is usually accessed in one clock cycle, fewer trie levels reduce the latency of the system. Additionally, the area occupied by the trie is mainly determined by the number of memory bits it consists of. Tries with fewer levels have less memory overhead. To reduce the tree depth for binary strings of a given length, two or more bits can be grouped together into one single literal and stored in one trie level (multi-bit trie, branching factor > 2). An example is a multi-bit trie with branching factor four (literals '00', '01', '10' and '11') and three tree levels, allowing strings with three literals, or 6 bits.
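The trade-off between trie depth and memory can be made concrete with a small calculation. The sketch below assumes that level n of the trie holds one flag per possible n-literal prefix, so that the last level has one flag per storable value as stated above. The function name `trie_shape` is hypothetical.

```python
def trie_shape(word_bits, bits_per_literal):
    """Return (levels, flags per level, total flags) for a full flag trie.

    Assumes the word length is a whole multiple of the literal width.
    """
    assert word_bits % bits_per_literal == 0
    branching = 2 ** bits_per_literal
    levels = word_bits // bits_per_literal
    # Level n (counting from 1) needs one flag per possible n-literal prefix.
    flags_per_level = [branching ** n for n in range(1, levels + 1)]
    return levels, flags_per_level, sum(flags_per_level)
```

For 6-bit values, a binary trie needs six levels and 126 flags in total, whereas the multi-bit trie with branching factor four needs three levels and 84 flags. In both cases the last level holds 64 flags, so the saving comes entirely from the upper levels.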
To store a string, one bit is set in each level of the trie. In the first level, the flag corresponding to the first literal of the string is set. Then, this tree branch is followed and in the next level the flag corresponding to the second literal is set. This goes on until the last level is reached.
The advantage of storing data in a trie only becomes obvious when data is retrieved. The result is assembled literal by literal while passing through the tree levels. The policies used during this data retrieval vary depending on the application. As mentioned before, this paper focuses on the problem of finding the same or next smallest value compared to a given value, thus the literals are treated as numbers and the retrieval algorithm is as follows: In each level the corresponding input string literal is compared to the values present in the trie and the exact or next smaller match is returned. If in any level a non-exact match occurs, that is, a smaller value than that requested is returned, all subsequent levels have to return their maximum value, otherwise the overall result would not be the next smaller but any smaller value present in the trie.
As an example, the closest match to "11 01 10" in the trie shown in the figure is to be retrieved. First, the initial literal '11' is compared to the present entries in level 1. This comparison returns '11', because the exact match is present in the level. The corresponding tree branch is followed into level two, where another exact match '01' occurs. The result so far is "11 01". Only in the last level is there no entry for '10', so the next smaller value '01' is returned. The final result is "11 01 01".
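The storage and retrieval rules above can be sketched as a small Python model. The class name `FlagTrie`, the flat per-level flag arrays and the stored values in the usage note are assumptions made for illustration. The sketch follows the retrieval rule exactly as described, so it does not backtrack: if a followed branch holds no equal or smaller literal, it returns `None`.

```python
class FlagTrie:
    """Multi-bit trie that stores values as presence flags, one array per level."""

    def __init__(self, levels, bits_per_literal):
        self.levels = levels
        self.bits = bits_per_literal
        self.branching = 1 << bits_per_literal
        # Level n (0-based) holds branching**(n+1) flags, addressed by prefix.
        self.mem = [[0] * (self.branching ** (n + 1)) for n in range(levels)]

    def _literals(self, value):
        shifts = range(self.bits * (self.levels - 1), -1, -self.bits)
        return [(value >> s) & (self.branching - 1) for s in shifts]

    def insert(self, value):
        prefix = 0
        for n, literal in enumerate(self._literals(value)):
            prefix = prefix * self.branching + literal
            self.mem[n][prefix] = 1  # set one flag per level

    def lookup(self, value):
        """Return the stored value equal to, or next smaller than, 'value'."""
        prefix, exact = 0, True
        for n, literal in enumerate(self._literals(value)):
            base = prefix * self.branching
            # After a non-exact match, every later level returns its maximum.
            start = literal if exact else self.branching - 1
            match = next((c for c in range(start, -1, -1)
                          if self.mem[n][base + c]), None)
            if match is None:
                return None  # no equal or smaller entry on this branch
            exact = exact and match == literal
            prefix = base + match
        return prefix
```

With a three-level trie of 2-bit literals holding `0b110101`, `0b110001`, `0b100111` and `0b000010`, a lookup of `0b110110` ("11 01 10") returns `0b110101` ("11 01 01"), matching the worked example.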
For a hardware implementation, the matching operator for all nodes in one level can be shared since only one matching operation at a time is performed in each level. Consequently, each tree level consists of a memory (the corresponding size increases with level number) and a "matcher" (see Figure 4). The structure can easily be pipelined; one possible pipelining scheme is shown in Figure 4. It is also easily scalable either by increasing the branching factor or by adding more tree levels. Since the memory access is the dominant operation, the matcher delay must be close to the memory access delay. For an associative memory system with a fixed word length, the design space is defined by a number of parameters such as branching factor, number of tree levels and number of pipelining stages per matcher. The selected solution depends on the application's demands in terms of throughput and latency.
Matcher Architecture
In most configurations, the matcher performance determines the critical delay. This is due to the sequential nature of the matching operation. A linear search has to be performed to find the next smallest existing entry within a tree level. The basic circuit for performing this type of search is outlined in Figure 5. The linear search is performed by a ripple logic consisting of elements like the one shown in Figure 6. This ripple logic operates along the memory bits that represent the entries of the trie level. A decoder is used to "inject" a logic '1' into the ripple path at the position of the requested decoder input value of word length N. This '1' propagates through the ripple logic until it reaches a memory bit set to '1'. Here, the ripple process is stopped and the corresponding enable line is set to '1'. Finally, the resulting value is encoded in binary format. The critical path of the matcher circuit stretches across the input decoder, the full length ripple path and the output encoder. The deployed ripple logic is very similar to a carry ripple adder and the basic compare element is very similar to a full adder. This similarity allows the definition of "generate" and "propagate" conditions like those for carry chains in adders [7]. For an adder, the corresponding equations define the generate and propagate conditions and the next carry. For the matcher ripple logic, they define the ripple output, equation (6), and the enable signal, equation (7).
The fact that these equations are valid allows most of the theorems developed to accelerate the carry chain in adders to be transferred to the investigated matcher circuit. In order to improve the matcher's performance, the commonly known techniques including carry-look-ahead, block-carry-look-ahead and the combinations of carry-skip with carry-look-ahead and carry-select with carry-look-ahead have been applied and analysed. It should be noted that ripple-chain pipelining can also be deployed in order to reduce the critical path and the subsequent path latency. This feature is considered as future work and is not the focus of this paper.
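The basic ripple search can be modelled behaviourally in Python. The cell equations in the comments are one plausible reading of the description (a cell is active if the decoder injects a '1' or a ripple arrives; it raises its enable if its memory bit is set, otherwise it passes the ripple on). They are an assumption and not necessarily the exact form of the paper's equations. The name `ripple_match` is hypothetical.

```python
def ripple_match(mem, request):
    """Find the highest set memory bit at or below 'request' by rippling.

    mem     -- list of 0/1 flags, index = literal value
    request -- requested literal value (decoder input)
    Returns the encoded position of the enable line, or None if no hit.
    """
    n = len(mem)
    d = [1 if i == request else 0 for i in range(n)]  # one-hot decoder output
    en = [0] * n
    r = 0  # ripple signal entering the topmost cell
    for i in range(n - 1, -1, -1):  # ripple runs towards smaller values
        active = d[i] | r
        en[i] = mem[i] & active      # assumed enable: m_i AND (d_i OR r_i)
        r = (1 - mem[i]) & active    # assumed ripple out: NOT m_i AND (d_i OR r_i)
    return en.index(1) if 1 in en else None  # output encoder
```

In this reading the cell "propagates" when its memory bit is '0' and "generates" when the decoder selects it, which is what makes the carry-chain acceleration techniques applicable.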
Look-Ahead Approach
The underlying idea of this approach is to generate each ripple signal r_i in parallel instead of making it dependent on the previous ripple signal, aiming for a propagation delay complexity of O(1). Equation (7) shows the dependency of the enable signal en_i on the ripple signal r_i, the memory signal m_i and the decoder signal d_i. For parallel generation, equation (6) is expanded so that each r_i is expressed directly in terms of the decoder and memory signals, giving equation (8).
Equation (8) reveals that each r_i can theoretically be obtained in only two logic levels, the outer OR relation and the inner AND relations. However, the number of gate inputs grows linearly with i, which sets constraints on a hardware based implementation. Therefore, the AND and OR functions must be split into multiple gates for higher i, such that the complexity of the propagation delay is not more than O(log i).
Consequently, an increase of i also means a disproportionate increase of logic due to the splitting of gates into multiple gates and gate stages. Such an increase in hardware cost constrains this approach's suitability to small ripple chains.
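The two-level structure can be illustrated by computing every ripple signal directly from the decoder and memory bits and comparing the result with a serial ripple. The sketch assumes the cell behaviour "ripple out = NOT m AND (d OR ripple in)", which is an interpretation of the text rather than the paper's exact equation. All function names are hypothetical.

```python
def one_hot(n, request):
    return [1 if i == request else 0 for i in range(n)]

def ripple_signals(mem, d):
    """Reference: serial ripple from the top cell down to cell 0."""
    n = len(mem)
    r = [0] * n
    carry = 0
    for i in range(n - 1, -1, -1):
        r[i] = carry  # ripple entering cell i
        carry = (1 - mem[i]) & (d[i] | carry)
    return r

def lookahead_signals(mem, d):
    """Each r_i as one OR over AND terms, with no dependence on other r_j."""
    n = len(mem)
    r = []
    for i in range(n):
        # Term j: decoder selects cell j and all memory bits i+1..j are '0'.
        terms = [d[j] and not any(mem[i + 1:j + 1]) for j in range(i + 1, n)]
        r.append(int(any(terms)))
    return r

def enables(mem, d, r):
    return [m & (di | ri) for m, di, ri in zip(mem, d, r)]
```

The inner list for cell i has up to n - i - 1 terms, each with a growing number of inputs, which mirrors the gate fan-in growth discussed above.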
Block Look-Ahead Approach
To address the disadvantages of the look-ahead approach, a hierarchical structure is applied to the basic look-ahead idea. A functional block is introduced (Figure 7) that performs the look-ahead operation only for a restricted number of bits, generating the signals G (group generate) and P (group propagate).
By introducing one or more additional hierarchy levels (see Figure 7), the input count and thus the hardware cost of the look-ahead blocks can be reduced. While the maximum value for i in equation (8) is the length of the matcher M, in equation (9) M_B is only a fraction of this length. The overall hardware cost can be reduced, resulting in a lower propagation delay than the plain look-ahead approach.
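A two-level version of this hierarchy can be sketched as follows. Each block derives a group generate G and group propagate P from its own bits only, a second level combines them into the ripple entering each block, and the final enable is again computed locally. The definitions of G and P used here are assumptions consistent with the ripple behaviour described earlier. The name `block_lookahead_match` is hypothetical.

```python
def block_lookahead_match(mem, request, block_size):
    """Next-smaller match using per-block G/P signals (behavioural sketch)."""
    n = len(mem)
    assert n % block_size == 0, "sketch assumes whole blocks"
    d = [1 if i == request else 0 for i in range(n)]
    blocks = [(lo, lo + block_size) for lo in range(0, n, block_size)]

    # Group propagate: no memory bit set in the block, so a ripple passes through.
    P = [int(not any(mem[lo:hi])) for lo, hi in blocks]
    # Group generate: the decoder hits the block and nothing below the hit stops it.
    G = [int(any(d[j] and not any(mem[lo:j + 1]) for j in range(lo, hi)))
         for lo, hi in blocks]

    # Upper hierarchy level: ripple entering each block from the block above.
    R = [0] * len(blocks)
    for b in range(len(blocks) - 2, -1, -1):
        R[b] = G[b + 1] | (P[b + 1] & R[b + 1])

    # Lower level: look-ahead restricted to the bits of one block.
    for b, (lo, hi) in enumerate(blocks):
        for i in range(lo, hi):
            from_above = R[b] and not any(mem[i + 1:hi])
            local = any(d[j] and not any(mem[i + 1:j + 1])
                        for j in range(i + 1, hi))
            if mem[i] and (d[i] or from_above or local):
                return i  # the enable line is one-hot, so the first hit is the only one
    return None
```

Every expression inside a block now depends on at most `block_size` memory bits plus one incoming ripple signal, which is the input-count reduction the hierarchy is meant to achieve.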
Skip & Look-Ahead Approach
The carry-skip structure for adders offers a good trade-off between delay and area cost. In [8] an optimised skip structure is proposed using a look-ahead technique within the skip blocks and variable block sizes. This approach can also be transferred to the matcher problem. The idea of the skip approach is to group a given number of bits into a block and bypass the ripple (carry) signal for the block if all propagate inputs to the block are '1'. Figure 8 shows the basic skip block, while Figure 9 shows the resulting chain of skip blocks. The block in which the ripple is initiated cannot be bypassed, and bypassing the block where the ripple signal ends is not recommended. However, all blocks between the ripple initiating and ripple ending blocks can be bypassed. Thus, the ripple's maximum path is composed of the first and last block of the skip & look-ahead architecture, bypassing all other blocks. Schulte et al. [8] propose a scheme of optimised variable block sizes. Since the lengths of the blocks other than the first and the last do not contribute to the critical path, they can be increased within given limits. The limitation is that other possible paths through the system and the enlarged blocks cannot become longer than the worst case path described above. By increasing the block sizes, the number of blocks needed for a given total length is reduced. Consequently, the critical path delay is also reduced.
Select & Look-Ahead Approach
Another combination of two acceleration techniques is the combined select & look-ahead approach, as proposed for adders e.g. by Wang et al [9] . In this case the matcher structure can be significantly simplified compared to a carry-select adder. The main difference between an adder carry chain and the matcher ripple chain is that once the ripple signal has "left" the chain, i.e. when it has changed back from '1' to '0', it will not change to '1' again. In this case, the rest of the ripple chain can be ignored in contrast to an adder where another carry could be generated.
The idea is again to divide the ripple chain into blocks but let all blocks calculate their result simultaneously. Within the blocks, the ripple process is again accelerated using a look-ahead technique. The ripple inputs r_in for each block are controlled by the "result control" circuit as shown in Figure 10. This architecture exploits the simpler nature of the matcher problem, compared to the adder problem, by performing a true parallel result calculation. This is only applicable for the matcher problem because the matcher has only a single search path, whereas an adder can have multiple possible carry propagations.
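The parallel block evaluation can be sketched behaviourally as below. The details of the "result control" circuit are not given in the text, so the selection rule used here (take the request block's result if it has one, otherwise the nearest lower block with a set bit) is an assumption, and `select_lookahead_match` is a hypothetical name.

```python
def select_lookahead_match(mem, request, block_size):
    """Next-smaller match where every block computes a candidate at once."""
    n = len(mem)
    assert n % block_size == 0, "sketch assumes whole blocks"
    request_block = request // block_size

    # Step 1: all blocks evaluate independently (conceptually in parallel).
    candidates = []
    for b in range(n // block_size):
        lo, hi = b * block_size, (b + 1) * block_size
        # The request block searches down from the requested position; every
        # other block behaves as if a ripple entered at its top (r_in = 1).
        top = request if b == request_block else hi - 1
        hit = next((i for i in range(top, lo - 1, -1) if mem[i]), None)
        candidates.append(hit)

    # Step 2: result control selects one candidate; blocks above the request
    # block are never considered because the ripple only runs downwards.
    for b in range(request_block, -1, -1):
        if candidates[b] is not None:
            return candidates[b]
    return None
```

Because a ripple that has stopped never restarts, the first block with a candidate decides the result and all lower blocks can be ignored, which is what allows the simplification relative to a carry-select adder.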
Synthesis and Circuit Analysis
All circuit architectures discussed in Section 3 have been designed using VHDL and synthesised using Quartus II [10] for Altera Stratix II FPGA technology. All architectures implement a two-directional matcher function, consisting of two matcher circuits, the first of which searches for the next smallest match and the second for the next highest match. This is necessary to avoid a "nil" return if a smaller match cannot be found. In this case, the next higher match can be returned. All architectures have been scaled over a range of word lengths and tree branching factors. However, not all of the investigated architectures were suited for scaling to all word lengths. In particular, the skip and select approaches require a certain minimum length in order to arrange more than one block along the lines, the look-ahead approach becomes hardware expensive for word lengths beyond 64 bits and the block look-ahead approach is only useful for lengths that are equal to (blocksize) 
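The two-directional behaviour can be summarised in a short Python model. It assumes that the downward result is preferred whenever it exists and that the upward matcher is only used as a fallback, as the text suggests. The function names are hypothetical.

```python
def next_smaller(mem, request):
    """Highest set flag at or below the requested position."""
    return next((i for i in range(request, -1, -1) if mem[i]), None)

def next_higher(mem, request):
    """Lowest set flag above the requested position."""
    return next((i for i in range(request + 1, len(mem)) if mem[i]), None)

def two_directional_match(mem, request):
    """Prefer the equal-or-smaller match; fall back to the next higher one."""
    lower = next_smaller(mem, request)
    if lower is not None:
        return lower
    return next_higher(mem, request)  # None only if no flag is set at all
```

With flags set at positions 2 and 5 of an 8-entry level, a request for 1 has no smaller entry and therefore returns 2, while a request for 4 returns 2 from the downward search.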
Matcher results
The post-layout synthesis results for the matcher architectures are presented in Table 1 and in Figures 11, 12 and 13. Although most delay and area results are very close to each other, they mostly reflect the expected behaviour from design studies carried out for adder designs.
For small word lengths of 4 and 8 bits, the classical ripple carry approach proves to be the best. This is because the simple ripple structure can be mapped onto a reasonably small logic path compared to the more complex look-ahead approaches. For small word lengths the propagation delays and number of Adaptive Look-Up Tables (ALUTs) [...] The improvement gained by introducing the block look-ahead approach can be clearly seen in Figure 12 at 64-bit word length. In this case, the matcher circuit is up to 3 times smaller than the circuit based on the classical look-ahead approach. In terms of speed, the reduced hardware complexity also results in a slightly smaller delay at 64-bit word length.
The combined skip & look-ahead approach shows a further reduction of the area cost by trading off ripple delay performance. For word lengths greater than 32 bits, it is slower than all the other look-ahead based architectures. The best trade-off, in terms of area versus delay performance, is achieved by the select & look-ahead architecture for word lengths greater than 8 bits. It is the most area efficient architecture amongst all look-ahead based approaches, while maintaining the smallest ripple delay. This significant performance advantage is achieved due to the application specific simplification of the architecture and the subsequent reduction of the hardware cost.
Focusing only on M = 64 bits, the AT diagram in Figure 13 shows that the Pareto front (see e.g. [11]) of the most attractive implementation alternatives is determined by the three architectural alternatives ripple cells, select & look-ahead and block look-ahead. For a low number of ALUTs (below 400) the ripple approach dominates all others. For a medium number of ALUTs (500) the combined select & look-ahead approach has the shortest propagation delay. Lowest delays can be 
Synthesis of the lookup trie
Based on the detailed circuit study, the select & look-ahead matcher was chosen to implement the complete lookup trie, as shown in Figure 4. Three different tries for data word lengths of 12 bits, 16 bits and 20 bits have been implemented using the pipeline scheme. For each data word length, different combinations of branching factor and number of trie levels are possible. Due to the constraints set by the application in terms of low latency, branching factors of 8 and 16 were chosen to keep the number of trie levels as low as possible. Theoretically, the critical path of the circuit is determined only by the branching factor. For example, a trie with branching factor 16 has a critical path delay of 5.2 ns, allowing an implementation at f_max = 1 / 5.2 ns = 192 MHz. In reality, this maximum frequency cannot be achieved due to additional routing delays. Experimental synthesis results showed that for a 16-bit trie with four levels and a branching factor of 16, only 154 MHz can be achieved. It is evident that the added routing delay depends on the number of trie levels, as shown in Figure 14. This is due to the fact that the embedded trie memory is implemented using the M4K memory blocks, which are located at the centre of the Altera Stratix II FPGA. As expected, the place and route tool places the matcher circuitry around the central memory. This has major implications for the overall routing delay performance if the number of matcher circuits increases. The more circuits that need to be arranged around the centre, the longer the routing between the matcher circuits and the memory becomes. This is the case for more trie levels and is evident in Figure 14.
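The frequency and level figures quoted above follow from simple arithmetic, reproduced in the Python sketch below. The helper names are hypothetical, and the interpretation of the gap between the theoretical and achieved clock period as routing delay follows the explanation in the text.

```python
import math

def fmax_mhz(critical_path_ns):
    """Maximum clock frequency in MHz for a given critical path in ns."""
    return 1000.0 / critical_path_ns

def period_ns(frequency_mhz):
    """Clock period in ns for a given frequency in MHz."""
    return 1000.0 / frequency_mhz

def trie_levels(word_bits, branching_factor):
    """Number of trie levels when each level consumes log2(branching) bits."""
    bits_per_level = int(math.log2(branching_factor))
    return math.ceil(word_bits / bits_per_level)

theoretical = fmax_mhz(5.2)               # about 192 MHz
extra_delay = period_ns(154.0) - 5.2      # about 1.3 ns beyond the 5.2 ns path
levels_16bit = trie_levels(16, 16)        # 4 levels, one pipeline access each
```

A 5.2 ns critical path corresponds to roughly 192 MHz, while the achieved 154 MHz corresponds to a period of about 6.5 ns, so roughly 1.3 ns per cycle is lost beyond the theoretical matcher path in the four-level, 16-bit configuration.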
Conclusions
In this paper the architecture and implementation of a hardware based sorter circuit, operating a closest match lookup for network processing has been explored. An architecture based on a search trie is used to obtain the closest match of an input value amongst an integral number of lookup table entries. The detailed design study reveals a number of design issues concerning associative memory design for the closest match lookup problem and identifies the matching circuit to be the bottleneck in the architecture. Comparable bottleneck issues for arithmetic circuits, which have been resolved using look-ahead based approaches, have been investigated and translated for application in the matcher circuit design.
Numerous look-ahead architectures have been designed for the matcher circuit, carefully considering different word lengths and branching factors. The post-layout synthesis results for given word lengths have been analysed in terms of scalability, critical path and hardware cost. For word lengths greater than 8 bits, it has been found that the select & look-ahead design achieves the best trade-off in terms of area versus delay performance and presents the most suitable approach for the closest-match lookup problem. The study also reveals that the select & look-ahead design is the most area efficient architecture amongst all look-ahead designs. This is because the architecture could be simplified for the matcher problem.
This novel closest-match lookup circuit, for 16-bit word lengths with four 4-bit trie levels (branching factor 16), has been designed for packet scheduling. The architecture is able to support an operating frequency of up to 154 MHz using standard FPGA technology. In the overall architecture for which it is designed, it is able to retrieve up to 40 million IP packets per second for service.
