
    A Scalable High-Performance Memory-Less IP Address Lookup Engine Suitable for FPGA Implementation

    High-performance IP address lookup is in high demand for modern Internet routers. Many approaches in the literature describe special-purpose Address Lookup Engines (ALEs) for IP address lookup. Existing ALEs can be categorised into three families of techniques: Ternary Content Addressable Memory-based (TCAM-based), trie-based, and TCAM-emulation. TCAM-based techniques are expensive and consume a lot of power, since they employ TCAMs in their architecture. Trie-based techniques have nondeterministic latency and require external memory accesses, since they store the Forwarding Information Base (FIB) in memory using a trie data structure. TCAM-emulation techniques commonly combine TCAMs with lower-cost circuits that handle less time-critical activities.
    In this thesis, the main objective is to propose an ALE architecture with fast search that addresses the main shortcomings of TCAM-based and trie-based techniques. Achieving an admissible throughput in the proposed ALE is a fundamental requirement, given recent improvements in network systems and the growth of the Internet of Things (IoT). For that reason, hardware accelerators have been adopted to achieve high-speed search. In this work, Field-Programmable Gate Arrays (FPGAs), which are specialized reconfigurable hardware accelerators, are chosen as the target platform for the ALE architecture. Five TCAM-emulation ALE architectures are proposed in this thesis: the Full-Serial, the Full-Parallel, the IP-Split, the IP-Split-Bucket, and the Update-enabled IP-Split-Bucket architectures. Each architecture builds on the previous one with progressive improvements. The Full-Serial architecture employs memories to store the FIB and one comparator to perform a serial search over the FIB entries. The Full-Parallel architecture stores the FIB entries in the logic resources of the FPGA and performs a parallel search using one comparator per FIB entry. The IP-Split architecture employs a level of decoders to avoid repetitive comparisons among equivalent FIB entries. The IP-Split-Bucket architecture is an upgraded version of the previous architecture, using a partitioning scheme to optimize the IP-Split architecture. Finally, the Update-enabled IP-Split-Bucket supports IP address lookup with a high update rate. The most efficient proposed architecture is the IP-Split-Bucket, a novel high-performance memory-less ALE. For a real-world FIB with 524k IPv4 prefixes, IP-Split-Bucket achieves a throughput of 103.4M packets per second and consumes respectively 23% and 22% of the Look-Up Tables (LUTs) and Flip-Flops (FFs) of a Xilinx XC7V2000T chip.
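    The thesis implements these searches as hardware comparators; as a rough software analogue (not the author's design), the sketch below models what the Full-Serial scan and the per-entry comparators compute: each FIB entry is a (prefix, length, next hop) rule, and the longest matching prefix wins. The 32-bit IPv4 width and the entry values are illustrative assumptions.

```python
# Software analogue of the ALE search: a serial scan over FIB entries,
# where the Full-Parallel architecture would evaluate every comparator
# at once in hardware.

def matches(addr: int, prefix: int, length: int, width: int = 32) -> bool:
    """True if the top `length` bits of addr equal those of prefix."""
    if length == 0:
        return True          # the default route matches everything
    shift = width - length
    return (addr >> shift) == (prefix >> shift)

def lookup_longest_prefix(addr: int, fib):
    """Return the next hop of the longest matching prefix, or None."""
    best_len, best_hop = -1, None
    for prefix, length, hop in fib:          # serial scan over entries
        if length > best_len and matches(addr, prefix, length):
            best_len, best_hop = length, hop
    return best_hop

fib = [
    (0xC0A80000, 16, "eth0"),   # 192.168.0.0/16
    (0xC0A80100, 24, "eth1"),   # 192.168.1.0/24
    (0x00000000, 0,  "default"),
]
print(lookup_longest_prefix(0xC0A80101, fib))  # the /24 is the longest match
```

    The serial scan costs one comparison per entry per lookup, which is exactly the latency the Full-Parallel and IP-Split architectures remove by evaluating all entries concurrently.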

    Mémoires associatives algorithmiques pour l'opération de recherche du plus long préfixe sur FPGA (Algorithmic associative memories for longest-prefix-match lookup on FPGA)

    FPGAs are becoming ubiquitous in data centers. First introduced to accelerate indexing services and machine learning tasks, FPGAs are now also used to accelerate networking operations, including the Longest Prefix Match (LPM) operation. This operation is used for packet routing and as a building block in programmable data planes. However, for the two use cases considered, the LPM operation is inefficiently implemented on FPGAs. In this thesis, we demonstrate that the performance of the LPM operation can be significantly improved using an algorithmic approach, where the LPM operation is implemented with a data structure. In addition, the results presented in this thesis help answer a broader question: should the FPGA architecture be specialized for networking?
    First, we present the SHIP data structure, tailored to routing IPv6 packets in the Internet. SHIP exploits prefix characteristics to build a compact data structure that can be efficiently mapped to FPGAs. SHIP uses a "divide and conquer" approach to bin prefixes into groups of small cardinality that share similar characteristics. A hybrid trie-tree data structure then encodes the prefixes held in each group, adapting the prefix encoding method to their characteristics. Implemented on FPGAs, the proposed solution improves memory efficiency over state-of-the-art solutions while supporting a packet throughput greater than 100 Gbps.
    While the prefixes and their characteristics are known when routing packets in the Internet, this is not true for programmable data planes. Hence, the second solution, designed for programmable data planes, does not exploit any prior knowledge of the stored prefixes. We present a framework comprising an efficient data structure to encode the prefixes and methods to map that data structure efficiently to FPGAs. First, the framework leverages a B-tree, extended to support the LPM operation, for its low algorithmic complexity. Second, we present a method to allocate at compile time the minimum amount of resources the B-tree can use, independently of the prefix characteristics. Third, the framework selects the B-tree parameters to increase post-implementation memory efficiency and generates the corresponding hardware architecture. Implemented on FPGAs, this solution supports a packet throughput greater than 100 Gbps while improving performance over the state of the art.
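    The "implement LPM with a data structure" idea can be illustrated in software, independent of the thesis's actual SHIP or B-tree layouts. The minimal sketch below bins prefixes by length (loosely echoing the divide-and-conquer grouping) and probes the bins longest-first with hash tables; the 32-bit width and names are assumptions for illustration only.

```python
# Longest-prefix match via per-length hash tables, probed longest-first.
# One dictionary per prefix length; a lookup truncates the address to
# each candidate length and stops at the first (i.e. longest) hit.

def build_tables(prefixes):
    """prefixes: iterable of (prefix_int, length, next_hop) triples."""
    tables = {}
    for prefix, length, hop in prefixes:
        # Key each entry by its significant top bits only.
        tables.setdefault(length, {})[prefix >> (32 - length)] = hop
    return tables

def lpm(addr: int, tables):
    """Probe length bins from longest to shortest; first hit wins."""
    for length in sorted(tables, reverse=True):
        hop = tables[length].get(addr >> (32 - length))
        if hop is not None:
            return hop
    return None

tables = build_tables([
    (0xC0A80000, 16, "A"),      # 192.168.0.0/16
    (0xC0A80100, 24, "B"),      # 192.168.1.0/24
    (0x00000000, 0,  "D"),      # default route
])
```

    This linear probe over lengths is the naive baseline; structures such as SHIP's hybrid trie-tree or an LPM-extended B-tree exist precisely to bound the number of probes and the memory footprint.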

    FPGA-based High Throughput Regular Expression Pattern Matching for Network Intrusion Detection Systems

    Network speeds and bandwidths have improved over time, but the frequency of network attacks and illegal accesses has grown with them. Such attacks are capable of compromising the privacy and confidentiality of network resources belonging to even the most secure networks. Software solutions running on general-purpose processors have become inadequate for detecting network attacks at current network speeds. Hardware-based platforms are designed to cope with rising network speeds measured in several gigabits per second (Gbps) and are capable of detecting several attacks at once; a good candidate is the Field-Programmable Gate Array (FPGA). The FPGA is a hardware platform that can be used to perform deep packet inspection of network packet contents at high speed. As such, this thesis focused on designs implemented with FPGAs, all of which attempt to sustain steady growth in throughput and throughput efficiency. Throughput efficiency is defined as the concurrent throughput of a regular expression matching engine circuit divided by the average number of look-up tables (LUTs) utilised per state of the engine's automata. The implemented FPGA-based design was built upon the concept of equivalence classification, which reduces the overall size of the input tables that drive the various Nondeterministic Finite Automata (NFA) matching engines. Compared with other approaches, the design sustained a throughput of up to 11.48 Gbps, reduced the number of pattern matching engines required by up to 75%, and reduced the overall memory required by about 90% when synthesised on the target FPGA platform.
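    The abstract does not spell out equivalence classification, so here is a hedged software sketch of the general idea: bytes that no pattern's character classes distinguish are merged into one class, so the input table driving the matching engines needs one column per class rather than one per byte. The toy character sets are assumptions, not the thesis's rule set.

```python
# Partition the 256-byte alphabet into equivalence classes: two bytes
# share a class iff they belong to exactly the same subset of the
# patterns' character classes, so the NFA input table shrinks from
# 256 columns to one column per class.

def equivalence_classes(char_sets):
    """Return (class_of, n): class id per byte, and the class count."""
    signatures = {}            # membership signature -> class id
    class_of = [0] * 256
    for b in range(256):
        sig = tuple(b in s for s in char_sets)
        class_of[b] = signatures.setdefault(sig, len(signatures))
    return class_of, len(signatures)

# Two toy "patterns": one matching digits, one matching the bytes of "GET".
sets = [set(b"0123456789"), set(b"GET")]
classes, n = equivalence_classes(sets)
```

    For these two disjoint sets the 256 bytes collapse into just three classes (digits, the letters of "GET", and everything else), which is the table-width reduction the hardware exploits.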

    Ant colony optimization on runtime reconfigurable architectures


    Acceleration for the many, not the few

    Although specialized hardware promises orders-of-magnitude performance gains, its uptake has been limited by how challenging it is to program. Hardware accelerators present challenges programmers are not used to, exposing details of the hardware that are often hidden and requiring new programming styles to be used effectively. Existing programming models often involve learning complex and hardware-specific APIs, using Domain Specific Languages (DSLs), or programming in customized assembly languages. These programming models for hardware accelerators present a significant barrier to uptake: a steep, unforgiving, and untransferable learning curve. However, programming hardware accelerators using traditional programming models presents its own challenge: mapping code not written with hardware accelerators in mind onto accelerators with restricted behaviour. This thesis frames these challenges in the context of the acceleration equation and presents solutions in three different contexts: regular expression accelerators, API-programmable accelerators (with Fourier transforms as a key case study), and heterogeneous coarse-grained reconfigurable arrays (CGRAs). This thesis shows that automatically morphing software written in traditional manners to fit hardware accelerators is possible with no programmer effort, and that huge potential speedups are available.

    The 1991 3rd NASA Symposium on VLSI Design

    Papers from the symposium are presented from the following sessions: (1) featured presentations 1; (2) very large scale integration (VLSI) circuit design; (3) VLSI architecture 1; (4) featured presentations 2; (5) neural networks; (6) VLSI architectures 2; (7) featured presentations 3; (8) verification 1; (9) analog design; (10) verification 2; (11) design innovations 1; (12) asynchronous design; and (13) design innovations 2.

    High Performance IP Lookup on FPGA with Combined Length-Infix Pipelined Search

    We propose a combined length-infix pipelined search (CLIPS) architecture for high-performance IP lookup on FPGA. By performing binary search in prefix length, CLIPS can find the longest prefix match in (log L - c) phases, where L is the IP address length (32 for IPv4) and c > 0 is a small design constant (c = 2 in our prototype design). Each CLIPS phase matches one or more input infixes of the same length against a regular data structure. The various CLIPS phases can be optimized individually: (1) 16 bits of the IP address are used to direct-access a 288-kbit on-chip BRAM in phase 1; (2) 8 additional bits of the IP address are used to search a 1.5-million-entry pipelined dynamic search forest for a match in phase 2; (3) 1 to 8 additional bits of the IP address are used by a 2-stage TreeBitmap storing another 1 to 8 million routing prefixes in the tail phase. Post place-and-route results show that our CLIPS prototype, utilizing 28 Mbits of on-chip BRAM and 4 external SRAM channels, sustains 312 MPPS IPv4 lookup (or 160 Gbps routing throughput with 64-byte packets) against 9.5 million prefixes on a state-of-the-art FPGA.
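    Phase 1's direct-access idea can be modelled in software: every prefix of length at most 16 is expanded into a 2^16-entry table indexed by the top 16 address bits, so one memory access resolves all short prefixes. This is a sketch under that assumption only; the real design's BRAM layout, later phases, and field names are not reproduced here.

```python
# Model of a direct-index first phase: expand prefixes of length <= 16
# into a 2^16 table keyed by the top 16 bits of the IPv4 address.
# Writing shorter prefixes first lets longer ones overwrite them, so the
# table itself encodes longest-prefix-match among the short prefixes.

def build_phase1(prefixes):
    """prefixes: iterable of (prefix_int, length, next_hop) triples."""
    table = [None] * (1 << 16)
    short = [p for p in prefixes if p[1] <= 16]   # longer ones: later phases
    for prefix, length, hop in sorted(short, key=lambda p: p[1]):
        stride = 16 - length                       # entries this prefix covers
        start = (prefix >> (32 - length)) << stride
        for i in range(start, start + (1 << stride)):
            table[i] = hop                         # longer prefixes overwrite
    return table

def phase1_lookup(addr: int, table):
    """One direct access with the top 16 bits of the address."""
    return table[addr >> 16]

table = build_phase1([
    (0xC0A80000, 16, "A"),     # 192.168.0.0/16
    (0xC0000000, 8,  "B"),     # 192.0.0.0/8
    (0x00000000, 0,  "D"),     # default route
])
```

    The trade-off is classic: the table trades 2^16 entries of memory for a constant-time first phase, which is why the remaining, longer prefixes are pushed into the deeper pipelined structures described in the abstract.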