
    A Scalable High-Performance Memory-Less IP Address Lookup Engine Suitable for FPGA Implementation

    High-performance IP address lookup is in high demand for modern Internet routers. Many approaches in the literature describe special-purpose Address Lookup Engines (ALEs) for IP address lookup. Existing ALEs can be categorised into three families of techniques: Ternary Content Addressable Memory-based (TCAM-based), trie-based, and TCAM-emulation. TCAM-based techniques are expensive and consume a lot of power, since they employ TCAMs in their architecture. Trie-based techniques have nondeterministic latency and require external memory accesses, since they store the Forwarding Information Base (FIB) in memory using a trie data structure. TCAM-emulation techniques commonly combine TCAMs with lower-cost circuits that handle less time-critical activities.
    In this thesis, the main objective is to propose an ALE architecture with fast search that addresses the main shortcomings of TCAM-based and trie-based techniques. Achieving an admissible throughput in the proposed ALE is a fundamental requirement, given recent improvements in network systems and the growth of the Internet of Things (IoT). For that reason, hardware accelerators have been adopted to achieve high-speed search. In this work, Field-Programmable Gate Arrays (FPGAs), which are specialized reconfigurable hardware accelerators, are chosen as the target platform for the ALE architecture. Five TCAM-emulation ALE architectures are proposed in this thesis: the Full-Serial, the Full-Parallel, the IP-Split, the IP-Split-Bucket, and the Update-enabled IP-Split-Bucket architectures. Each architecture builds on the previous one with progressive improvements. The Full-Serial architecture employs memories to store the FIB and one comparator to perform a serial search over the FIB entries. The Full-Parallel architecture stores the FIB entries in the logic resources of the FPGA and performs a parallel search using one comparator per FIB entry. The IP-Split architecture employs a level of decoders to avoid repetitive comparisons among equivalent FIB entries. The IP-Split-Bucket architecture is an upgraded version of the previous architecture, using a partitioning scheme to optimize the IP-Split architecture. Finally, the Update-enabled IP-Split-Bucket supports IP address lookup with a high update rate. The most efficient proposed architecture is the IP-Split-Bucket, a novel high-performance memory-less ALE. For a real-world FIB with 524k IPv4 prefixes, IP-Split-Bucket achieves a throughput of 103.4M packets per second and consumes respectively 23% and 22% of the Look-Up Tables (LUTs) and Flip-Flops (FFs) of a Xilinx XC7V2000T chip.
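    The thesis implements these searches as hardware comparators; as a rough software analogue (not the author's design), the sketch below models what the Full-Serial scan and the per-entry comparators compute: each FIB entry is a (prefix, length, next hop) rule, and the longest matching prefix wins. The 32-bit IPv4 width and the entry values are illustrative assumptions.

```python
# Software analogue of the ALE search: a serial scan over FIB entries,
# where the Full-Parallel architecture would evaluate every comparator
# at once in hardware.

def matches(addr: int, prefix: int, length: int, width: int = 32) -> bool:
    """True if the top `length` bits of addr equal those of prefix."""
    if length == 0:
        return True          # the default route matches everything
    shift = width - length
    return (addr >> shift) == (prefix >> shift)

def lookup_longest_prefix(addr: int, fib):
    """Return the next hop of the longest matching prefix, or None."""
    best_len, best_hop = -1, None
    for prefix, length, hop in fib:          # serial scan over entries
        if length > best_len and matches(addr, prefix, length):
            best_len, best_hop = length, hop
    return best_hop

fib = [
    (0xC0A80000, 16, "eth0"),   # 192.168.0.0/16
    (0xC0A80100, 24, "eth1"),   # 192.168.1.0/24
    (0x00000000, 0,  "default"),
]
print(lookup_longest_prefix(0xC0A80101, fib))  # the /24 is the longest match
```

    The serial scan costs one comparison per entry per lookup, which is exactly the latency the Full-Parallel and IP-Split architectures remove by evaluating all entries concurrently.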

    Mémoires associatives algorithmiques pour l'opération de recherche du plus long préfixe sur FPGA (Algorithmic associative memories for longest-prefix-match lookup on FPGA)

    FPGAs are becoming ubiquitous in data centers. First introduced to accelerate indexing services and machine learning tasks, FPGAs are now also used to accelerate networking operations, including the Longest Prefix Match (LPM) operation. This operation is used for packet routing and as a building block in programmable data planes. However, for the two use cases considered, the LPM operation is inefficiently implemented on FPGAs. In this thesis, we demonstrate that the performance of the LPM operation can be significantly improved using an algorithmic approach, where the LPM operation is implemented with a data structure. In addition, the results presented in this thesis help answer a broader question: should the FPGA architecture be specialized for networking?
    First, we present the SHIP data structure, tailored to routing IPv6 packets in the Internet. SHIP exploits prefix characteristics to build a compact data structure that can be efficiently mapped to FPGAs. SHIP uses a "divide and conquer" approach to bin prefixes into groups of small cardinality that share similar characteristics. A hybrid trie-tree data structure then encodes the prefixes held in each group, adapting the prefix encoding method to their characteristics. Implemented on FPGAs, the proposed solution improves memory efficiency over state-of-the-art solutions while supporting a packet throughput greater than 100 Gbps.
    While the prefixes and their characteristics are known when routing packets in the Internet, this is not true for programmable data planes. Hence, the second solution, designed for programmable data planes, does not exploit any prior knowledge of the stored prefixes. We present a framework comprising an efficient data structure to encode the prefixes and methods to map that data structure efficiently to FPGAs. First, the framework leverages a B-tree, extended to support the LPM operation, for its low algorithmic complexity. Second, we present a method to allocate at compile time the minimum amount of resources the B-tree can use, independently of the prefix characteristics. Third, the framework selects the B-tree parameters to increase post-implementation memory efficiency and generates the corresponding hardware architecture. Implemented on FPGAs, this solution supports a packet throughput greater than 100 Gbps while improving performance over the state of the art.
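    The "implement LPM with a data structure" idea can be illustrated in software, independent of the thesis's actual SHIP or B-tree layouts. The minimal sketch below bins prefixes by length (loosely echoing the divide-and-conquer grouping) and probes the bins longest-first with hash tables; the 32-bit width and names are assumptions for illustration only.

```python
# Longest-prefix match via per-length hash tables, probed longest-first.
# One dictionary per prefix length; a lookup truncates the address to
# each candidate length and stops at the first (i.e. longest) hit.

def build_tables(prefixes):
    """prefixes: iterable of (prefix_int, length, next_hop) triples."""
    tables = {}
    for prefix, length, hop in prefixes:
        # Key each entry by its significant top bits only.
        tables.setdefault(length, {})[prefix >> (32 - length)] = hop
    return tables

def lpm(addr: int, tables):
    """Probe length bins from longest to shortest; first hit wins."""
    for length in sorted(tables, reverse=True):
        hop = tables[length].get(addr >> (32 - length))
        if hop is not None:
            return hop
    return None

tables = build_tables([
    (0xC0A80000, 16, "A"),      # 192.168.0.0/16
    (0xC0A80100, 24, "B"),      # 192.168.1.0/24
    (0x00000000, 0,  "D"),      # default route
])
```

    This linear probe over lengths is the naive baseline; structures such as SHIP's hybrid trie-tree or an LPM-extended B-tree exist precisely to bound the number of probes and the memory footprint.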

    FPGA-based High Throughput Regular Expression Pattern Matching for Network Intrusion Detection Systems

    Network speeds and bandwidths have improved over time, but the frequency of network attacks and illegal accesses has grown with them. Such attacks are capable of compromising the privacy and confidentiality of network resources belonging to even the most secure networks. Software solutions running on general-purpose processors have become inadequate for detecting network attacks at current network speeds. Hardware-based platforms are designed to cope with rising network speeds measured in several gigabits per second (Gbps) and are capable of detecting several attacks at once; a good candidate is the Field-Programmable Gate Array (FPGA). The FPGA is a hardware platform that can be used to perform deep packet inspection of network packet contents at high speed. As such, this thesis focused on designs implemented with FPGAs, all of which attempt to sustain steady growth in throughput and throughput efficiency. Throughput efficiency is defined as the concurrent throughput of a regular expression matching engine circuit divided by the average number of look-up tables (LUTs) utilised per state of the engine's automata. The implemented FPGA-based design was built upon the concept of equivalence classification, which reduces the overall size of the input tables that drive the various Nondeterministic Finite Automata (NFA) matching engines. Compared with other approaches, the design sustained a throughput of up to 11.48 Gbps, reduced the number of pattern matching engines required by up to 75%, and reduced the overall memory required by about 90% when synthesised on the target FPGA platform.
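    The abstract does not spell out equivalence classification, so here is a hedged software sketch of the general idea: bytes that no pattern's character classes distinguish are merged into one class, so the input table driving the matching engines needs one column per class rather than one per byte. The toy character sets are assumptions, not the thesis's rule set.

```python
# Partition the 256-byte alphabet into equivalence classes: two bytes
# share a class iff they belong to exactly the same subset of the
# patterns' character classes, so the NFA input table shrinks from
# 256 columns to one column per class.

def equivalence_classes(char_sets):
    """Return (class_of, n): class id per byte, and the class count."""
    signatures = {}            # membership signature -> class id
    class_of = [0] * 256
    for b in range(256):
        sig = tuple(b in s for s in char_sets)
        class_of[b] = signatures.setdefault(sig, len(signatures))
    return class_of, len(signatures)

# Two toy "patterns": one matching digits, one matching the bytes of "GET".
sets = [set(b"0123456789"), set(b"GET")]
classes, n = equivalence_classes(sets)
```

    For these two disjoint sets the 256 bytes collapse into just three classes (digits, the letters of "GET", and everything else), which is the table-width reduction the hardware exploits.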

    Ant colony optimization on runtime reconfigurable architectures


    Acceleration for the many, not the few

    Although specialized hardware promises orders-of-magnitude performance gains, its uptake has been limited by how challenging it is to program. Hardware accelerators present challenges programmers are not used to, exposing details of the hardware that are often hidden and requiring new programming styles to be used effectively. Existing programming models often involve learning complex and hardware-specific APIs, using Domain Specific Languages (DSLs), or programming in customized assembly languages. These programming models for hardware accelerators present a significant barrier to uptake: a steep, unforgiving, and untransferable learning curve. However, programming hardware accelerators using traditional programming models presents its own challenge: mapping code not written with hardware accelerators in mind onto accelerators with restricted behaviour. This thesis frames these challenges in the context of the acceleration equation and presents solutions in three different contexts: regular expression accelerators, API-programmable accelerators (with Fourier transforms as a key case study), and heterogeneous coarse-grained reconfigurable arrays (CGRAs). This thesis shows that automatically morphing software written in traditional manners to fit hardware accelerators is possible with no programmer effort, and that huge potential speedups are available.

    The 1991 3rd NASA Symposium on VLSI Design

    Papers from the symposium are presented from the following sessions: (1) featured presentations 1; (2) very large scale integration (VLSI) circuit design; (3) VLSI architecture 1; (4) featured presentations 2; (5) neural networks; (6) VLSI architectures 2; (7) featured presentations 3; (8) verification 1; (9) analog design; (10) verification 2; (11) design innovations 1; (12) asynchronous design; and (13) design innovations 2.

    High Performance IP Lookup on FPGA with Combined Length-Infix Pipelined Search

    We propose a combined length-infix pipelined search (CLIPS) architecture for high-performance IP lookup on FPGA. By performing binary search in prefix length, CLIPS can find the longest prefix match in (log L - c) phases, where L is the IP address length (32 for IPv4) and c > 0 is a small design constant (c = 2 in our prototype design). Each CLIPS phase matches one or more input infixes of the same length against a regular data structure. The various CLIPS phases can be optimized individually: (1) 16 bits of the IP address are used to direct-access a 288-kbit on-chip BRAM in phase 1; (2) 8 additional bits of the IP address are used to search a 1.5-million-entry pipelined dynamic search forest for a match in phase 2; (3) 1 to 8 additional bits of the IP address are used by a 2-stage TreeBitmap storing another 1 to 8 million routing prefixes in the tail phase. Post place-and-route results show that our CLIPS prototype, utilizing 28 Mbits of on-chip BRAM and 4 external SRAM channels, sustains 312 MPPS IPv4 lookup (or 160 Gbps routing throughput with 64-byte packets) against 9.5 million prefixes on a state-of-the-art FPGA.
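    Phase 1's direct-access idea can be modelled in software: every prefix of length at most 16 is expanded into a 2^16-entry table indexed by the top 16 address bits, so one memory access resolves all short prefixes. This is a sketch under that assumption only; the real design's BRAM layout, later phases, and field names are not reproduced here.

```python
# Model of a direct-index first phase: expand prefixes of length <= 16
# into a 2^16 table keyed by the top 16 bits of the IPv4 address.
# Writing shorter prefixes first lets longer ones overwrite them, so the
# table itself encodes longest-prefix-match among the short prefixes.

def build_phase1(prefixes):
    """prefixes: iterable of (prefix_int, length, next_hop) triples."""
    table = [None] * (1 << 16)
    short = [p for p in prefixes if p[1] <= 16]   # longer ones: later phases
    for prefix, length, hop in sorted(short, key=lambda p: p[1]):
        stride = 16 - length                       # entries this prefix covers
        start = (prefix >> (32 - length)) << stride
        for i in range(start, start + (1 << stride)):
            table[i] = hop                         # longer prefixes overwrite
    return table

def phase1_lookup(addr: int, table):
    """One direct access with the top 16 bits of the address."""
    return table[addr >> 16]

table = build_phase1([
    (0xC0A80000, 16, "A"),     # 192.168.0.0/16
    (0xC0000000, 8,  "B"),     # 192.0.0.0/8
    (0x00000000, 0,  "D"),     # default route
])
```

    The trade-off is classic: the table trades 2^16 entries of memory for a constant-time first phase, which is why the remaining, longer prefixes are pushed into the deeper pipelined structures described in the abstract.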