73 research outputs found

    Reconfigurable Data Planes for Scalable Network Virtualization

    Abstract: Network virtualization presents a powerful approach to share physical network infrastructure among multiple virtual networks. Recent advances in network virtualization advocate the use of field-programmable gate arrays (FPGAs) as flexible, high-performance alternatives to conventional host virtualization techniques. However, the limited on-chip logic and memory resources in FPGAs severely restrict the scalability of the virtualization platform and necessitate the implementation of efficient forwarding structures in hardware. The research described in this manuscript explores the implementation of a scalable heterogeneous network virtualization platform which integrates virtual data planes implemented in FPGAs with software data planes created using host virtualization techniques. The system exploits data plane heterogeneity to cater to the dynamic service requirements of virtual networks by migrating networks between software and hardware data planes. We demonstrate data plane migration as an effective technique to limit the impact of traffic on unmodified data planes during FPGA reconfiguration. Our system implements forwarding tables in a shared fashion using inexpensive off-chip memories and supports both Internet Protocol (IP) and non-IP based data planes. Experimental results show that FPGA-based data planes can offer two orders of magnitude better throughput than their software counterparts and that FPGA reconfiguration can facilitate data plane customization within 12 seconds. An integrated system that supports up to 15 virtual networks has been validated on the NetFPGA platform.

    A Scalable High-Performance Memory-Less IP Address Lookup Engine Suitable for FPGA Implementation

    Abstract: High-performance IP address lookup is in high demand for modern Internet routers. Many approaches in the literature describe special-purpose Address Lookup Engines (ALEs) for IP address lookup. Existing ALEs fall into three categories: Ternary Content Addressable Memory (TCAM)-based, trie-based, and TCAM-emulation techniques. TCAM-based techniques are expensive and consume a lot of power, since they employ TCAMs in their architecture. Trie-based techniques have nondeterministic latency and generally require external memory accesses, since they store the Forwarding Information Base (FIB) in memory using a trie data structure. TCAM-emulation techniques commonly combine TCAMs with lower-cost circuits that handle less time-critical activities. The main objective of this thesis is to propose an ALE architecture with fast search that addresses the main shortcomings of TCAM-based and trie-based techniques. Achieving adequate throughput is a fundamental requirement for the proposed ALE, given recent improvements in network systems and the growth of the Internet of Things (IoT). To that end, hardware accelerators have been adopted to achieve high-speed search. Field Programmable Gate Arrays (FPGAs), which allow specialized reconfigurable hardware accelerators to be implemented, are chosen as the target platform for the ALE architecture. Five TCAM-emulation ALE architectures are proposed in this thesis: the Full-Serial, the Full-Parallel, the IP-Split, the IP-Split-Bucket and the Update-enabled IP-Split-Bucket architectures. Each architecture builds on the previous one with progressive improvements. The Full-Serial architecture employs memories to store the FIB and one comparator to perform a serial search on the FIB entries. The Full-Parallel architecture stores the FIB entries in the logic resources of the FPGA and performs a parallel search using one comparator for each FIB entry. The IP-Split architecture employs a level of decoders to avoid repetitive comparisons in equivalent entries of the FIB. The IP-Split-Bucket architecture is an upgraded version of the IP-Split architecture that uses a partitioning scheme to optimize it. Finally, the Update-enabled IP-Split-Bucket supports both updates and high-speed IP address lookup. The most efficient proposed architecture is the IP-Split-Bucket, a novel high-performance memory-less ALE. For a real-world FIB with 524 k IPv4 prefixes, IP-Split-Bucket achieves a throughput of 103.4 M packets per second and consumes 23% and 22%, respectively, of the Look-Up Tables (LUTs) and Flip-Flops (FFs) of a Xilinx XC7V2000T chip.
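
The Full-Serial idea described in this abstract (a stored FIB scanned by a single comparator) can be sketched in software as a longest-prefix match. This is only an illustrative model, not the thesis's hardware design: the names `make_fib` and `serial_lookup`, the tuple layout, and the sample routes are all assumptions made for this example.

```python
import ipaddress

def make_fib(routes):
    """Build a FIB as (network, mask, prefix_len, next_hop) tuples.

    `routes` maps CIDR strings to next-hop labels (hypothetical sample data).
    """
    fib = []
    for cidr, hop in routes.items():
        net = ipaddress.ip_network(cidr)
        fib.append((int(net.network_address), int(net.netmask), net.prefixlen, hop))
    return fib

def serial_lookup(fib, address):
    """Scan every FIB entry with one 'comparator' and keep the longest match."""
    addr = int(ipaddress.ip_address(address))
    best_len, best_hop = -1, None
    for net, mask, prefix_len, hop in fib:
        # An entry matches when the masked address equals the stored prefix.
        if (addr & mask) == net and prefix_len > best_len:
            best_len, best_hop = prefix_len, hop
    return best_hop

FIB = make_fib({
    "0.0.0.0/0": "default",
    "10.0.0.0/8": "port1",
    "10.1.0.0/16": "port2",
})

print(serial_lookup(FIB, "10.1.2.3"))
print(serial_lookup(FIB, "10.2.0.1"))
```

In this model the cost of a lookup grows with the number of FIB entries. The Full-Parallel variant described in the abstract removes that loop by giving each entry its own comparator, which a sequential program can only imitate.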

    Towards Terabit Carrier Ethernet and Energy Efficient Optical Transport Networks

    A null convention logic based platform for high speed low energy IP packet forwarding

    By 2020, it is predicted that there will be over 5 billion people and 38.5 billion Internet-of-Things devices on the Internet. The data generated by all these users and devices will have to be transported quickly and efficiently. Routers forming the backbone of this Internet already support multiple 100 Gbps ports, meaning that they would have to perform upwards of 200 million destination address lookups per second in the packet forwarding block that lies in the router 'data-path'. At the same time, there is also a huge demand to make the network infrastructure more energy efficient. The work presented in this thesis is motivated by the observation that traditional synchronous digital systems will have increasing difficulty keeping up with these conflicting demands. Further, with reducing device geometries, extremes in "process, voltage and temperature" (PVT) variability will undermine reliable synchronous operation. It is expected that asynchronous design techniques will be able to overcome many of these problems and offer a means of lowering energy while maintaining high throughput and low latency. This thesis reviews existing address lookup algorithms and explores the possibility of combining various approaches to improve energy efficiency without affecting lookup performance. A quasi delay-insensitive asynchronous methodology, Null Convention Logic (NCL), is then applied to this combined design. Techniques that take advantage of the characteristics of the design methodology and the lookup algorithm to further improve the area, energy and latency characteristics are also analysed. The IP address lookup scheme utilised here is a recent algorithmic approach that uses compact binary tries and was selected for its high memory efficiency and throughput. The design is pipelined, and the prefix information is stored in large RAMs. A Boolean synchronous implementation of the algorithm is simulated to provide an initial performance benchmark. It is observed that during the address lookup process nearly 68% of the trie accesses are to nodes that contain no prefix information. Bloom filter structures that use non-cryptographic hashes and single-bit memory are introduced into the address lookup process to prevent these unnecessary accesses, thereby reducing the energy consumption. Three non-cryptographic hashing algorithms (CRC32, Jenkins and Murmur) are also analysed for their suitability in Bloom filters, and CRC32 is found to offer the most suitable trade-off between complexity and performance. As a first step to applying the NCL design methodology, NCL implementations of the hashing algorithms are created and evaluated. A significant finding from these experiments is that, unlike Boolean systems, latency and throughput in NCL systems are only loosely coupled. An example Jenkins hash implementation with eight pipeline stages and a cycle time of 3.2 ns exhibits a total latency of 6 ns, whereas an equivalent synchronous implementation with a similar clock period exhibits a latency of 25.6 ns. Further investigations reveal that completion detection circuits within the NCL pipelines impair throughput significantly. Two enhancements to the NCL circuit library aimed particularly at optimising NCL completion detection are proposed and analysed. These are shown to enable completion detection circuits to be built with the same delay but with 30% smaller area and about 75% lower peak current compared to the conventional approach using gates from the standard NCL library. An NCL SRAM structure is also proposed to augment the conventional 6-T cell array with circuits to generate the handshaking signals for managing the NCL data flow. Additionally, a dedicated column of cells called the Null-storage column is added, which indicates if a particular address in the RAM stores no Data, i.e., it is in its Null state. This additional hardware imposes a small area overhead of about 10% but allows accesses to Null locations to be completed in 50% less time and consume 40% less energy than accesses to valid Data locations. An experimental NCL-based address lookup system is then designed that includes all of the developed NCL modules. Statistical delay models derived from circuit-level simulations of individual modules are used to emulate realistic circuit delay variability in the behavioural modules written in Verilog. Simulations of the assembled system demonstrate that, unlike what was observed with the synchronous design, the NCL design that does not employ Bloom filters, but only the Null-storage column RAMs for prefix storage, exhibits the smallest area on the chip and also consumes the least energy per address lookup. It is concluded that, to derive maximum benefit from an asynchronous design approach, it is necessary to carefully select the architectural blocks that combine the peculiarities of the implemented algorithm with the capabilities of the NCL design methodology.
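
The latency figures quoted for the Jenkins hash example can be checked with a few lines of arithmetic. The sketch below assumes that a synchronous pipeline's latency is simply the number of stages times the clock period and that either pipeline accepts one new input per cycle. The function names are made up for this example, and only the three constants come from the abstract.

```python
STAGES = 8            # pipeline stages, as stated in the abstract
CYCLE_NS = 3.2        # cycle time / clock period in ns, as stated in the abstract
NCL_LATENCY_NS = 6.0  # reported end-to-end latency of the NCL pipeline in ns

def synchronous_latency_ns(stages, clock_period_ns):
    """Data advances one stage per clock edge, so latency = stages * period."""
    return stages * clock_period_ns

def throughput_per_second(cycle_ns):
    """Assumes one result per cycle once the pipeline is full."""
    return 1e9 / cycle_ns

sync_latency = synchronous_latency_ns(STAGES, CYCLE_NS)
print(f"synchronous latency: {sync_latency:.1f} ns")
print(f"NCL latency:         {NCL_LATENCY_NS:.1f} ns")
print(f"latency ratio:       {sync_latency / NCL_LATENCY_NS:.2f}x")
print(f"throughput (both):   {throughput_per_second(CYCLE_NS) / 1e6:.1f} M results/s")
```

Under these assumptions both designs deliver the same throughput, yet the synchronous latency works out to the 25.6 ns quoted above, while the NCL pipeline is reported at 6 ns. This is the sense in which latency and throughput are only loosely coupled in NCL.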

    Algorithms and Architectures for Network Search Processors

    The continuous growth in the Internet's size, the amount of data traffic, and the complexity of processing this traffic gives rise to new challenges in building high-performance network devices. One of the most fundamental tasks performed by these devices is searching the network data for predefined keys. Address lookup, packet classification, and deep packet inspection are some of the operations which involve table lookups and searching. These operations are typically part of the packet forwarding mechanism, and can create a performance bottleneck. Therefore, fast and resource efficient algorithms are required. One of the most commonly used techniques for such searching operations is the Ternary Content Addressable Memory (TCAM). While TCAM can offer very fast search speeds, it is costly and consumes a large amount of power. Hence, designing cost-effective, power-efficient, and high-speed search techniques has received a great deal of attention in the research and industrial community. In this thesis, we propose a generic search technique based on Bloom filters. A Bloom filter is a randomized data structure used to represent a set of bit-strings compactly and support set membership queries. We demonstrate techniques to convert the search process into table lookups. The resulting table data structures are kept in the off-chip memory and their Bloom filter representations are kept in the on-chip memory. An item needs to be looked up in the off-chip table only when it is found in the on-chip Bloom filters. By filtering the off-chip memory accesses in this fashion, the search operations can be significantly accelerated. Our approach involves a unique combination of algorithmic and architectural techniques that outperform some of the current techniques in terms of cost-effectiveness, speed, and power-efficiency.
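
The filtering step described in this abstract, where the slow table is consulted only when the fast Bloom filter reports a possible hit, can be sketched as follows. This is a minimal software model under stated assumptions and not the thesis's architecture. The class names, the filter size, the number of hash functions, and the use of salted BLAKE2b as the hash family are all illustrative choices, and a plain dictionary stands in for off-chip memory.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter; models the on-chip structure."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive independent hash functions by salting BLAKE2b (an assumption).
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(key, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))

class FilteredTable:
    """Table whose slow ('off-chip') reads are gated by the Bloom filter."""

    def __init__(self):
        self.on_chip = BloomFilter()
        self.off_chip = {}       # stand-in for the off-chip table
        self.off_chip_reads = 0  # counts how often the slow path is taken

    def insert(self, key, value):
        self.on_chip.add(key)
        self.off_chip[key] = value

    def lookup(self, key):
        if not self.on_chip.might_contain(key):
            return None          # filtered out: no off-chip access needed
        self.off_chip_reads += 1
        return self.off_chip.get(key)

table = FilteredTable()
table.insert(b"10.1.0.0/16", "port2")
print(table.lookup(b"10.1.0.0/16"), table.off_chip_reads)
```

A Bloom filter never yields false negatives, so a stored key always reaches the table. A false positive costs only one wasted off-chip read, because the table itself returns the correct (empty) result.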