193 research outputs found
Ternary content addressable memory for longest prefix matching based on random access memory on field programmable gate array
Conventional ternary content addressable memory (TCAM) provides access to stored data, which consists of '0', '1' and âdon't careâ, and outputs the matched address. Content lookup in TCAM can be done in a single cycle, which makes it very important in applications such as address lookup and deep-packet inspection. This paper proposes an improved TCAM architecture with fast update functionality. To support longest prefix matching (LPM), LPM logic are needed to the proposed TCAM. The latency of the proposed LPM logic is dependent on the number of matching addresses in address prefix comparison. In order to improve the throughput, parallel LPM logic is added to improve the throughput by 10Ă compared to the one without. Although with resource overhead, the cost of throughput per bit is less as compared to the one without parallel LPM logic
On using content addressable memory for packet classiïŹcation
Packet switched networks such as the Internet require packet classiïŹcation at every hop in order to ap-ply services and security policies to trafïŹc ïŹows. The relentless increase in link speeds and trafïŹc volume imposes astringent constraints on packet classiïŹcation solutions. Ternary Content Addressable Memory (TCAM) devices are favored by most network component and equipment vendors due to the fast and de-terministic lookup performance afforded by their use of massive parallelism. While able to keep up with high speed links, TCAMs suffer from exorbitant power consumption, poor scalability to longer search keys and larger ïŹlter sets, and inefïŹcient support of multiple matches. The research community has responded with algorithms that seek to meet the lookup rate constraint with greater efïŹciency through the use of com-modity Random Access Memory (RAM) technology. The most promising algorithms efïŹciently achieve high lookup rates by leveraging the statistical structure of real ïŹlter sets. Due to their dependence on ïŹlter set characteristics, it is difïŹcult to provision processing and memory resources for implementations that support a wide variety of ïŹlter sets. We show how several algorithmic advances may be leveraged to im-prove the efïŹciency, scalability, incremental update and multiple match performance of CAM-based packet classiïŹcation techniques without degrading the lookup performance. Our approach, Label Encoded Content Addressable Memory (LECAM), represents a hybrid technique that utilizes decomposition, label encoding, and a novel Content Addressable Memory (CAM) architecture. By reducing the number of implementation parameters, LECAM provides a vehicle to carry several of the recent algorithmic advances into practice. We provide a thorough overview of CAM technologies and packet classiïŹcation algorithms, along with a detailed discussion of the scaling issues that arise with longer search keys and larger ïŹlter sets. We also provide a comparative analysis of LECAM and standard TCAM using a collection of real and synthetic ïŹlter sets of various sizes and compositions
Bridging the Gap: FPGAs as Programmable Switches
The emergence of P4, a domain specific language, coupled to PISA, a domain
specific architecture, is revolutionizing the networking field. P4 allows to
describe how packets are processed by a programmable data plane, spanning ASICs
and CPUs, implementing PISA. Because the processing flexibility can be limited
on ASICs, while the CPUs performance for networking tasks lag behind, recent
works have proposed to implement PISA on FPGAs. However, little effort has been
dedicated to analyze whether FPGAs are good candidates to implement PISA. In
this work, we take a step back and evaluate the micro-architecture efficiency
of various PISA blocks. We demonstrate, supported by a theoretical and
experimental analysis, that the performance of a few PISA blocks is severely
limited by the current FPGA architectures. Specifically, we show that match
tables and programmable packet schedulers represent the main performance
bottlenecks for FPGA-based programmable switches. Thus, we explore two avenues
to alleviate these shortcomings. First, we identify network applications well
tailored to current FPGAs. Second, to support a wider range of networking
applications, we propose modifications to the FPGA architectures which can also
be of interest out of the networking field.Comment: To be published in : IEEE International Conference on High
Performance Switching and Routing 202
A Scalable High-Performance Memory-Less IP Address Lookup Engine Suitable for FPGA Implementation
RĂSUMĂ
La recherche d'adresse IP est une opĂ©ration trĂšs importante pour les routeurs Internet modernes. De nombreuses approches dans la littĂ©rature ont Ă©tĂ© proposĂ©es pour rĂ©aliser des moteurs de recherche d'adresse IP (Address Lookup Engine â ALE), Ă haute performance. Les ALE existants peuvent ĂȘtre classĂ©s dans lâune ou lâautre de trois catĂ©gories basĂ©es sur: les mĂ©moires ternaires adressables par le contenu (TCAM), les Trie et les Ă©mulations de TCAM. Les approches qui se basent sur des TCAM sont coĂ»teuses et elles consomment beaucoup d'Ă©nergie. Les techniques qui exploitent les Trie ont une latence non dĂ©terministe qui nĂ©cessitent gĂ©nĂ©ralement des accĂšs Ă une mĂ©moire externe. Les techniques qui exploitent des Ă©mulations de TCAM combinent gĂ©nĂ©ralement des TCAM avec des circuits Ă faible coĂ»t. Dans ce mĂ©moire, l'objectif principal est de proposer une architecture d'ALE qui permet la recherche rapide dâadresses IP et qui apporte une solution aux principales lacunes des techniques basĂ©es sur des TCAM et sur des Trie.
Atteindre une vitesse de traitement suffisante dans l'ALE est un aspect important. Des accĂ©lĂ©rateurs matĂ©riels ont Ă©tĂ© adoptĂ©s pour obtenir une le rĂ©sultat de recherche Ă haute vitesse. Le FPGA permettent la mise en Ćuvre dâaccĂ©lĂ©rateurs matĂ©riels reconfigurables spĂ©cialisĂ©s. Cinq architectures dâALE de type Ă©mulation de TCAM sont proposĂ©s dans ce mĂ©moire : une sĂ©rielle, une parallĂšle, une architecture dite IP-Split, une variante appelĂ©e IP-Split-Bucket et une version de lâIP-Split-Bucket qui supporte les mises Ă jours. Chaque architecture est construite Ă partir de lâarchitecture prĂ©cĂ©dente de maniĂšre progressive dans le but dâen amĂ©liorer les performances.
L'architecture sĂ©rielle utilise des mĂ©moires pour stocker la table dâadresses de transmission et un comparateur pour effectuer une recherche sĂ©rielle sur les entrĂ©es. L'architecture parallĂšle stocke les entrĂ©es de la table dans les ressources logiques dâun FPGA, et elle emploie une recherche parallĂšle en utilisant N comparateurs pour une table avec N entrĂ©es. Lâarchitecture IP-Split emploie un niveau de dĂ©codeurs pour Ă©viter des comparaisons rĂ©pĂ©titives dans les entrĂ©es Ă©quivalentes de la table. L'architecture IP-Split-Bucket est une version amĂ©liorĂ©e de l'architecture prĂ©cĂ©dente qui utilise une mĂ©thode de partitionnement visant Ă optimiser l'architecture IP-Split. LâIP-Split-Bucket qui supporte les mises Ă jour est la derniĂšre architecture proposĂ©e. Elle soutient la mise Ă jour et la recherche Ă haute vitesse d'adresses IP. Les rĂ©sultats dâimplĂ©mentations montrent que l'architecture dâALE qui offre les meilleures performances est lâIP-Split-Bucket, qui nâa pas recours Ă une ou plusieurs mĂ©moires. Pour une table dâadresses de transmission IPv4 rĂ©elle comportant 524 k prĂ©fixes, l'architecture IP-Split-Bucket atteint un dĂ©bit de 103,4 M paquets par seconde et elle consomme respectivement 23% et 22% des tables de conversion (LUTs) et des bascules (FFs) sur une puce Xilinx XC7V2000T.----------ABSTRACT
High-performance IP address lookup is highly demanded for modern Internet routers. Many approaches in the literature describe a special purpose Address Lookup Engines (ALE), for IP address lookup. The existing ALEs can be categorised into the following techniques: Ternary Content Addressable Memories-based (TCAM-based), trie-based and TCAM-emulation. TCAM-based techniques are expensive and consume a lot of power, since they employ TCAMs in their architecture. Trie-based techniques have nondeterministic latency and external memory accesses, since they store the Forwarding Information Base (FIB) in the memory using a trie data structure. TCAM-emulation techniques commonly combine TCAMs with lower-cost circuits that handle less time-critical activities. In this thesis, the main objective is to propose an ALE architecture with fast search that addresses the main shortcomings of TCAM-based and trie-based techniques.
Achieving an admissible throughput in the proposed ALE is its fundamental requirement due to the recent improvements of network systems and growth of Internet of Things (IoTs). For that matter, hardware accelerators have been adopted to achieve a high speed search. In this work, Field Programmable Gate Arrays (FPGAs) are specialized reconfigurable hardware accelerators chosen as the target platform for the ALE architecture. Five TCAM-emulation ALE architectures are proposed in this thesis: the Full-Serial, the Full-Parallel, the IP-Split, the IP-Split-Bucket and the Update-enabled IP-Split-Bucket architectures. Each architecture builds on the previous one with progressive improvements.
The Full-Serial architecture employs memories to store the FIB and one comparator to perform a serial search on the FIB entries. The Full-Parallel architecture stores the FIB entries into the logical resources of the FPGA and employs a parallel search using one comparator for each FIB entry. The IP-Split architecture employs a level of decoders to avoid repetitive comparisons in the equivalent entries of the FIB. The IP-Split-Bucket architecture is an upgraded version of the previous architecture using a partitioning scheme aiming to optimize the IP-Split architecture. Finally, the Update-enabled IP-Split-Bucket supports high-update rate IP address lookup. The most efficient proposed architecture is the IP-Split-Bucket, which is a novel high-performance memory-less ALE. For a real-world FIB with 524 k IPv4 prefixes, IP-Split-Bucket achieves a throughput of 103.4M packets per second and consumes respectively 23% and 22% of the Look Up Tables (LUTs) and Flip-Flops (FFs) of a Xilinx XC7V2000T chip
Technology Mapping for Circuit Optimization Using Content-Addressable Memory
The growing complexity of Field Programmable Gate Arrays (FPGA's) is leading to architectures with high input cardinality look-up tables (LUT's). This thesis describes a methodology for area-minimizing technology mapping for combinational logic, specifically designed for such FPGA architectures. This methodology, called LURU, leverages the parallel search capabilities of Content-Addressable Memories (CAM's) to outperform traditional mapping algorithms in both execution time and quality of results. The LURU algorithm is fundamentally different from other techniques for technology mapping in that LURU uses textual string representations of circuit topology in order to efficiently store and search for circuit patterns in a CAM. A circuit is mapped to the target LUT technology using both exact and inexact string matching techniques. Common subcircuit expressions (CSE's) are also identified and used for architectural optimization---a small set of CSE's is shown to effectively cover an average of 96% of the test circuits. LURU was tested with the ISCAS'85 suite of combinational benchmark circuits and compared with the mapping algorithms FlowMap and CutMap. The area reduction shown by LURU is, on average, 20% better compared to FlowMap and CutMap. The asymptotic runtime complexity of LURU is shown to be better than that of both FlowMap and CutMap
Analog Content-Addressable Memory from Complementary FeFETs
To address the increasing computational demands of artificial intelligence
(AI) and big data, compute-in-memory (CIM) integrates memory and processing
units into the same physical location, reducing the time and energy overhead of
the system. Despite advancements in non-volatile memory (NVM) for matrix
multiplication, other critical data-intensive operations, like parallel search,
have been overlooked. Current parallel search architectures, namely
content-addressable memory (CAM), often use binary, which restricts density and
functionality. We present an analog CAM (ACAM) cell, built on two complementary
ferroelectric field-effect transistors (FeFETs), that performs parallel search
in the analog domain with over 40 distinct match windows. We then deploy it to
calculate similarity between vectors, a building block in the following two
machine learning problems. ACAM outperforms ternary CAM (TCAM) when applied to
similarity search for few-shot learning on the Omniglot dataset, yielding
projected simulation results with improved inference accuracy by 5%, 3x denser
memory architecture, and more than 100x faster speed compared to central
processing unit (CPU) and graphics processing unit (GPU) per similarity search
on scaled CMOS nodes. We also demonstrate 1-step inference on a kernel
regression model by combining non-linear kernel computation and matrix
multiplication in ACAM, with simulation estimates indicating 1,000x faster
inference than CPU and GPU
- âŠ