86 research outputs found

    Z-TCAM: An SRAM-based Architecture for TCAM


    FPGA based Ternary Content Addressable Memory using SRAM


    Ternary content addressable memory for longest prefix matching based on random access memory on field programmable gate array

    Conventional ternary content addressable memory (TCAM) provides access to stored data, which consists of '0', '1' and 'don't care' values, and outputs the matched address. Content lookup in TCAM can be done in a single cycle, which makes it very important in applications such as address lookup and deep-packet inspection. This paper proposes an improved TCAM architecture with fast update functionality. To support longest prefix matching (LPM), LPM logic is added to the proposed TCAM. The latency of the proposed LPM logic depends on the number of matching addresses in the address prefix comparison. Adding parallel LPM logic improves the throughput by 10× compared to the design without it. Although this incurs resource overhead, the cost of throughput per bit is lower than for the design without parallel LPM logic.
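As a rough software illustration of the lookup described in this abstract, the sketch below models ternary matching and a longest-prefix selection step. It is not the paper's architecture: the bit-string encoding with `x` for don't care, the function names, and the example table are all assumptions made for illustration, and the loop over entries stands in for what hardware would compare in parallel.

```python
from typing import List, Optional


def ternary_match(key: str, pattern: str) -> bool:
    """True if each pattern bit equals the key bit or is a don't care ('x')."""
    return len(key) == len(pattern) and all(
        p in ("x", k) for k, p in zip(key, pattern)
    )


def tcam_lookup(table: List[str], key: str) -> List[int]:
    """Addresses of every matching entry (a TCAM compares all entries at once)."""
    return [addr for addr, pattern in enumerate(table) if ternary_match(key, pattern)]


def prefix_length(pattern: str) -> int:
    """Number of specified leading bits before the don't-care tail."""
    return len(pattern.rstrip("x"))


def longest_prefix_match(table: List[str], key: str) -> Optional[int]:
    """Among the matching addresses, pick the entry with the longest prefix."""
    matches = tcam_lookup(table, key)
    if not matches:
        return None
    return max(matches, key=lambda addr: prefix_length(table[addr]))


# Hypothetical 8-bit table, for illustration only.
table = ["10xxxxxx", "1011xxxx", "0xxxxxxx", "101101xx"]
key = "10110111"
print(tcam_lookup(table, key))           # [0, 1, 3]
print(longest_prefix_match(table, key))  # 3
```

The selection step has to consider every matching address, which is consistent with the abstract's remark that LPM latency depends on the number of matches.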

    Design space explorations of Hybrid-Partitioned TCAM (HP-TCAM)


    On using content addressable memory for packet classification

    Packet switched networks such as the Internet require packet classification at every hop in order to apply services and security policies to traffic flows. The relentless increase in link speeds and traffic volume imposes stringent constraints on packet classification solutions. Ternary Content Addressable Memory (TCAM) devices are favored by most network component and equipment vendors due to the fast and deterministic lookup performance afforded by their use of massive parallelism. While able to keep up with high speed links, TCAMs suffer from exorbitant power consumption, poor scalability to longer search keys and larger filter sets, and inefficient support of multiple matches. The research community has responded with algorithms that seek to meet the lookup rate constraint with greater efficiency through the use of commodity Random Access Memory (RAM) technology. The most promising algorithms efficiently achieve high lookup rates by leveraging the statistical structure of real filter sets. Due to their dependence on filter set characteristics, it is difficult to provision processing and memory resources for implementations that support a wide variety of filter sets. We show how several algorithmic advances may be leveraged to improve the efficiency, scalability, incremental update and multiple match performance of CAM-based packet classification techniques without degrading the lookup performance. Our approach, Label Encoded Content Addressable Memory (LECAM), represents a hybrid technique that utilizes decomposition, label encoding, and a novel Content Addressable Memory (CAM) architecture. By reducing the number of implementation parameters, LECAM provides a vehicle to carry several of the recent algorithmic advances into practice. We provide a thorough overview of CAM technologies and packet classification algorithms, along with a detailed discussion of the scaling issues that arise with longer search keys and larger filter sets. We also provide a comparative analysis of LECAM and standard TCAM using a collection of real and synthetic filter sets of various sizes and compositions.

    A Scalable High-Performance Memory-Less IP Address Lookup Engine Suitable for FPGA Implementation

    High-performance IP address lookup is in high demand for modern Internet routers. Many approaches in the literature describe special-purpose Address Lookup Engines (ALEs) for IP address lookup. Existing ALEs fall into three categories: Ternary Content Addressable Memory-based (TCAM-based), trie-based and TCAM-emulation techniques. TCAM-based techniques are expensive and consume a lot of power, since they employ TCAMs in their architecture. Trie-based techniques have nondeterministic latency and require external memory accesses, since they store the Forwarding Information Base (FIB) in memory using a trie data structure. TCAM-emulation techniques commonly combine TCAMs with lower-cost circuits that handle less time-critical activities. In this thesis, the main objective is to propose an ALE architecture with fast search that addresses the main shortcomings of TCAM-based and trie-based techniques. Achieving an admissible throughput is the fundamental requirement of the proposed ALE, given the recent improvements of network systems and the growth of the Internet of Things (IoT). To that end, hardware accelerators have been adopted to achieve a high-speed search. In this work, Field Programmable Gate Arrays (FPGAs), which allow specialized reconfigurable hardware accelerators to be implemented, are chosen as the target platform for the ALE architecture. Five TCAM-emulation ALE architectures are proposed in this thesis: the Full-Serial, the Full-Parallel, the IP-Split, the IP-Split-Bucket and the Update-enabled IP-Split-Bucket architectures. Each architecture builds on the previous one with progressive improvements. The Full-Serial architecture employs memories to store the FIB and one comparator to perform a serial search on the FIB entries. The Full-Parallel architecture stores the FIB entries in the logic resources of the FPGA and employs a parallel search using one comparator for each FIB entry. The IP-Split architecture employs a level of decoders to avoid repetitive comparisons in equivalent entries of the FIB. The IP-Split-Bucket architecture is an upgraded version of the previous architecture that uses a partitioning scheme to optimize the IP-Split architecture. Finally, the Update-enabled IP-Split-Bucket supports IP address lookup with a high update rate. The most efficient proposed architecture is the IP-Split-Bucket, which is a novel high-performance memory-less ALE. For a real-world FIB with 524 k IPv4 prefixes, IP-Split-Bucket achieves a throughput of 103.4 M packets per second and consumes 23% and 22%, respectively, of the Look-Up Tables (LUTs) and Flip-Flops (FFs) of a Xilinx XC7V2000T chip.
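To make the Full-Serial idea concrete (a stored FIB searched one entry at a time by a single comparator), here is a minimal software sketch using Python's standard `ipaddress` module. It is only an analogy for the hardware design: the function name, the next-hop labels, and the example prefixes are hypothetical and do not come from the thesis.

```python
import ipaddress
from typing import List, Optional, Tuple

# Each FIB entry pairs a prefix with a next-hop label (labels are hypothetical).
Fib = List[Tuple[ipaddress.IPv4Network, str]]


def full_serial_lookup(fib: Fib, addr_str: str) -> Optional[str]:
    """Scan the FIB one entry at a time, keeping the longest matching prefix."""
    addr = ipaddress.IPv4Address(addr_str)
    best_len = -1
    best_hop = None
    for network, next_hop in fib:
        # One comparison per entry, mirroring a single shared comparator.
        if addr in network and network.prefixlen > best_len:
            best_len, best_hop = network.prefixlen, next_hop
    return best_hop


# Illustrative table, not taken from a real router.
fib = [
    (ipaddress.IPv4Network("10.0.0.0/8"), "port1"),
    (ipaddress.IPv4Network("10.1.0.0/16"), "port2"),
    (ipaddress.IPv4Network("0.0.0.0/0"), "default"),
]

print(full_serial_lookup(fib, "10.1.2.3"))  # port2
print(full_serial_lookup(fib, "10.2.0.1"))  # port1
print(full_serial_lookup(fib, "8.8.8.8"))   # default
```

The loop makes lookup time grow with the number of FIB entries, which is the cost the Full-Parallel design avoids by giving each entry its own comparator.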

    Memory Management for Emerging Memory Technologies

    The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing, as virtualization and cloud computing demand more performance from data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory (NVM) technologies are emerging as an answer to the current main memory issues. This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of the CPU last-level cache (LLC), traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM. The first solution we propose is “Adaptive Replacement and Insertion” (ARI), an adaptive approach to last-level CPU cache management that optimizes the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving the miss rate relative to the conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from the SPEC2006 suite by 32.9% on average while also decreasing misses by 4.7% on average. In a PCM-based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling. Our second proposal is “Variable-Timeslice Thread Scheduling” (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large LLCs, the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta and allowing larger periods of time between interfering threads. Our approach, compared to conventional scheduling, mitigates up to 100% of the performance loss caused by CPU LLC interference. The system throughput is boosted by up to 15%. As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design with Flash transistors. TCAM is successfully used in network routing but can also be utilized in OS virtual memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach. In order to lower the cost of main memory in systems with huge memory demand, it is becoming practical to extend the DRAM in the system with less-expensive NVMe Flash, for a much lower system cost. However, given the relatively high access latency of Flash devices, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to regain about 50% of the performance lost due to swapping into the NVM and thus enable the utilization of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping the performance comparable to that of a DRAM-only system.
