20 research outputs found

    On using content addressable memory for packet classification

    Packet switched networks such as the Internet require packet classification at every hop in order to apply services and security policies to traffic flows. The relentless increase in link speeds and traffic volume imposes stringent constraints on packet classification solutions. Ternary Content Addressable Memory (TCAM) devices are favored by most network component and equipment vendors due to the fast and deterministic lookup performance afforded by their use of massive parallelism. While able to keep up with high speed links, TCAMs suffer from exorbitant power consumption, poor scalability to longer search keys and larger filter sets, and inefficient support of multiple matches. The research community has responded with algorithms that seek to meet the lookup rate constraint with greater efficiency through the use of commodity Random Access Memory (RAM) technology. The most promising algorithms efficiently achieve high lookup rates by leveraging the statistical structure of real filter sets. Due to their dependence on filter set characteristics, it is difficult to provision processing and memory resources for implementations that support a wide variety of filter sets. We show how several algorithmic advances may be leveraged to improve the efficiency, scalability, incremental update and multiple match performance of CAM-based packet classification techniques without degrading the lookup performance. Our approach, Label Encoded Content Addressable Memory (LECAM), represents a hybrid technique that utilizes decomposition, label encoding, and a novel Content Addressable Memory (CAM) architecture. By reducing the number of implementation parameters, LECAM provides a vehicle to carry several of the recent algorithmic advances into practice. We provide a thorough overview of CAM technologies and packet classification algorithms, along with a detailed discussion of the scaling issues that arise with longer search keys and larger filter sets. We also provide a comparative analysis of LECAM and standard TCAM using a collection of real and synthetic filter sets of various sizes and compositions.
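
    As background for the ternary matching that both TCAM and LECAM build on, the following Python sketch models a software TCAM: each filter is a (value, mask) pair over the search key bits, and the highest-priority matching entry wins. It is an illustrative behavioral model only, not the LECAM architecture itself; the 8-bit key width and the example filters are invented for the demonstration.

```python
# Behavioral model of ternary (value/mask) matching as performed by a TCAM.
# A mask bit of 1 means "compare this bit"; 0 means "don't care".

class SoftTCAM:
    def __init__(self):
        self.entries = []  # list of (value, mask, action), in priority order

    def add(self, value, mask, action):
        self.entries.append((value, mask, action))

    def lookup(self, key):
        # Hardware compares all entries in parallel; we scan in priority order.
        for value, mask, action in self.entries:
            if (key & mask) == (value & mask):
                return action
        return "default"

# Hypothetical 8-bit keys and filters, purely for illustration.
tcam = SoftTCAM()
tcam.add(0b10100000, 0b11110000, "deny")    # matches keys 1010xxxx
tcam.add(0b10000000, 0b11000000, "permit")  # matches keys 10xxxxxx

print(tcam.lookup(0b10100111))  # -> deny  (more specific entry listed first)
print(tcam.lookup(0b10010101))  # -> permit
```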

    Models, Algorithms, and Architectures for Scalable Packet Classification

    The growth and diversification of the Internet imposes increasing demands on the performance and functionality of network infrastructure. Routers, the devices responsible for the switching and directing of traffic in the Internet, are being called upon to not only handle increased volumes of traffic at higher speeds, but also impose tighter security policies and provide support for a richer set of network services. This dissertation addresses the searching tasks performed by Internet routers in order to forward packets and apply network services to packets belonging to defined traffic flows. As these searching tasks must be performed for each packet traversing the router, the speed and scalability of the solutions to the route lookup and packet classification problems largely determine the realizable performance of the router, and hence the Internet as a whole. Despite the energetic attention of the academic and corporate research communities, there remains a need for search engines that scale to support faster communication links, larger route tables and filter sets and increasingly complex filters. The major contributions of this work include the design and analysis of a scalable hardware implementation of a Longest Prefix Matching (LPM) search engine for route lookup, a survey and taxonomy of packet classification techniques, a thorough analysis of packet classification filter sets, the design and analysis of a suite of performance evaluation tools for packet classification algorithms and devices, and a new packet classification algorithm that scales to support high-speed links and large filter sets classifying on additional packet fields.
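
    To make the route-lookup problem concrete, here is a minimal Python sketch of Longest Prefix Matching over a binary trie. It is a plain software illustration of LPM semantics, not the hardware search engine described in the dissertation, and the example routes and next-hop names are invented.

```python
# Minimal binary trie for Longest Prefix Matching (LPM) on IPv4 addresses.
import ipaddress

class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]
        self.next_hop = None  # set if a prefix ends at this node

class LPMTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            b = (bits >> (31 - i)) & 1
            if node.children[b] is None:
                node.children[b] = TrieNode()
            node = node.children[b]
        node.next_hop = next_hop

    def lookup(self, addr):
        bits = int(ipaddress.ip_address(addr))
        node, best = self.root, None
        for i in range(32):  # walk the address, remembering the last prefix seen
            if node.next_hop is not None:
                best = node.next_hop
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break
        else:
            if node.next_hop is not None:
                best = node.next_hop
        return best

# Invented example routes.
trie = LPMTrie()
trie.insert("10.0.0.0/8", "port1")
trie.insert("10.1.0.0/16", "port2")
print(trie.lookup("10.1.2.3"))   # -> port2 (longest matching prefix wins)
print(trie.lookup("10.200.0.1")) # -> port1
```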

    Low-Power High-Performance Ternary Content Addressable Memory Circuits

    Ternary content addressable memories (TCAMs) are hardware-based parallel lookup tables with bit-level masking capability. They are attractive for applications such as packet forwarding and classification in network routers. Despite the attractive features of TCAMs, high power consumption is one of the most critical challenges faced by TCAM designers. This work proposes circuit techniques for reducing TCAM power consumption. The main contribution of this work is divided into two parts: (i) reduction in match line (ML) sensing energy, and (ii) static-power reduction techniques. The ML sensing energy is reduced by employing (i) positive-feedback ML sense amplifiers (MLSAs), (ii) low-capacitance comparison logic, and (iii) low-power ML-segmentation techniques. The positive-feedback MLSAs include both resistive and active feedback to reduce the ML sensing energy. A body-bias technique can further improve the feedback action at the expense of additional area and ML capacitance. The measurement results of the active-feedback MLSA show 50-56% reduction in ML sensing energy. The measurement results of the proposed low-capacitance comparison logic show 25% and 42% reductions in ML sensing energy and time, respectively, which can further be improved by careful layout. The low-power ML-segmentation techniques include dual ML TCAM and charge-shared ML. Simulation results of the dual ML TCAM that connects two sides of the comparison logic to two ML segments for sequential sensing show 43% power savings for a small (4%) trade-off in the search speed. The charge-shared ML scheme achieves power savings by partial recycling of the charge stored in the first ML segment. Chip measurement results show that the charge-shared ML scheme results in 11% and 9% reductions in ML sensing time and energy, respectively, which can be improved to 19-25% by using a digitally controlled charge sharing time-window and a slightly modified MLSA. The static power reduction is achieved by a dual-VDD technique and low-leakage TCAM cells. The dual-VDD technique trades off the excess noise margin of MLSA for smaller cell leakage by applying a smaller VDD to TCAM cells and a larger VDD to the peripheral circuits. The low-leakage TCAM cells trade off the speed of READ and WRITE operations for smaller cell area and leakage. Finally, design and testing of a complete TCAM chip are presented, and compared with other published designs.
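
    The match-line segmentation idea can be pictured behaviorally: if the first segment of an entry mismatches, the second segment never needs to be sensed, which is where the savings come from. The Python sketch below counts segment evaluations as a crude proxy for ML sensing energy; the 32-bit entry width, the two-way split, and the random workload are invented, and real savings depend on the circuit-level effects measured in the thesis.

```python
import random

WIDTH = 32
SEG = WIDTH // 2          # split each match line into two equal segments
MASK_LO = (1 << SEG) - 1

def segmented_search(entries, key):
    """Return matching indices, counting segment evaluations as an energy proxy."""
    evaluations = 0
    matches = []
    for i, entry in enumerate(entries):
        evaluations += 1                      # first ML segment always sensed
        if (entry & MASK_LO) != (key & MASK_LO):
            continue                          # second segment never charged
        evaluations += 1                      # sense second segment
        if (entry >> SEG) == (key >> SEG):
            matches.append(i)
    return matches, evaluations

random.seed(0)
entries = [random.getrandbits(WIDTH) for _ in range(1024)]
key = entries[42]
matches, evals = segmented_search(entries, key)
baseline = 2 * len(entries)  # un-segmented: both halves sensed on every entry
print(f"matches={matches}, segment evaluations={evals} vs baseline {baseline}")
```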

    Abstraction techniques for analyzing and mitigating radiation-induced effects (Techniques d'abstraction pour l'analyse et la mitigation des effets dus à la radiation)

    The main objective of this thesis is to develop techniques that can be used to analyze and mitigate the effects of radiation-induced soft errors in industrial-scale integrated circuits. To achieve this goal, several methods have been developed based on analyzing the design at higher levels of abstraction. These techniques address both sequential and combinatorial SER. Fault-injection simulations remain the primary method for analyzing the effects of soft errors. In this thesis, techniques which significantly speed up fault-injection simulations are presented. Soft errors in flip-flops are typically mitigated by selectively replacing the most critical flip-flops with hardened implementations. Selecting an optimal set to harden is a compute-intensive problem, and the second contribution consists of a clustering technique which significantly reduces the number of fault injections required to perform selective mitigation. In terrestrial applications, the effect of soft errors in combinatorial logic has been fairly small. It is known that this effect is growing, yet few techniques exist which can quickly estimate the extent of combinatorial SER for an entire integrated circuit. The third contribution of this thesis is a hierarchical approach to combinatorial soft error analysis. Systems-on-chip are often developed by re-using design blocks that come from multiple sources. In this context, there is a need to develop and exchange reliability models. The final contribution of this thesis consists of an application-specific modeling language called RIIF (Reliability Information Interchange Format). This language is able to model how faults at the gate level propagate up to the block and chip level. Work is underway to standardize the RIIF modeling language as well as to extend it beyond modeling of radiation-induced failures. In addition to the main axis of research, some tangential topics were studied in collaboration with other teams. One of these consisted in the development of a novel approach for protecting ternary content addressable memories (TCAMs), a special type of memory important in networking applications. The second supplemental project resulted in an algorithm for quickly generating approximate redundant logic which can protect combinatorial networks against permanent faults. Finally, an approach for reducing the detection time for errors in the configuration RAM of Field-Programmable Gate Arrays (FPGAs) was outlined.
Radiation effects can cause failures in integrated circuits. When a subatomic particle deposits charge in the sensitive regions of a transistor, it produces a current pulse. This pulse can flip a bit, or propagate through a combinatorial logic network before being sampled by a downstream flip-flop. Depending on the state of the circuit at the moment the particle strikes, and depending on the application, this may or may not cause an observable failure. Only a small fraction of radiation-induced events produce failures, so determining this fraction is essential for predicting system reliability. The reasons a disturbance may be masked are numerous, and it is sometimes difficult to define precisely what constitutes an error. On top of this, integrated circuits contain billions of transistors. As is often the case in computer-aided design, hierarchical approaches and abstraction techniques offer solutions. This thesis therefore proposes several new techniques for analyzing radiation-induced effects. The first technique accelerates fault-injection simulations by detecting when a fault has been eliminated from the system, allowing the simulation to be stopped early. The second technique groups circuit elements with similar functions into sets; analysis can then be performed at the level of these sets, identifying the most critical ones that need hardening, which greatly reduces computation time. The third technique analyzes the effects of transient faults in combinatorial circuits: the transient-fault sensitivity of cells and the masking effects of frequently used blocks can be computed in advance, and these models can then be combined to analyze the sensitivity of large circuits. The final contribution of this thesis is the definition of a new modeling language called RIIF (Reliability Information Interchange Format). This language describes the fault rates of simple components as a function of their operating environment; these components can then be combined, making it possible to model how their faults propagate to system-level failures. Moreover, the use of a standard language facilitates the exchange of reliability data between industrial partners. Beyond the main contributions, this thesis also addresses techniques for protecting ternary content addressable memories (TCAMs), to which classical protection approaches (error-correcting codes) do not directly apply. One of the proposed techniques uses a data structure that can detect, statistically, when a search result is incorrect; the detection probability is controlled by the number of bits allocated to the structure. Another technique uses a built-in current sensor (BICS) to direct a background scrubbing process straight to the region affected by an error. The final contribution is an algorithm for synthesizing combinatorial logic that protects combinatorial circuits against transient faults. Taken together, these techniques facilitate the analysis of errors caused by radiation effects in integrated circuits, particularly for very large circuits composed of blocks from multiple vendors. Techniques for better selecting which flip-flops to harden and approaches for protecting TCAMs were also studied.
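
    To illustrate the style of analysis, here is a toy Python fault-injection campaign on a small behavioral model: each run flips one state bit at one cycle and stops early once the faulty state re-converges with a golden run, which is the early-termination idea described above. The "circuit" (a lossy moving-average datapath), the cycle count, and the workload are all invented for the example.

```python
import random

def step(state, x):
    # Toy "circuit": a 16-bit exponential moving-average filter, a deliberately
    # lossy datapath so that some single-event upsets are eventually masked.
    return (3 * state + x) // 4

def run_golden(inputs):
    states = [0]
    for x in inputs:
        states.append(step(states[-1], x))
    return states

def inject(inputs, flip_cycle, flip_bit, golden):
    state = 0
    for c, x in enumerate(inputs, start=1):
        state = step(state, x)
        if c == flip_cycle:
            state ^= (1 << flip_bit)   # single-event upset in one flip-flop
        elif state == golden[c]:
            return "masked", c         # fault gone: stop the simulation early
    return "observable", len(inputs)

random.seed(1)
inputs = [random.randrange(65536) for _ in range(300)]
golden = run_golden(inputs)
TRIALS = 2000
masked = 0
for _ in range(TRIALS):
    outcome, _ = inject(inputs, random.randrange(1, 301),
                        random.randrange(16), golden)
    masked += (outcome == "masked")
print(f"masked fraction: {masked / TRIALS:.1%}")
```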

    A null convention logic based platform for high speed low energy IP packet forwarding

    By 2020, it is predicted that there will be over 5 billion people and 38.5 billion Internet-of-Things devices on the Internet. The data generated by all these users and devices will have to be transported quickly and efficiently. Routers forming the backbone of this Internet already support multiple 100 Gbps ports, meaning that they would have to perform upwards of 200 million destination address lookups per second in the packet forwarding block that lies in the router ‘data-path’. At the same time, there is also a huge demand to make the network infrastructure more energy efficient. The work presented in this thesis is motivated by the observation that traditional synchronous digital systems will have increasing difficulty keeping up with these conflicting demands. Further, with reducing device geometries, extremes in “process, voltage and temperature” (PVT) variability will undermine reliable synchronous operation. It is expected that asynchronous design techniques will be able to overcome many of these problems and offer a means of lowering energy while maintaining high throughput and low latency. This thesis investigates existing address lookup algorithms and the possibility of combining various approaches to improve energy efficiency without affecting lookup performance. A quasi delay-insensitive asynchronous methodology - Null Convention Logic (NCL) - is then applied to this combined design. Techniques that take advantage of the characteristics of the design methodology and the lookup algorithm to further improve the area, energy and latency characteristics are also analysed. The IP address lookup scheme utilised here is a recent algorithmic approach that uses compact binary-tries and was selected for its high memory efficiency and throughput. The design is pipelined, and the prefix information is stored in large RAMs. A Boolean synchronous implementation of the algorithm is simulated to provide an initial performance benchmark. It is observed that during the address lookup process nearly 68% of the trie accesses are to nodes that contain no prefix information. Bloom filter structures that use non-cryptographic hashes and single-bit memory are introduced into the address lookup process to prevent these unnecessary accesses, thereby reducing the energy consumption. Three non-cryptographic hashing algorithms (CRC32, Jenkins and Murmur) are also analysed for their suitability in Bloom filters, and CRC32 is found to offer the most suitable trade-off between complexity and performance. As a first step to applying the NCL design methodology, NCL implementations of the hashing algorithms are created and evaluated. A significant finding from these experiments is that, unlike Boolean systems, latency and throughput in NCL systems are only loosely coupled. An example Jenkins hash implementation with eight pipeline stages and a cycle time of 3.2 ns exhibits a total latency of 6 ns, whereas an equivalent synchronous implementation with a similar clock period exhibits a latency of 25.6 ns. Further investigations reveal that completion detection circuits within the NCL pipelines impair throughput significantly. Two enhancements to the NCL circuit library aimed particularly at optimising NCL completion detection are proposed and analysed. These are shown to enable completion detection circuits to be built with the same delay but with 30% smaller area and about 75% lower peak current compared to the conventional approach using gates from the standard NCL library.
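
    The Bloom-filter idea above lends itself to a compact software sketch: hash each trie node that holds prefix information into a bit vector, then consult the vector before touching node memory, so most accesses to prefix-free nodes are skipped. The Python below is an assumed illustration using zlib.crc32 with double hashing, not the NCL hardware described in the thesis; the filter size, hash count, and node keys are invented.

```python
import zlib

class BloomFilter:
    def __init__(self, nbits=1 << 16, nhashes=4):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _indices(self, key: bytes):
        # Double hashing from two CRC32 variants (Kirsch-Mitzenmacher style).
        h1 = zlib.crc32(key)
        h2 = zlib.crc32(key, 0xFFFFFFFF) | 1
        return [(h1 + i * h2) % self.nbits for i in range(self.nhashes)]

    def add(self, key: bytes):
        for idx in self._indices(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(key))

# Guarding node accesses: only nodes with prefix info were added to the filter,
# so a negative answer proves the (energy-costly) memory access is unnecessary.
bf = BloomFilter()
for n in [b"node:17", b"node:42", b"node:99"]:
    bf.add(n)

for n in [b"node:42", b"node:7"]:
    if bf.might_contain(n):
        print(n, "-> access node memory")
    else:
        print(n, "-> skip access (definitely no prefix here)")
```
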
An NCL SRAM structure is also proposed to augment the conventional 6-T cell array with circuits to generate the handshaking signals for managing the NCL data flow. Additionally, a dedicated column of cells called the Null-storage column is added, which indicates whether a particular address in the RAM stores no Data, i.e., it is in its Null state. This additional hardware imposes a small area overhead of about 10% but allows accesses to Null locations to be completed in 50% less time and consume 40% less energy than accesses to valid Data locations. An experimental NCL-based address lookup system is then designed that includes all of the developed NCL modules. Statistical delay models derived from circuit-level simulations of individual modules are used to emulate realistic circuit delay variability in the behavioural modules written in Verilog. Simulations of the assembled system demonstrate that, unlike what was observed with the synchronous design, with NCL the design that does not employ Bloom filters, but only the Null-storage column RAMs for prefix storage, exhibits the smallest area on the chip and also consumes the least energy per address lookup. It is concluded that, to derive maximum benefit out of an asynchronous design approach, it is necessary to carefully select the architectural blocks that combine the peculiarities of the implemented algorithm with the capabilities of the NCL design methodology.
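
    The Null-storage column can also be pictured behaviorally: one extra flag per address records whether the word is Null, and a read checks the flag first, so Null reads never touch the data array. Below is a minimal Python sketch of that access discipline; counting data-array accesses stands in for the time/energy savings measured on silicon, and the depth and addresses are invented.

```python
class NullStorageRAM:
    """Behavioral model: a per-address Null flag gates access to the data array."""
    def __init__(self, depth):
        self.data = [0] * depth
        self.is_null = [True] * depth   # the extra Null-storage column
        self.array_accesses = 0         # proxy for time/energy spent in the array

    def write(self, addr, value):
        self.data[addr] = value
        self.is_null[addr] = False
        self.array_accesses += 1

    def write_null(self, addr):
        self.is_null[addr] = True       # only the flag column is touched

    def read(self, addr):
        if self.is_null[addr]:
            return None                 # completes without sensing the data array
        self.array_accesses += 1
        return self.data[addr]

ram = NullStorageRAM(1024)
ram.write(5, 0xBEEF)
print(ram.read(5), ram.read(6))                    # 48879 None
print("data-array accesses:", ram.array_accesses)  # 2 (one write, one valid read)
```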

    Application Centric Networks-On-Chip Design Solutions for Future Multicore Systems

    With advances in technology, future multicore systems scaled to 100s and 1000s of cores/accelerators are being touted as an effective solution for extracting huge performance gains using parallel programming paradigms. However, with the failure of Dennard scaling, not all the components on the chip can be run simultaneously without breaking power and thermal constraints, leading to strict chip power envelopes. The scaling up of the number of on-chip components has also brought about Networks-on-Chip (NoC) based interconnect designs such as the 2D mesh. The contribution of the NoC to total on-chip power and overall performance has been increasing steadily, and hence high-performance, power-efficient NoC designs are becoming crucial. Future multicore paradigms can be broadly classified, based on the applications they are tailored to, into traditional Chip Multi-Processor (CMP) based systems, characterized by low core and NoC utilization, and emerging big-data application based systems, characterized by large amounts of data movement necessitating high throughput. To this end, we propose NoC design solutions for power savings in future CMPs tailored to traditional applications, and for higher effective throughput gains in multicore systems tailored to bandwidth-intensive applications. First, we propose Fly-over, a light-weight distributed mechanism for power-gating routers attached to switched-off cores to reduce NoC power consumption in low-load CMP environments. Secondly, we plan on utilizing a promising next generation memory technology, Spin-Transfer Torque Magnetic RAM (STT-MRAM), to achieve enhanced NoC performance to satisfy the high throughput demands in emerging bandwidth-intensive applications, while simultaneously reducing power consumption. Thirdly, we present a hardware data approximation framework for NoCs, APPROX-NoC, with an online data error control mechanism, which can leverage the approximate computing paradigm in the emerging data-intensive big-data applications to attain higher performance per watt.
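
    To give a feel for the data-approximation idea, here is a small Python sketch of approximate value deduplication for NoC traffic: if an outgoing word is within an error bound of a recently sent reference value, a short index is transmitted instead of the full word. This is only loosely inspired by APPROX-NoC; the matching policy, the 1% error budget, and the table size are invented stand-ins for the framework's actual error control mechanism.

```python
# Approximate deduplication: trade bounded value error for fewer bits moved.
REL_ERROR = 0.01          # invented error budget (1%)
TABLE_SIZE = 8

reference_table = []      # recently transmitted exact values

def encode(value):
    for idx, ref in enumerate(reference_table):
        if ref != 0 and abs(value - ref) / abs(ref) <= REL_ERROR:
            return ("index", idx)              # a few bits on the wire
    reference_table.insert(0, value)           # remember the exact value
    if len(reference_table) > TABLE_SIZE:
        reference_table.pop()
    return ("exact", value)                    # full-width transfer

def decode(packet):
    kind, payload = packet
    return reference_table[payload] if kind == "index" else payload

for v in [100.0, 100.4, 99.7, 250.0, 250.1]:
    pkt = encode(v)
    print(v, "->", pkt, "-> decoded", decode(pkt))
```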

    Accelerating Time Series Analysis via Processing using Non-Volatile Memories

    Time Series Analysis (TSA) is a critical workload for consumer-facing devices. Accelerating TSA is vital for many domains as it enables the extraction of valuable information and the prediction of future events. The state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping (sDTW) algorithm. However, sDTW's computation complexity increases quadratically with the time series' length, resulting in two performance implications. First, the amount of data parallelism available is significantly higher than the small number of processing units enabled by commodity systems (e.g., CPUs). Second, sDTW is bottlenecked by memory because it 1) has low arithmetic intensity and 2) incurs a large memory footprint. To tackle these two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ computation where data resides, using the memory cells. PuM provides a promising solution to alleviate data movement bottlenecks and exposes immense parallelism. In this work, we present MATSA, the first MRAM-based Accelerator for Time Series Analysis. The key idea is to exploit magneto-resistive memory crossbars to enable energy-efficient and fast time series computation in memory. MATSA provides the following key benefits: 1) it leverages high levels of parallelism in the memory substrate by exploiting column-wise arithmetic operations, and 2) it significantly reduces data movement costs by performing computation using the memory cells. We evaluate three versions of MATSA to match the requirements of different environments (e.g., embedded, desktop, or HPC computing) based on MRAM technology trends. We perform a design space exploration and demonstrate that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU and PNM architectures, respectively.
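
    For reference, here is a plain-Python sketch of the sDTW recurrence, processed one time-series sample (one DP column) at a time; this column-at-a-time order is the same axis MATSA parallelizes across memory columns, though the code below is only a scalar illustration with invented data, not the accelerator's implementation.

```python
def sdtw(query, series):
    """Subsequence DTW: best alignment cost of `query` anywhere in `series`."""
    m = len(query)
    INF = float("inf")
    prev = [INF] * (m + 1)       # DP column for the previous series sample
    prev[0] = 0.0                # row 0 is free: subsequence may start anywhere
    best = INF
    for t in series:             # one new DP column per time-series sample
        cur = [0.0] + [INF] * m  # cur[0] = 0 keeps every start position open
        for i in range(1, m + 1):
            cost = abs(query[i - 1] - t)
            cur[i] = cost + min(prev[i],      # deletion
                                cur[i - 1],   # insertion
                                prev[i - 1])  # match
        best = min(best, cur[m]) # subsequence may end at any column
        prev = cur
    return best

# Invented example: the query pattern occurs (slightly warped) in the series.
query = [1.0, 3.0, 2.0]
series = [0.0, 0.1, 1.1, 2.9, 3.0, 2.1, 0.2, 0.0]
print(sdtw(query, series))  # small cost: pattern found despite warping
```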

    Branch Prediction For Network Processors

    Originally designed to favour flexibility over packet processing performance, the programmable network processor now faces the challenge of meeting both increasing line rates and the need for additional processing capabilities. To meet these requirements, trends within networking research have tended to focus on techniques such as offloading computation-intensive tasks to dedicated hardware logic or on increased parallelism. While parallelism retains flexibility, challenges such as load-balancing limit its scope. On the other hand, hardware offloading allows complex algorithms to be implemented at high speed but sacrifices flexibility. To this end, the work in this thesis is focused on a more fundamental aspect of a network processor, the data-plane processing engine. Performing both system modelling and analysis of packet processing functions, the goal of this thesis is to identify and extract salient information regarding the performance of multi-processor workloads. Following on from a traditional software-based analysis of programme workloads, we develop a method of modelling and analysing hardware accelerators when applied to network processors. Using this quantitative information, this thesis proposes an architecture which allows deeply pipelined micro-architectures to be implemented on the data-plane while reducing the branch penalty associated with these architectures.
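
    As background on the branch penalty being targeted, the following Python sketch implements the classic two-bit saturating-counter (bimodal) predictor over a table indexed by branch address. It illustrates generic dynamic branch prediction, not the specific data-plane architecture proposed in the thesis; the table size, PC, and branch trace are invented.

```python
# Classic bimodal branch predictor: one 2-bit saturating counter per table
# entry (0-1 predict not-taken, 2-3 predict taken), indexed by branch PC.
TABLE_BITS = 10

class BimodalPredictor:
    def __init__(self):
        self.counters = [1] * (1 << TABLE_BITS)  # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) & ((1 << TABLE_BITS) - 1)

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# Invented trace: a loop branch at PC 0x400 taken 7 times, then not taken.
trace = ([(0x400, True)] * 7 + [(0x400, False)]) * 3
pred, correct = BimodalPredictor(), 0
for pc, taken in trace:
    correct += (pred.predict(pc) == taken)
    pred.update(pc, taken)
print(f"accuracy: {correct}/{len(trace)}")
```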