7 research outputs found

    MLET: A Power Efficient Approach for TCAM Based, IP Lookup Engines in Internet Routers

    Full text link
    Routers are one of the important entities in computer networks specially the Internet. Forwarding IP packets is a valuable and vital function in Internet routers. Routers extract destination IP address from packets and lookup those addresses in their own routing table. This task is called IP lookup. Internet address lookup is a challenging problem due to the increasing routing table sizes. Ternary Content-Addressable Memories (TCAMs) are becoming very popular for designing high-throughput address lookup-engines on routers: they are fast, cost-effective and simple to manage. Despite the TCAMs speed, their high power consumption is their major drawback. In this paper, Multilevel Enabling Technique (MLET), a power efficient TCAM based hardware architecture has been proposed. This scheme is employed after an Espresso-II minimization algorithm to achieve lower power consumption. The performance evaluation of the proposed approach shows that it can save considerable amount of routing table's power consumption.Comment: 14 Pages, IJCNC 201

    Analog Content-Addressable Memory from Complementary FeFETs

    Full text link
    To address the increasing computational demands of artificial intelligence (AI) and big data, compute-in-memory (CIM) integrates memory and processing units into the same physical location, reducing the time and energy overhead of the system. Despite advancements in non-volatile memory (NVM) for matrix multiplication, other critical data-intensive operations, like parallel search, have been overlooked. Current parallel search architectures, namely content-addressable memory (CAM), often use binary, which restricts density and functionality. We present an analog CAM (ACAM) cell, built on two complementary ferroelectric field-effect transistors (FeFETs), that performs parallel search in the analog domain with over 40 distinct match windows. We then deploy it to calculate similarity between vectors, a building block in the following two machine learning problems. ACAM outperforms ternary CAM (TCAM) when applied to similarity search for few-shot learning on the Omniglot dataset, yielding projected simulation results with improved inference accuracy by 5%, 3x denser memory architecture, and more than 100x faster speed compared to central processing unit (CPU) and graphics processing unit (GPU) per similarity search on scaled CMOS nodes. We also demonstrate 1-step inference on a kernel regression model by combining non-linear kernel computation and matrix multiplication in ACAM, with simulation estimates indicating 1,000x faster inference than CPU and GPU

    IP Routing Table Compression Using TCAM and Distance-one Merge

    Get PDF
    In an attempt to slow the exhaustion of the Internet Protocol (IP) address space, Class-less Inter-Domain Routing (CIDR) was proposed and adopted. However, the decision to utilize CIDR also increases the size of the routing table, since it allows an arbitrary partitioning of the routing space. We propose a scheme to reduce the size of routing table in the CIDR context. Our approach utilizes a well-known and highly efficient heuristic to perform 2-level logic minimization in order to compress the routing table. By considering the IP routing table as a set of completely specified logic functions, we demonstrate that our technique can achieve about 25% reduction in the size of IP routing tables, while ensuring that our approach can handle routing table updates in real-time. The resulting routing table can be used with existing routers without needing any change in architecture. However, by realizing the IP routing table as proposed in this thesis, the implementation requires less complex hardware than Ternary CAM (TCAM) which are traditionally used to implement IP routing tables. The proposed architecture also reduces lookup latency by about 46%, hardware area by 9% and power consumed by 15% in contrast to a TCAM based implementation

    Projeto, implementação e avaliação do suporte de casamento com prefixo mais longo para IPv4/IPv6 em planos de dados programáveis multi-arquitetura

    Get PDF
    Orientador: Christian Rodolfo Esteve RothenbergDissertação (mestrado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de ComputaçãoResumo: Dentre as novas tendências em programação de dataplane dentro de SDN (Software Defined Networking) destacam-se os esforços para prover um suporte multi-plataforma dotado de alta definição das informações que são processadas pelo pipeline do plano de dados. No entanto, alguns desafios ainda persistem, como a necessidade de um plano de dados programável ou a adoção de uma abstração de programação independente de protocolo. Como forma de mitigar tais problemas, verifica-se que a Linguagem Específica de Domínio~(DSL) Programming Protocol-Independent Packet Processors~(P4) desponta como uma tendência emergente para expressar como os pacotes são processados pelo plano de dados de uma plataforma de rede programável. De modo independente e em paralelo, constata-se que o projeto OpenDataPlane~(ODP) cria um conjunto de plataformas abertas de Application Programming Interfaces~(APIs) projetado para o plano de dados de rede. Isso posto, tem-se que o Multi-Architecture Compiler System for Abstract Dataplanes~(MACSAD) surge como uma abordagem para convergir P4 e ODP em um processo de compilação convencional, arquivando a portabilidade dos aplicativos de plano de dados sem afetar as melhorias de desempenho do alvo. O MACSAD pode integrar a API do ODP e o P4, reunindo-os e definindo um plano de dados programável em um sistema de compilador unificado. Este trabalho tem como objetivo adicionar o suporte do Longest Prefix Match~(LPM) do IPv4/IPv6 ao MACSAD, integrado com as APIs do ODP e à programação P4, oferecendo recursos de planejamento de dados de alto desempenho. O suporte ao LPM proposto para o MACSAD combina o algoritmo de lookup e a biblioteca da API do ODP com o suporte à tabela MACSAD, para criar uma base de encaminhamento completa usada no processo do LPM. A implementação do IPv4 adapta o atual algoritmo de lookup do ODP para trabalhar com o MACSAD. A implementação de lookup IPv6, atualmente não suportada pelo ODP, é uma extensão do suporte IPv4 que é desenvolvido usando o mesmo algoritmo adaptado a uma chave de 128 bits. A pesquisa IPv4 e IPv6 usa uma base de árvore binária para executar o lookup do LPM. Para a avaliação de desempenho do suporte ao LPM, utilizamos uma ferramenta geradora de tráfego Network Function Performance Analyzer~(NFPA) que permite gerar diferentes tipos de tráfego no MACSAD. Cabe ainda destacar, como uma contribuição lateral deste trabalho, o desenvolvimento da ferramenta geradora de pacote BB-Gen, já com lançamento open source. Resultados experimentais mostram que é possível atingir um throughput de 10G com tamanhos de pacotes de 512 bytes ou superioresAbstract: New trends in dataplane programmability inside Software Defined Networking~(SDN) are in efforts to bring multi-platform support with a high definition of the information that is processed by the dataplane pipeline. However, some challenges are still present, as the necessity of a programmable dataplane or a protocol independent programming abstraction. The Programming Protocol-Independent Packet Processors~(P4) Domain Specific Language (DSL) is an emerging trend to express how the packets are processed by the dataplane of a programmable network platform. In parallel, OpenDataPlane~(ODP) project creates an open-source, cross-platform set of Application Programming Interfaces~(APIs) designed for the networking data plane. Multi-Architecture Compiler System for Abstract Dataplanes~(MACSAD) is an approach to converge P4 and ODP in a conventional compilation process, achieving portability of the dataplane applications without affecting the target performance improvements. MACSAD can integrate the ODP API and the P4, bringing them together and defining a programmable dataplane across multiple targets in a unified compiler system. This work aims at adding IPv4/IPv6 Longest Prefix Match~(LPM) support to MACSAD integrated with ODP APIs and P4 programmability delivering high-performance dataplane capabilities. The proposed LPM support for MACSAD combines the lookup algorithm and the ODP API library with MACSAD table support, to create a complete forwarding base used in the LPM process. The IPv4 implementation adapts the current ODP lookup algorithm to work with MACSAD. IPv6 lookup implementation, currently not supported by ODP, is an extension of the IPv4 support, developed using the same algorithm adapted to a 128-bit key. IPv4 and IPv6 lookup use a binary tree base, to perform the LPM lookup. For the performance evaluation of the LPM support, we use a traffic generator tool Network Function Performance Analyzer~(NFPA) that allows generating different types of traffic across MACSAD. A side contribution on this front we developed and released open source the BB-Gen packet crafter tool. Experimental results show that it is possible to reach a throughput of 10G with packets sizes of 512 Bytes and aboveMestradoEngenharia de ComputaçãoMestre em Engenharia Elétric

    Towards Terabit Carrier Ethernet and Energy Efficient Optical Transport Networks

    Get PDF

    A null convention logic based platform for high speed low energy IP packet forwarding

    Get PDF
    By 2020, it is predicted that there will be over 5 billion people and 38.5 billion Internet-ofThings devices on the Internet. The data generated by all these users and devices will have to be transported quickly and efficiently. Routers forming the backbone of this Internet already support multiple 100 Gbps ports meaning that they would have to perform upwards of 200 Million destination addresses lookups per second in the packet forwarding block that lies in the router ‘data-path’. At the same time, there is also a huge demand to make the network infrastructure more energy efficient. The work presented in this thesis is motivated by the observation that traditional synchronous digital systems will have increasing difficulty keeping up with these conflicting demands. Further, with reducing device geometries, extremes in “process, voltage and temperature” (PVT) variability will undermine reliable synchronous operation. It is expected that asynchronous design techniques will be able to overcome many of these problems and offer a means of lowering energy while maintaining high throughput and low latency. This thesis investigates existing address lookup algorithms and investigates the possibility of combining various approaches to improve energy efficiency without affecting lookup performance. A quasi delay-insensitive asynchronous methodology - Null Convention Logic (NCL) - is then applied to this combined design. Techniques that take advantage of the characteristics of the design methodology and the lookup algorithm to further improve the area, energy and latency characteristics are also analysed. The IP address lookup scheme utilised here is a recent algorithmic approach that uses compact binary-tries and was selected for its high memory efficiency and throughput. The design is pipelined, and the prefix information is stored in large RAMs. A Boolean synchronous implementation of the algorithm is simulated to provide an initial performance benchmark. It is observed that during the address lookup process nearly 68% of the trie accesses are to nodes that contained no prefix information. Bloom filter structures that use non-cryptographic hashes and single-bit memory are introduced into the address lookup process to prevent these unnecessary accesses, thereby reducing the energy consumption. Three non-cryptographic hashing algorithms (CRC32, Jenkins and Murmur) are also analysed for their suitability in Bloom filters, and the CRC32 is found to offer the most suitable trade-off between complexity and performance. As a first step to applying the NCL design methodology, NCL implementations of the hashing algorithms are created and evaluated. A significant finding from these experiments is that, unlike Boolean systems, latency and throughput in NCL systems are only loosely coupled. An example Jenkins hash implementation with eight pipeline stages and a cycle time of 3.2 ns exhibits a total latency of 6 ns, whereas an equivalent synchronous implementation with a similar clock period exhibits a latency of 25.6 ns. Further investigations reveal that completion detection circuits within the NCL pipelines impair throughput significantly. Two enhancements to the NCL circuit library aimed particularly at optimising NCL completion detection are proposed and analysed. These are shown to enable completion detection circuits to be built with the same delay but with 30% smaller area and about 75% lower peak current compared to the conventional approach using gates from the standard NCL library. An NCL SRAM structure is also proposed to augment the conventional 6-T cell array with circuits to generate the handshaking signals for managing the NCL data flow. Additionally, a dedicated column of cells called the Null-storage column is added, which indicates if a particular address in the RAM stores no Data, i.e., it is in its Null state. This additional hardware imposes a small area overhead of about 10% but allows accesses to Null locations to be completed in 50% less time and consume 40% less energy than accesses to valid Data locations. An experimental NCL-based address lookup system is then designed that includes all of the developed NCL modules. Statistical delay models derived from circuit-level simulations of individual modules are used to emulate realistic circuit delay variability in the behavioural modules written in Verilog. Simulations of the assembled system demonstrate that unlike what was observed with the synchronous design, with NCL, the design that does not employ Bloom filters, but only the Null-storage column RAMs for prefix storage, exhibits the smallest area on the chip and also consumes the least energy per address lookup. It is concluded that to derive maximum benefit out of an asynchronous design approach; it is necessary to carefully select the architectural blocks that combine the peculiarities of the implemented algorithm with the capabilities of the NCL design methodology
    corecore