7 research outputs found
MLET: A Power Efficient Approach for TCAM Based, IP Lookup Engines in Internet Routers
Routers are one of the important entities in computer networks specially the
Internet. Forwarding IP packets is a valuable and vital function in Internet
routers. Routers extract destination IP address from packets and lookup those
addresses in their own routing table. This task is called IP lookup. Internet
address lookup is a challenging problem due to the increasing routing table
sizes. Ternary Content-Addressable Memories (TCAMs) are becoming very popular
for designing high-throughput address lookup-engines on routers: they are fast,
cost-effective and simple to manage. Despite the TCAMs speed, their high power
consumption is their major drawback. In this paper, Multilevel Enabling
Technique (MLET), a power efficient TCAM based hardware architecture has been
proposed. This scheme is employed after an Espresso-II minimization algorithm
to achieve lower power consumption. The performance evaluation of the proposed
approach shows that it can save considerable amount of routing table's power
consumption.Comment: 14 Pages, IJCNC 201
Analog Content-Addressable Memory from Complementary FeFETs
To address the increasing computational demands of artificial intelligence
(AI) and big data, compute-in-memory (CIM) integrates memory and processing
units into the same physical location, reducing the time and energy overhead of
the system. Despite advancements in non-volatile memory (NVM) for matrix
multiplication, other critical data-intensive operations, like parallel search,
have been overlooked. Current parallel search architectures, namely
content-addressable memory (CAM), often use binary, which restricts density and
functionality. We present an analog CAM (ACAM) cell, built on two complementary
ferroelectric field-effect transistors (FeFETs), that performs parallel search
in the analog domain with over 40 distinct match windows. We then deploy it to
calculate similarity between vectors, a building block in the following two
machine learning problems. ACAM outperforms ternary CAM (TCAM) when applied to
similarity search for few-shot learning on the Omniglot dataset, yielding
projected simulation results with improved inference accuracy by 5%, 3x denser
memory architecture, and more than 100x faster speed compared to central
processing unit (CPU) and graphics processing unit (GPU) per similarity search
on scaled CMOS nodes. We also demonstrate 1-step inference on a kernel
regression model by combining non-linear kernel computation and matrix
multiplication in ACAM, with simulation estimates indicating 1,000x faster
inference than CPU and GPU
IP Routing Table Compression Using TCAM and Distance-one Merge
In an attempt to slow the exhaustion of the Internet Protocol (IP) address space,
Class-less Inter-Domain Routing (CIDR) was proposed and adopted. However, the
decision to utilize CIDR also increases the size of the routing table, since it allows
an arbitrary partitioning of the routing space. We propose a scheme to reduce the
size of routing table in the CIDR context. Our approach utilizes a well-known and
highly efficient heuristic to perform 2-level logic minimization in order to compress
the routing table. By considering the IP routing table as a set of completely specified
logic functions, we demonstrate that our technique can achieve about 25% reduction
in the size of IP routing tables, while ensuring that our approach can handle routing
table updates in real-time. The resulting routing table can be used with existing
routers without needing any change in architecture. However, by realizing the IP
routing table as proposed in this thesis, the implementation requires less complex
hardware than Ternary CAM (TCAM) which are traditionally used to implement IP
routing tables. The proposed architecture also reduces lookup latency by about 46%,
hardware area by 9% and power consumed by 15% in contrast to a TCAM based
implementation
Projeto, implementação e avaliação do suporte de casamento com prefixo mais longo para IPv4/IPv6 em planos de dados programáveis multi-arquitetura
Orientador: Christian Rodolfo Esteve RothenbergDissertação (mestrado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de ComputaçãoResumo: Dentre as novas tendências em programação de dataplane dentro de SDN (Software Defined Networking) destacam-se os esforços para prover um suporte multi-plataforma dotado de alta definição das informações que são processadas pelo pipeline do plano de dados. No entanto, alguns desafios ainda persistem, como a necessidade de um plano de dados programável ou a adoção de uma abstração de programação independente de protocolo. Como forma de mitigar tais problemas, verifica-se que a Linguagem Específica de Domínio~(DSL) Programming Protocol-Independent Packet Processors~(P4) desponta como uma tendência emergente para expressar como os pacotes são processados pelo plano de dados de uma plataforma de rede programável. De modo independente e em paralelo, constata-se que o projeto OpenDataPlane~(ODP) cria um conjunto de plataformas abertas de Application Programming Interfaces~(APIs) projetado para o plano de dados de rede. Isso posto, tem-se que o Multi-Architecture Compiler System for Abstract Dataplanes~(MACSAD) surge como uma abordagem para convergir P4 e ODP em um processo de compilação convencional, arquivando a portabilidade dos aplicativos de plano de dados sem afetar as melhorias de desempenho do alvo. O MACSAD pode integrar a API do ODP e o P4, reunindo-os e definindo um plano de dados programável em um sistema de compilador unificado. Este trabalho tem como objetivo adicionar o suporte do Longest Prefix Match~(LPM) do IPv4/IPv6 ao MACSAD, integrado com as APIs do ODP e à programação P4, oferecendo recursos de planejamento de dados de alto desempenho. O suporte ao LPM proposto para o MACSAD combina o algoritmo de lookup e a biblioteca da API do ODP com o suporte à tabela MACSAD, para criar uma base de encaminhamento completa usada no processo do LPM. A implementação do IPv4 adapta o atual algoritmo de lookup do ODP para trabalhar com o MACSAD. A implementação de lookup IPv6, atualmente não suportada pelo ODP, é uma extensão do suporte IPv4 que é desenvolvido usando o mesmo algoritmo adaptado a uma chave de 128 bits. A pesquisa IPv4 e IPv6 usa uma base de árvore binária para executar o lookup do LPM. Para a avaliação de desempenho do suporte ao LPM, utilizamos uma ferramenta geradora de tráfego Network Function Performance Analyzer~(NFPA) que permite gerar diferentes tipos de tráfego no MACSAD. Cabe ainda destacar, como uma contribuição lateral deste trabalho, o desenvolvimento da ferramenta geradora de pacote BB-Gen, já com lançamento open source. Resultados experimentais mostram que é possível atingir um throughput de 10G com tamanhos de pacotes de 512 bytes ou superioresAbstract: New trends in dataplane programmability inside Software Defined Networking~(SDN) are in efforts to bring multi-platform support with a high definition of the information that is processed by the dataplane pipeline. However, some challenges are still present, as the necessity of a programmable dataplane or a protocol independent programming abstraction. The Programming Protocol-Independent Packet Processors~(P4) Domain Specific Language (DSL) is an emerging trend to express how the packets are processed by the dataplane of a programmable network platform. In parallel, OpenDataPlane~(ODP) project creates an open-source, cross-platform set of Application Programming Interfaces~(APIs) designed for the networking data plane. Multi-Architecture Compiler System for Abstract Dataplanes~(MACSAD) is an approach to converge P4 and ODP in a conventional compilation process, achieving portability of the dataplane applications without affecting the target performance improvements. MACSAD can integrate the ODP API and the P4, bringing them together and defining a programmable dataplane across multiple targets in a unified compiler system. This work aims at adding IPv4/IPv6 Longest Prefix Match~(LPM) support to MACSAD integrated with ODP APIs and P4 programmability delivering high-performance dataplane capabilities. The proposed LPM support for MACSAD combines the lookup algorithm and the ODP API library with MACSAD table support, to create a complete forwarding base used in the LPM process. The IPv4 implementation adapts the current ODP lookup algorithm to work with MACSAD. IPv6 lookup implementation, currently not supported by ODP, is an extension of the IPv4 support, developed using the same algorithm adapted to a 128-bit key. IPv4 and IPv6 lookup use a binary tree base, to perform the LPM lookup. For the performance evaluation of the LPM support, we use a traffic generator tool Network Function Performance Analyzer~(NFPA) that allows generating different types of traffic across MACSAD. A side contribution on this front we developed and released open source the BB-Gen packet crafter tool. Experimental results show that it is possible to reach a throughput of 10G with packets sizes of 512 Bytes and aboveMestradoEngenharia de ComputaçãoMestre em Engenharia Elétric
A null convention logic based platform for high speed low energy IP packet forwarding
By 2020, it is predicted that there will be over 5 billion people and 38.5 billion Internet-ofThings devices on the Internet. The data generated by all these users and devices will have to be transported quickly and efficiently. Routers forming the backbone of this Internet already support multiple 100 Gbps ports meaning that they would have to perform upwards of 200 Million destination addresses lookups per second in the packet forwarding block that lies in the router ‘data-path’. At the same time, there is also a huge demand to make the network infrastructure more energy efficient. The work presented in this thesis is motivated by the observation that traditional synchronous digital systems will have increasing difficulty keeping up with these conflicting demands. Further, with reducing device geometries, extremes in “process, voltage and temperature” (PVT) variability will undermine reliable synchronous operation. It is expected that asynchronous design techniques will be able to overcome many of these problems and offer a means of lowering energy while maintaining high throughput and low latency. This thesis investigates existing address lookup algorithms and investigates the possibility of combining various approaches to improve energy efficiency without affecting lookup performance. A quasi delay-insensitive asynchronous methodology - Null Convention Logic (NCL) - is then applied to this combined design. Techniques that take advantage of the characteristics of the design methodology and the lookup algorithm to further improve the area, energy and latency characteristics are also analysed. The IP address lookup scheme utilised here is a recent algorithmic approach that uses compact binary-tries and was selected for its high memory efficiency and throughput. The design is pipelined, and the prefix information is stored in large RAMs. A Boolean synchronous implementation of the algorithm is simulated to provide an initial performance benchmark. It is observed that during the address lookup process nearly 68% of the trie accesses are to nodes that contained no prefix information. Bloom filter structures that use non-cryptographic hashes and single-bit memory are introduced into the address lookup process to prevent these unnecessary accesses, thereby reducing the energy consumption. Three non-cryptographic hashing algorithms (CRC32, Jenkins and Murmur) are also analysed for their suitability in Bloom filters, and the CRC32 is found to offer the most suitable trade-off between complexity and performance. As a first step to applying the NCL design methodology, NCL implementations of the hashing algorithms are created and evaluated. A significant finding from these experiments is that, unlike Boolean systems, latency and throughput in NCL systems are only loosely coupled. An example Jenkins hash implementation with eight pipeline stages and a cycle time of 3.2 ns exhibits a total latency of 6 ns, whereas an equivalent synchronous implementation with a similar clock period exhibits a latency of 25.6 ns. Further investigations reveal that completion detection circuits within the NCL pipelines impair throughput significantly. Two enhancements to the NCL circuit library aimed particularly at optimising NCL completion detection are proposed and analysed. These are shown to enable completion detection circuits to be built with the same delay but with 30% smaller area and about 75% lower peak current compared to the conventional approach using gates from the standard NCL library. An NCL SRAM structure is also proposed to augment the conventional 6-T cell array with circuits to generate the handshaking signals for managing the NCL data flow. Additionally, a dedicated column of cells called the Null-storage column is added, which indicates if a particular address in the RAM stores no Data, i.e., it is in its Null state. This additional hardware imposes a small area overhead of about 10% but allows accesses to Null locations to be completed in 50% less time and consume 40% less energy than accesses to valid Data locations. An experimental NCL-based address lookup system is then designed that includes all of the developed NCL modules. Statistical delay models derived from circuit-level simulations of individual modules are used to emulate realistic circuit delay variability in the behavioural modules written in Verilog. Simulations of the assembled system demonstrate that unlike what was observed with the synchronous design, with NCL, the design that does not employ Bloom filters, but only the Null-storage column RAMs for prefix storage, exhibits the smallest area on the chip and also consumes the least energy per address lookup. It is concluded that to derive maximum benefit out of an asynchronous design approach; it is necessary to carefully select the architectural blocks that combine the peculiarities of the implemented algorithm with the capabilities of the NCL design methodology