4 research outputs found

    Algoritmos paralelos e eficientes para consultas IP no Intel(R) Xeon Phi(tm) e CPUs Multi-Core

    Get PDF
    Dissertação (mestrado)—Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2017.Roteadores em software são uma solução promissora para lidar com o encaminhamento de pacotes devido ao seu bom custo-benefício e flexibilidade. Contudo, é desafiador o desenvolvimento de roteadores em software capazes de atingir as taxas de encaminhamento de pacotes necessárias. O uso de sistemas e técnicas de computação paralela pode ser uma abordagem viável para melhorar o desempenho dessas soluções. A fase de consulta IP constitui uma operação central no encaminhamento de pacotes, que é implementada através de um algoritmo de Casamento de Maior Prefixo (CMP). Assim, este trabalho propõe e avalia o uso de técnicas e processadores paralelos no desenvolvimento de um algoritmo otimizado que emprega filtros de Bloom (BFs) e tabelas hash para a execução de consultas IP. Especificamente, tem-se como alvo a implementação desse algoritmo no coprocessador many-core Intel® Xeon Phi™ (Intel Phi), mas também avalia-se o seu desempenho em CPUs multi-core e em um modelo de execução cooperativa que usa ambos os processadores com várias otimizações. Os resultados experimentais mostram que foi possível atingir altas taxas de consultas IP — até 182,7 Mlps (milhões de pacotes por segundo) ou 119,9 Gbps para pacotes IPv6 de 84B — em um único Intel Phi. Este desempenho indica que o Intel Phi é uma plataforma promissora para a implantação de algoritmos de consultas IP. Além disso, comparou-se o desempenho do algoritmo BFs com uma abordagem eficiente baseada na Multi-Index Hybrid Trie (MIHT), na qual o algoritmo BFs foi até 5,39x mais rápido. Esta comparação mostra que o algoritmo sequencial mais eficiente pode não ser a melhor opção em uma configuração paralela. Alternativamente, é necessário avaliar as características dos processadores, as demandas de computação/dados dos algoritmos e as estruturas de dados empregadas para analisar como os algoritmos podem se beneficiar de um dispositivo de computação paralelo, potenciais limitações na escalabilidade e oportunidades de otimização. Estas descobertas também são importantes para novos esforços no desenvolvimento de algoritmos nessa área, os quais têm sido, em sua maioria, focados em soluções sequenciais.Software routers are a promising solution to deal with packet forwarding because of their good cost benefit and flexibility. However, it is challenging to develop software routers that can attain the required packet forwarding rates. The use of parallel computing systems and techniques may be a viable approach to improve the performance of these solutions. The IP lookup phase is a core operation in packet forwarding, which is implemented via a Longest Prefix Matching (LPM) algorithm to find the next hop address for every input packet. Therefore, this work proposes and evaluates the use of parallel processors and techniques in the development of an optimized algorithm that employs Bloom filters (BFs) and hash tables to the IP lookup problem. Specifically, we target the implementation on the Intel® Xeon Phi™ (Intel Phi) many-core coprocessor, but we also evaluate its performance on multi-core CPUs and on a cooperative execution model that uses both processors with several optimizations. The experimental results show that we were able to attain high IP lookup throughputs — up to 182.7 Mlps (million packets per second) or 119.9 Gbps for 84B IPv6 packets — on a single Intel Phi. This performance indicates that the Intel Phi is a very promising platform for deployment of IP lookup algorithms. We have also compared the BFs algorithm to an efficient approach based on the Multi-Index Hybrid Trie (MIHT) in which the BFs algorithm was up to 5.39x faster. This comparison shows that the most efficient sequential algorithm may not be the best option in a parallel setting. Instead, it is necessary to evaluate the processors characteristics, algorithms compute/data demands, and data structures employed to analyze how the algorithms will benefit from parallel computing devices, potential limitations on scalability and opportunities for optimizations. These findings are also important to new efforts in algorithmic developments in the topic, which have been highly focused on sequential solutions

    Dynamic Tree Bitmap For IP Lookup and Update

    No full text
    We propose a data structure–dynamic tree bitmap–for the representation of dynamic IP router tables that must support very high lookup and update rates. In fact, the dynamic tree bitmap is able to support updates at the same rate as lookups and is very competitive with other structures–tree bitmap and BaRT–proposed earlier for dynamic tables. Although the dynamic tree bitmap requires more memory than is required by the tree bitmap and BaRT structures, the required memory remains reasonable. The real value of our structure is its ability to support a very high update rate

    High-performance software packet processing

    Full text link
    In today’s Internet, it is highly desirable to have fast and scalable software packet processing solutions for network applications that run on commodity hardware. The advent of cloud computing drives the continued rapid growth of Internet traffic. Moreover, the development of emerging networking techniques, such as Network Function Virtualization, significantly shapes the need for implementing the network functions in software. Finally, with the advancement of modern platforms as well as software frameworks for packet processing, network applications have potential to process 100+ Gbps network traffic on a single commodity server. Representative frameworks include the Click modular router, the RouteBricks scalable routing architecture, and BUFFALO, the software-based Ethernet switch. Beneath this general-purpose routing and switching functionality lie a broad set of network applications, many of which are handled with custom methods to provide cost-effectiveness and flexibility. This thesis considers two long-standing networking applications, IP lookup and distributed denial-of-service (DDoS) mitigation, and proposes efficient software-based methods drawing from this new perspective. In this thesis, we first introduce several optimization techniques to accelerate network applications by taking advantage of modern CPU features. Then, we explore the IP lookup problem to find the longest matching prefix of an IP address in a set of prefixes. An ideal IP lookup algorithm should achieve small constant IP lookup time, and on-chip memory usage. However, no prior IP lookup algorithm achieves both requirements at the same time. We propose SAIL, a splitting approach to IP lookup, and a suite of algorithms for IP lookup based on SAIL framework. We conducted extensive experiments to evaluate our algorithms, and experimental results show that our SAIL algorithms are much faster than well-known IP lookup algorithms. Next, we switch our focus to DDoS, an attempt to disrupt the legitimate traffic of a victim by sending a flood of Internet traffic from different sources. Our solution is Gatekeeper, the first open-source and deployable DDoS mitigation system. We present a series of optimization techniques, including use of modern platforms, group prefetching, coroutines, and hashing, to accelerate Gatekeeper. Experimental results show that these optimization techniques significantly improve its performance over alternative baseline solutions.2022-01-30T00:00:00

    A null convention logic based platform for high speed low energy IP packet forwarding

    Get PDF
    By 2020, it is predicted that there will be over 5 billion people and 38.5 billion Internet-ofThings devices on the Internet. The data generated by all these users and devices will have to be transported quickly and efficiently. Routers forming the backbone of this Internet already support multiple 100 Gbps ports meaning that they would have to perform upwards of 200 Million destination addresses lookups per second in the packet forwarding block that lies in the router ‘data-path’. At the same time, there is also a huge demand to make the network infrastructure more energy efficient. The work presented in this thesis is motivated by the observation that traditional synchronous digital systems will have increasing difficulty keeping up with these conflicting demands. Further, with reducing device geometries, extremes in “process, voltage and temperature” (PVT) variability will undermine reliable synchronous operation. It is expected that asynchronous design techniques will be able to overcome many of these problems and offer a means of lowering energy while maintaining high throughput and low latency. This thesis investigates existing address lookup algorithms and investigates the possibility of combining various approaches to improve energy efficiency without affecting lookup performance. A quasi delay-insensitive asynchronous methodology - Null Convention Logic (NCL) - is then applied to this combined design. Techniques that take advantage of the characteristics of the design methodology and the lookup algorithm to further improve the area, energy and latency characteristics are also analysed. The IP address lookup scheme utilised here is a recent algorithmic approach that uses compact binary-tries and was selected for its high memory efficiency and throughput. The design is pipelined, and the prefix information is stored in large RAMs. A Boolean synchronous implementation of the algorithm is simulated to provide an initial performance benchmark. It is observed that during the address lookup process nearly 68% of the trie accesses are to nodes that contained no prefix information. Bloom filter structures that use non-cryptographic hashes and single-bit memory are introduced into the address lookup process to prevent these unnecessary accesses, thereby reducing the energy consumption. Three non-cryptographic hashing algorithms (CRC32, Jenkins and Murmur) are also analysed for their suitability in Bloom filters, and the CRC32 is found to offer the most suitable trade-off between complexity and performance. As a first step to applying the NCL design methodology, NCL implementations of the hashing algorithms are created and evaluated. A significant finding from these experiments is that, unlike Boolean systems, latency and throughput in NCL systems are only loosely coupled. An example Jenkins hash implementation with eight pipeline stages and a cycle time of 3.2 ns exhibits a total latency of 6 ns, whereas an equivalent synchronous implementation with a similar clock period exhibits a latency of 25.6 ns. Further investigations reveal that completion detection circuits within the NCL pipelines impair throughput significantly. Two enhancements to the NCL circuit library aimed particularly at optimising NCL completion detection are proposed and analysed. These are shown to enable completion detection circuits to be built with the same delay but with 30% smaller area and about 75% lower peak current compared to the conventional approach using gates from the standard NCL library. An NCL SRAM structure is also proposed to augment the conventional 6-T cell array with circuits to generate the handshaking signals for managing the NCL data flow. Additionally, a dedicated column of cells called the Null-storage column is added, which indicates if a particular address in the RAM stores no Data, i.e., it is in its Null state. This additional hardware imposes a small area overhead of about 10% but allows accesses to Null locations to be completed in 50% less time and consume 40% less energy than accesses to valid Data locations. An experimental NCL-based address lookup system is then designed that includes all of the developed NCL modules. Statistical delay models derived from circuit-level simulations of individual modules are used to emulate realistic circuit delay variability in the behavioural modules written in Verilog. Simulations of the assembled system demonstrate that unlike what was observed with the synchronous design, with NCL, the design that does not employ Bloom filters, but only the Null-storage column RAMs for prefix storage, exhibits the smallest area on the chip and also consumes the least energy per address lookup. It is concluded that to derive maximum benefit out of an asynchronous design approach; it is necessary to carefully select the architectural blocks that combine the peculiarities of the implemented algorithm with the capabilities of the NCL design methodology
    corecore