250 research outputs found

    ADVANCED HASHING SCHEMES FOR PACKETFORWARDING USING SET ASSOCIATIVEMEMORY ARCHITECTURES

    Get PDF
    Building a high performance IP packet forwarding (PF) engine remains a challenge due to increasingly stringent throughput requirements and the growing sizes of IP forwarding tables.The router has to match the incoming packet's IP address against the forwarding table.The matching process has to be done in wire speed which is why scalability and low power consumption are features that PF engines must maintain.It is common for PF engines to use hash tables; however, the classic hashing downsides have to be dealt with (e.g., collisions, worst case memory access time, ... etc.).While open addressing hash tables, in general, provide good average case search performance, their memory utilization and worst case performance can degrade quickly due to collisions that leads to bucket overflows.Set associative memory can be used for hardware implementations of hash tables with the property that each bucket of a hash table can be searched in one memory cycle.Hence, PF engine architectures based on associative memory will outperform those based on the conventional Ternary Content Addressable Memory (TCAM) in terms of power and scalability.The two standard solutions to the overflow problem are either to use some sort of predefined probing (e.g., linear or quadratic) or to use multiple hash functions.This work presents two new hash schemes that extend both aforementioned solutions to tackle the overflow problem efficiently.The first scheme is a hash probing scheme that is called Content-based HAsh Probing, or CHAP.CHAP is a probing scheme that is based on the content of the hash table to avoid the classical side effects of predefined hash probing methods (i.e., primary and secondary clustering phenomena) and at the same time reduces the overflow.The second scheme, called Progressive Hashing, or PH, is a general multiple hash scheme that reduces the overflow as well.PH splits the prefixes into groups where each group is assigned one hash function, then reuse some hash functions in a progressive fashion to reduce the overflow.We show by experimenting with real IP lookup tables that both schemes outperform other hashing schemes

    High-Performance Packet Processing Engines Using Set-Associative Memory Architectures

    Get PDF
    The emergence of new optical transmission technologies has led to ultra-high Giga bits per second (Gbps) link speeds. In addition, the switch from 32-bit long IPv4 addresses to the 128-bit long IPv6 addresses is currently progressing. Both factors make it hard for new Internet routers and firewalls to keep up with wire-speed packet-processing. By packet-processing we mean three applications: packet forwarding, packet classification and deep packet inspection. In packet forwarding (PF), the router has to match the incoming packet's IP address against the forwarding table. It then directs each packet to its next hop toward its final destination. A packet classification (PC) engine examines a packet header by matching it against a database of rules, or filters, to obtain the best matching rule. Rules are associated with either an ``action'' (e.g., firewall) or a ``flow ID'' (e.g., quality of service or QoS). The last application is deep packet inspection (DPI) where the firewall has to inspect the actual packet payload for malware or network attacks. In this case, the payload is scanned against a database of rules, where each rule is either a plain text string or a regular expression. In this thesis, we introduce a family of hardware solutions that combine the above requirements. These solutions rely on a set-associative memory architecture that is called CA-RAM (Content Addressable-Random Access Memory). CA-RAM is a hardware implementation of hash tables with the property that each bucket of a hash table can be searched in one memory cycle. However, the classic hashing downsides have to be dealt with, such as collisions that lead to overflow and worst-case memory access time. The two standard solutions to the overflow problem are either to use some predefined probing (e.g., linear or quadratic) or to use multiple hash functions. We present new hash schemes that extend both aforementioned solutions to tackle the overflow problem efficiently. We show by experimenting with real IP lookup tables, synthetic packet classification rule sets and real DPI databases that our schemes outperform other previously proposed schemes

    A Survey of Hashing Techniques for High Performance Computing

    Get PDF
    Hashing is a well-known and widely used technique for providing O(1) access to large files on secondary storage and tables in memory. Hashing techniques were introduced in the early 60s. The term hash function historically is used to denote a function that compresses a string of arbitrary input to a string of fixed length. Hashing finds applications in other fields such as fuzzy matching, error checking, authentication, cryptography, and networking. Hashing techniques have found application to provide faster access in routing tables, with the increase in the size of the routing tables. More recently, hashing has found applications in transactional memory in hardware. Motivated by these newly emerged applications of hashing, in this paper we present a survey of hashing techniques starting from traditional hashing methods with greater emphasis on the recent developments. We provide a brief explanation on hardware hashing and a brief introduction to transactional memory

    Design of Special Function Units in Modern Microprocessors

    Get PDF
    Today’s computing systems demand high performance for applications such as cloud computing, web-based search engines, network applications, and social media tasks. Such software applications involve an extensive use of hashing and arithmetic operations in their computation. In this thesis, we explore the use of new special function units (SFUs) for modern microprocessors, to accelerate such workloads. First, we design an SFU for hashing. Hashing can reduce the complexity of search and lookup from O(p) to O(p/n), where n bins are used and p items are being processed. In modern microprocessors, hashing is done in software. In our work, we propose a novel hardware hash unit design for use in modern microprocessors. Since the hash unit is designed at the hardware level, several advantages are obtained by our approach. First, a hardware-based hash unit executes a single hash instruction to perform a hash operation. In a software-based hashing in modern microprocessors, a hash operation is compiled into multiple instructions, thereby degrading performance. Second, software-based hashing stores hash data in a DRAM (also, hash operation entries can be stored in one of the cache levels). In a hardware-based hash unit, hash data is stored in a dedicated memory module (a hardware hash table), which improves performance. Third, today’s operating systems execute multiple applications (processes) in parallel, which entail high memory utilization. Hence the operating systems require many context switching between different processes, which results in many cache misses. In a hardware-based hash unit, the cache misses is reduced significantly using the dedicated memory module (hash table). These advantages all reduce the power consumption and increase the overall system performance significantly with a minimal increase in the microprocessor’s die area. We evaluate our hardware-based hash unit and compare its performance with software-based hashing. We start by evaluating our design approach at the micro-architecture level in terms of system performance. After that, we design our approach at the circuit level design to obtain the area overhead. Also, we analyze our design’s power and delay for each hash operation. These results are compared with a traditional hashing implementation. Then, we present an FPGA-based coprocessor for hash unit acceleration, applied to a virus checking application. Second, we present an SFU to speed up arithmetic operations. We call this arithmetic SFU a programmable arithmetic unit (PAU). In modern microprocessors, applications that require heavy arithmetic computations are done in software. To improve the performance for such computations, we present a programmable arithmetic unit (PAU), a partially reconfigurable methodology for arithmetic applications. The PAU consists of a set of IP blocks connected to a reconfigurable FPGA controller via a fast mesh-based interconnect. The IP blocks in the PAU can be any IP block such as adders, subtractors, multipliers, comparators and sign extension units. The PAU can have one or more copies of the same IP block (for example, 5 adders and 7 multipliers). The FPGA controller is an on-chip FPGA-based reconfigurable control fabric. The FPGA controller enables different arithmetic applications to be embedded on the PAU. The FPGA controller is programmed for different applications. The reconfigurable logic is based on a LUT-based design like a traditional FPGA. The FPGA controller and the IP blocks in the PAU communicate via a high speed ring data fabric. In our work, we use the PAU as an SFU in modern microprocessors. We compare the performance of different hardware-based arithmetic applications in the PAU with software-based implementations in modern microprocessors

    Sistema de inspeção superficial de pacotes de alto desempenho para identificação de tráfego

    Get PDF
    Mestrado em Engenharia de Computadores e TelemáticaThe evolution and growth of the Internet has led to a growing preoccupation regarding dynamic allocation of resources in large networks, as well as to an unprecedented growing adoption of security policies based on tra c classi cation. This phenomenon triggered the creation of deep inspection mechanisms for packets where we can see a cross-access that is based on the retrieval of speci c strings present in packet's Payload. This event raises a number of technical, ethical, and potentially legal limitations. With the increasing need to develop less invasive and more e cient inspection mechanisms, in terms of processing speed and potentially, memory management, the scienti c community began working in other types of approaches to solve the problem. In this dissertation, we propose a tra c ow classi cation system based on Shallow packet inspection. Given the latest forecasts and current statistical data, which estimates that about 90 % of all tra c will be video in the next few years, we have decided to devote special attention to this speci c type. For this, we proceeded to collect non-sensitive information, with which we perform a statistical study based on low-level statistics. The results obtained from this study were analysed from a behavioural point of view, in order to reach the extraction of coherent rules that allow the di erentiation of independent types of tra c. Finally, we studied, conceived and test an e cient ow organisation paradigm. The system has been tested and evaluated using packet ood tests. Following to the measurement and examination of results in terms of processing times as well as the use of main memory.A evolução e crescimento da Internet tem levado a uma crescente preocupação tendo em vista a alocação dinâmica de recursos em redes de grande dimensão, assim como uma adopção sem precedente de politicas de segurança baseadas em classi ficação de tráfego. Este fenómeno desencadeou a criação de mecanismos de inspecção profunda de pacotes onde se assiste a um acesso transversal, que assenta na obtenção de sequências de bytes especificas, presentes no Payload de cada pacote, o que levanta uma série de limitações técnicas, éticas e potencialmente legais. Com a crescente necessidade de desenvolvimento de mecanismos de inspecção menos invasivos e mais e cientes em termos de velocidade e potencialmente gestão de memória, a comunidade cientifi ca começou a trabalhar em outros tipos de abordagem ao problema. Nesta dissertação, propomos um sistema de classi cação de fluxos de trafego que assenta em Shallow packet inspection. Tendo em conta as ultimas previsões e dados estatísticos atuais, que estimam que cerca de 90% de todo tráfego na Internet, seja do tipo vídeo nos próximos anos, decidimos dedicar especial atenção sobre esse tipo especifico. Para isso, procedemos a recolha de informação não sensível, com a qual efetuamos um estudo estatístico baseado em estatísticas de baixo nivel. Os resultados obtidos nesse estudo, foram analisados de um ponto de vista comportamental, por forma a alcançar uma prova de conceito na extracção de regras coerentes que permitam diferenciar tipos de tráfego independentes. Por fim, estudamos, concebemos e testamos um paradigma de organizaçao de fluxos de forma e ciente. O sistema foi testado e avaliado recorrendo a testes de inundação por pacotes, seguidos da medição e avaliação dos resultados em termos de tempo de processamento, assim como, ao uso de memoria principal

    Architectural Enhancements for Data Transport in Datacenter Systems

    Full text link
    Datacenter systems run myriad applications, which frequently communicate with each other and/or Input/Output (I/O) devices—including network adapters, storage devices, and accelerators. Due to the growing speed of I/O devices and the emergence of microservice-based programming models, the I/O software stacks have become a critical factor in end-to-end communication performance. As such, I/O software stacks have been evolving rapidly in recent years. Datacenters rely on fast, efficient “Software Data Planes”, which orchestrate data transfer between applications and I/O devices. The goal of this dissertation is to enhance the performance, efficiency, and scalability of software data planes by diagnosing their existing issues and addressing them through hardware-software solutions. In the first step, I characterize challenges of modern software data planes, which bypass the operating system kernel to avoid associated overheads. Since traditional interrupts and system calls cannot be delivered to user code without kernel assistance, kernel-bypass data planes use spinning cores on I/O queues to identify work/data arrival. Spin-polling obviously wastes CPU cycles on checking empty queues; however, I show that it entails even more drawbacks: (1) Full-tilt spinning cores perform more (useless) polling work when there is less work pending in the queues. (2) Spin-polling scales poorly with the number of polled queues due to processor cache capacity constraints, especially when traffic is unbalanced. (3) Spin-polling also scales poorly with the number of cores due to the overhead of polling and operation rate limits. (4) Whereas shared queues can mitigate load imbalance and head-of-line blocking, synchronization overheads of spinning on them limit their potential benefits. Next, I propose a notification accelerator, dubbed HyperPlane, which replaces spin-polling in software data planes. Design principles of HyperPlane are: (1) not iterating on empty I/O queues to find work/data in ready ones, (2) blocking/halting when all queues are empty rather than spinning fruitlessly, and (3) allowing multiple cores to efficiently monitor a shared set of queues. These principles lead to queue scalability, work proportionality, and enjoying theoretical merits of shared queues. HyperPlane is realized with a programming model front-end and a hardware microarchitecture back-end. Evaluation of HyperPlane shows its significant advantage in terms of throughput, average/tail latency, and energy efficiency over a state-of-the-art spin-polling-based software data plane, with very small power and area overheads. Finally, I focus on the data transfer aspect in software data planes. Cache misses incurred by accessing I/O data are a major bottleneck in software data planes. Despite considerable efforts put into delivering I/O data directly to the last-level cache, some access latency is still exposed. Cores cannot prefetch such data to nearer caches in today's systems because of the complex access pattern of data buffers and the lack of an appropriate notification mechanism that can trigger the prefetch operations. As such, I propose HyperData, a data transfer accelerator based on targeted prefetching. HyperData prefetches exact (rather than predicted) data buffers (or a required subset to avoid cache pollution) to the L1 cache of the consumer core at the right time. Prefetching can be done for both core-peripheral and core-core communications. HyperData's prefetcher is programmable and supports various queue formats—namely, direct (regular), indirect (Virtio), and multi-consumer queues. I show that with a minor overhead, HyperData effectively hides data access latency in software data planes, thereby improving both application- and system-level performance and efficiency.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169826/1/hosseing_1.pd

    Doctor of Philosophy

    Get PDF
    dissertationThis dissertation explores three key facets of software algorithms for custom hardware ray tracing: primitive intersection, shading, and acceleration structure construction. For the first, primitive intersection, we show how nearly all of the existing direct three-dimensional (3D) ray-triangle intersection tests are mathematically equivalent. Based on this, a genetic algorithm can automatically tune a ray-triangle intersection test for maximum speed on a particular architecture. We also analyze the components of the intersection test to determine how much floating point precision is required and design a numerically robust intersection algorithm. Next, for shading, we deconstruct Perlin noise into its basic parts and show how these can be modified to produce a gradient noise algorithm that improves the visual appearance. This improved algorithm serves as the basis for a hardware noise unit. Lastly, we show how an existing bounding volume hierarchy can be postprocessed using tree rotations to further reduce the expected cost to traverse a ray through it. This postprocessing also serves as the basis for an efficient update algorithm for animated geometry. Together, these contributions should improve the efficiency of both software- and hardware-based ray tracers

    Microarchitectural techniques to reduce energy consumption in the memory hierarchy

    Get PDF
    This thesis states that dynamic profiling of the memory reference stream can improve energy and performance in the memory hierarchy. The research presented in this theses provides multiple instances of using lightweight hardware structures to profile the memory reference stream. The objective of this research is to develop microarchitectural techniques to reduce energy consumption at different levels of the memory hierarchy. Several simple and implementable techniques were developed as a part of this research. One of the techniques identifies and eliminates redundant refresh operations in DRAM and reduces DRAM refresh power. Another, reduces leakage energy in L2 and higher level caches for multiprocessor systems. The emphasis of this research has been to develop several techniques of obtaining energy savings in caches using a simple hardware structure called the counting Bloom filter (CBF). CBFs have been used to predict L2 cache misses and obtain energy savings by not accessing the L2 cache on a predicted miss. A simple extension of this technique allows CBFs to do way-estimation of set associative caches to reduce energy in cache lookups. Another technique using CBFs track addresses in a Virtual Cache and reduce false synonym lookups. Finally this thesis presents a technique to reduce dynamic power consumption in level one caches using significance compression. The significant energy and performance improvements demonstrated by the techniques presented in this thesis suggest that this work will be of great value for designing memory hierarchies of future computing platforms.Ph.D.Committee Chair: Lee, Hsien-Hsin S.; Committee Member: Cahtterjee,Abhijit; Committee Member: Mukhopadhyay, Saibal; Committee Member: Pande, Santosh; Committee Member: Yalamanchili, Sudhaka

    Codage réseau pour des applications multimédias avancées

    Get PDF
    Network coding is a paradigm that allows an efficient use of the capacity of communication networks. It maximizes the throughput in a multi-hop multicast communication and reduces the delay. In this thesis, we focus our attention to the integration of the network coding framework to multimedia applications, and in particular to advanced systems that provide enhanced video services to the users. Our contributions concern several instances of advanced multimedia communications: an efficient framework for transmission of a live stream making joint use of network coding and multiple description coding; a novel transmission strategy for lossy wireless networks that guarantees a trade-off between loss resilience and short delay based on a rate-distortion optimized scheduling of the video frames, that we also extended to the case of interactive multi-view streaming; a distributed social caching system that, using network coding in conjunction with the knowledge of the users' preferences in terms of views, is able to select a replication scheme such that to provide a high video quality by accessing only other members of the social group without incurring the access cost associated with a connection to a central server and without exchanging large tables of metadata to keep track of the replicated parts; and, finally, a study on using blind source separation techniques to reduce the overhead incurred by network coding schemes based on error-detecting techniques such as parity coding and message digest generation. All our contributions are aimed at using network coding to enhance the quality of video transmission in terms of distortion and delay perceivedLe codage réseau est un paradigme qui permet une utilisation efficace du réseau. Il maximise le débit dans un réseau multi-saut en multicast et réduit le retard. Dans cette thèse, nous concentrons notre attention sur l’intégration du codage réseau aux applications multimédias, et en particulier aux systèmes avancès qui fournissent un service vidéo amélioré pour les utilisateurs. Nos contributions concernent plusieurs scénarios : un cadre de fonctions efficace pour la transmission de flux en directe qui utilise à la fois le codage réseau et le codage par description multiple, une nouvelle stratégie de transmission pour les réseaux sans fil avec perte qui garantit un compromis entre la résilience vis-à-vis des perte et la reduction du retard sur la base d’une optimisation débit-distorsion de l'ordonnancement des images vidéo, que nous avons également étendu au cas du streaming multi-vue interactive, un système replication sociale distribuée qui, en utilisant le réseau codage en relation et la connaissance des préférences des utilisateurs en termes de vue, est en mesure de sélectionner un schéma de réplication capable de fournir une vidéo de haute qualité en accédant seulement aux autres membres du groupe social, sans encourir le coût d’accès associé à une connexion à un serveur central et sans échanger des larges tables de métadonnées pour tenir trace des éléments répliqués, et, finalement, une étude sur l’utilisation de techniques de séparation aveugle de source -pour réduire l’overhead encouru par les schémas de codage réseau- basé sur des techniques de détection d’erreur telles que le codage de parité et la génération de message digest
    • …
    corecore