189 research outputs found
ADVANCED HASHING SCHEMES FOR PACKETFORWARDING USING SET ASSOCIATIVEMEMORY ARCHITECTURES
Building a high performance IP packet forwarding (PF) engine remains a challenge due to increasingly stringent throughput requirements and the growing sizes of IP forwarding tables.The router has to match the incoming packet's IP address against the forwarding table.The matching process has to be done in wire speed which is why scalability and low power consumption are features that PF engines must maintain.It is common for PF engines to use hash tables; however, the classic hashing downsides have to be dealt with (e.g., collisions, worst case memory access time, ... etc.).While open addressing hash tables, in general, provide good average case search performance, their memory utilization and worst case performance can degrade quickly due to collisions that leads to bucket overflows.Set associative memory can be used for hardware implementations of hash tables with the property that each bucket of a hash table can be searched in one memory cycle.Hence, PF engine architectures based on associative memory will outperform those based on the conventional Ternary Content Addressable Memory (TCAM) in terms of power and scalability.The two standard solutions to the overflow problem are either to use some sort of predefined probing (e.g., linear or quadratic) or to use multiple hash functions.This work presents two new hash schemes that extend both aforementioned solutions to tackle the overflow problem efficiently.The first scheme is a hash probing scheme that is called Content-based HAsh Probing, or CHAP.CHAP is a probing scheme that is based on the content of the hash table to avoid the classical side effects of predefined hash probing methods (i.e., primary and secondary clustering phenomena) and at the same time reduces the overflow.The second scheme, called Progressive Hashing, or PH, is a general multiple hash scheme that reduces the overflow as well.PH splits the prefixes into groups where each group is assigned one hash function, then reuse some hash functions in a progressive fashion to reduce the overflow.We show by experimenting with real IP lookup tables that both schemes outperform other hashing schemes
High-Performance Packet Processing Engines Using Set-Associative Memory Architectures
The emergence of new optical transmission technologies has led to ultra-high Giga bits per second (Gbps) link speeds.
In addition, the switch from 32-bit long IPv4 addresses to the 128-bit long IPv6 addresses is currently progressing.
Both factors make it hard for new Internet routers and firewalls to keep up with wire-speed packet-processing.
By packet-processing we mean three applications: packet forwarding, packet classification and deep packet inspection.
In packet forwarding (PF), the router has to match the incoming packet's IP address against the forwarding table.
It then directs each packet to its next hop toward its final destination.
A packet classification (PC) engine examines a packet header by matching it against a database of rules, or filters, to obtain the best matching rule.
Rules are associated with either an ``action'' (e.g., firewall) or a ``flow ID'' (e.g., quality of service or QoS).
The last application is deep packet inspection (DPI) where the firewall has to inspect the actual packet payload for malware or network attacks.
In this case, the payload is scanned against a database of rules, where each rule is either a plain text string or a regular expression.
In this thesis, we introduce a family of hardware solutions that combine the above requirements.
These solutions rely on a set-associative memory architecture that is called CA-RAM (Content Addressable-Random Access Memory).
CA-RAM is a hardware implementation of hash tables with the property that each bucket of a hash table can be searched in one memory cycle.
However, the classic hashing downsides have to be dealt with, such as collisions that lead to overflow and worst-case memory access time.
The two standard solutions to the overflow problem are either to use some predefined probing (e.g., linear or quadratic) or to use multiple hash functions.
We present new hash schemes that extend both aforementioned solutions to tackle the overflow problem efficiently.
We show by experimenting with real IP lookup tables, synthetic packet classification rule sets and real DPI databases that our schemes outperform other previously proposed schemes
CHAP : Enabling efficient hardware-based multiple hash schemes for IP lookup
Building a high performance IP lookup engine remains a challenge due to increasingly stringent throughput requirements and the growing size of IP tables. An emerging approach for IP lookup is the use of set associative memory architecture, which is basically a hardware implementation of an open addressing hash table with the property that each row of the hash table can be searched in one memory cycle. While open addressing hash tables, in general, provide good average-case search performance, their memory utilization and worst-case performance can degrade quickly due to bucket overflows. This paper presents a new simple hash probing scheme called CHAP (Content-based HAsh Probing) that tackles the hash overflow problem. In CHAP, the probing is based on the content of the hash table, thus avoiding the classical side effects of probing. We show through experimenting with real IP tables how CHAP can effectively deal with the overflow. © IFIP International Federation for Information Processing 2009
Recommended from our members
Scalable hardware memory disambiguation
This dissertation deals with one of the long-standing problems in Computer Architecture
– the problem of memory disambiguation. Microprocessors typically reorder
memory instructions during execution to improve concurrency. Such microprocessors
use hardware memory structures for memory disambiguation, known as LoadStore
Queues (LSQs), to ensure that memory instruction dependences are satisfied
even when the memory instructions execute out-of-order. A typical LSQ implementation
(circa 2006) holds all in-flight memory instructions in a physically centralized
LSQ and performs a fully associative search on all buffered instructions to ensure
that memory dependences are satisfied. These LSQ implementations do not scale
because they use large, fully associative structures, which are known to be slow and
power hungry. The increasing trend towards distributed microarchitectures further
exacerbates these problems. As on-chip wire delays increase and high-performance
processors become necessarily distributed, centralized structures such as the LSQ
can limit scalability.
This dissertation describes techniques to create scalable LSQs in both centralized
and distributed microarchitectures. The problems and solutions described
in this thesis are motivated and validated by real system designs. The dissertation
starts with a description of the partitioned primary memory system of the TRIPS
processor, of which the LSQ is an important component, and then through a series
of optimizations describes how the power, area, and centralization problems
of the LSQ can be solved with minor performance losses (if at all) even for large
number of in flight memory instructions. The four solutions described in this dissertation
— partitioning, filtering, late binding and efficient overflow management —
enable power-, area-efficient, distributed and scalable LSQs, which in turn enable
aggressive large-window processors capable of simultaneously executing thousands
of instructions.
To mitigate the power problem, we replaced the power-hungry, fully associative
search with a power-efficient hash table lookup using a simple address-based
Bloom filter. Bloom filters are probabilistic data structures used for testing set
membership and can be used to quickly check if an instruction with the same data
address is likely to be found in the LSQ without performing the associative search.
Bloom filters typically eliminate more than 80% of the associative searches and they
are highly effective because in most programs, it is uncommon for loads and stores
to have the same data address and be in execution simultaneously.
To rectify the area problem, we observe the fact that only a small fraction
of all memory instructions are dependent, that only such dependent instructions
need to be buffered in the LSQ, and that these instructions need to be in the LSQ
only for certain parts of the pipelined execution. We propose two mechanisms to
exploit these observations. The first mechanism, area filtering, is a hardware mechanism
that couples Bloom filters and dependence predictors to dynamically identify
and buffer only those instructions which are likely to be dependent. The second
mechanism, late binding, reduces the occupancy and hence size of the LSQ. Both of
these optimizations allows the number of LSQ slots to be reduced by up to one-half
compared to a traditional organization without any performance degradation.
Finally, we describe a new decentralized LSQ design for handling LSQ structural
hazards in distributed microarchitectures. Decentralization of LSQs, and to
a large extent distributed microarchitectures with memory speculation, has proved
to be impractical because of the high performance penalties associated with the
mechanisms for dealing with hazards. To solve this problem, we applied classic
flow-control techniques from interconnection networks for handling resource con-
flicts. The first method, memory-side buffering, buffers the overflowing instructions
in a separate buffer near the LSQs. The second scheme, execution-side NACKing,
sends the overflowing instruction back to the issue window from which it is later
re-issued. The third scheme, network buffering, uses the buffers in the interconnection
network between the execution units and memory to hold instructions when the
LSQ is full, and uses virtual channel flow control to avoid deadlocks. The network
buffering scheme is the most robust of all the overflow schemes and shows less than
1% performance degradation due to overflows for a subset of SPEC CPU 2000 and
EEMBC benchmarks on a cycle-accurate simulator that closely models the TRIPS
processor.
The techniques proposed in this dissertation are independent, architectureneutral
and their cumulative benefits result in LSQs that can be partitioned at a
fine granularity and have low design complexity. Each of these partitions selectively
buffers only memory instructions with true dependences and can be closely coupled
with the execution units thus minimizing power, area, and latency. Such LSQ
designs with near-ideal characteristics are well suited for microarchitectures with
thousands of instructions in-flight and may enable even more aggressive microarchitectures
in the future.Computer Science
Branch Prediction For Network Processors
Originally designed to favour flexibility over packet processing performance, the future of the programmable network processor is challenged by the need to meet both increasing line rate as well as providing additional processing capabilities. To meet these requirements, trends within networking research has tended to focus on techniques such as offloading computation intensive tasks to dedicated hardware logic or through increased parallelism. While parallelism retains flexibility, challenges such as load-balancing limit its scope. On the other hand, hardware offloading allows complex algorithms to be implemented at high speed but sacrifice flexibility. To this end, the work in this thesis is focused on a more fundamental aspect of a network processor, the data-plane processing engine.
Performing both system modelling and analysis of packet processing functions; the goal of this thesis is to identify and extract salient information regarding the performance of multi-processor workloads. Following on from a traditional software based analysis of programme workloads, we develop a method of modelling and analysing hardware accelerators when applied to network processors. Using this quantitative information, this thesis proposes an architecture which allows deeply pipelined micro-architectures to be implemented on the data-plane while reducing the branch penalty associated with these architectures
Algorithms and Architectures for Network Search Processors
The continuous growth in the Internet’s size, the amount of data traffic, and the complexity of processing this traffic gives rise to new challenges in building high-performance network devices. One of the most fundamental tasks performed by these devices is searching the network data for predefined keys. Address lookup, packet classification, and deep packet inspection are some of the operations which involve table lookups and searching. These operations are typically part of the packet forwarding mechanism, and can create a performance bottleneck. Therefore, fast and resource efficient algorithms are required. One of the most commonly used techniques for such searching operations is the Ternary Content Addressable Memory (TCAM). While TCAM can offer very fast search speeds, it is costly and consumes a large amount of power. Hence, designing cost-effective, power-efficient, and high-speed search techniques has received a great deal of attention in the research and industrial community. In this thesis, we propose a generic search technique based on Bloom filters. A Bloom filter is a randomized data structure used to represent a set of bit-strings compactly and support set membership queries. We demonstrate techniques to convert the search process into table lookups. The resulting table data structures are kept in the off-chip memory and their Bloom filter representations are kept in the on-chip memory. An item needs to be looked up in the off-chip table only when it is found in the on-chip Bloom filters. By filtering the off-chip memory accesses in this fashion, the search operations can be significantly accelerated. Our approach involves a unique combination of algorithmic and architectural techniques that outperform some of the current techniques in terms of cost-effectiveness, speed, and power-efficiency
Software-Driven and Virtualized Architectures for Scalable 5G Networks
In this dissertation, we argue that it is essential to rearchitect 4G cellular core networks–sitting between the Internet and the radio access network–to meet the scalability, performance, and flexibility requirements of 5G networks. Today, there is a growing consensus among operators and research community that software-defined networking (SDN), network function virtualization (NFV), and mobile edge computing (MEC) paradigms will be the key ingredients of the next-generation cellular networks. Motivated by these trends, we design and optimize three core network architectures, SoftMoW, SoftBox, and SkyCore, for different network scales, objectives, and conditions. SoftMoW provides global control over nationwide core networks with the ultimate goal of enabling new routing and mobility optimizations. SoftBox attempts to enhance policy enforcement in statewide core networks to enable low-latency, signaling-efficient, and customized services for mobile devices. Sky- Core is aimed at realizing a compact core network for citywide UAV-based radio networks that are going to serve first responders in the future. Network slicing techniques make it possible to deploy these solutions on the same infrastructure in parallel. To better support mobility and provide verifiable security, these architectures can use an addressing scheme that separates network locations and identities with self-certifying, flat and non-aggregatable address components. To benefit the proposed architectures, we designed a high-speed and memory-efficient router, called Caesar, for this type of addressing schemePHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/146130/1/moradi_1.pd
Fully Programming the Data Plane: A Hardware/Software Approach
Les réseaux définis par logiciel — en anglais Software-Defined Networking (SDN) — sont apparus ces dernières années comme un nouveau paradigme de réseau. SDN introduit une séparation entre les plans de gestion, de contrôle et de données, permettant à ceux-ci d’évoluer de manière indépendante, rompant ainsi avec la rigidité des réseaux traditionnels. En particulier, dans le plan de données, les avancées récentes ont porté sur la définition des langages
de traitement de paquets, tel que P4, et sur la définition d’architectures de commutateurs programmables, par exemple la Protocol Independent Switch Architecture (PISA). Dans cette thèse, nous nous intéressons a l’architecture PISA et évaluons comment exploiter les FPGA comme plateforme de traitement efficace de paquets. Cette problématique est
étudiée a trois niveaux d’abstraction : microarchitectural, programmation et architectural. Au niveau microarchitectural, nous avons proposé une architecture efficace d’un analyseur d’entêtes de paquets pour PISA. L’analyseur de paquets utilise une architecture pipelinée avec propagation en avant — en anglais feed-forward. La complexité de l’architecture est réduite par rapport à l’état de l’art grâce a l’utilisation d’optimisations algorithmiques. Finalement, l’architecture est générée par un compilateur P4 vers C++, combiné à un outil de synthèse de haut niveau. La solution proposée atteint un débit de 100 Gb/s avec une latence comparable à celle d’analyseurs d’entêtes de paquets écrits à la main. Au niveau de la programmation, nous avons proposé une nouvelle méthodologie de conception de synthèse de haut niveau visant à améliorer conjointement la qualité logicielle et matérielle. Nous exploitons les fonctionnalités du C++ moderne pour améliorer à la fois la modularité et la lisibilité du code, tout en conservant (ou améliorant) les résultats du matériel généré.
Des exemples de conception utilisant notre méthodologie, incluant pour l’analyseur d’entête de paquets, ont été rendus publics.----------ABSTRACT: Software-Defined Networking (SDN) has emerged in recent years as a new network paradigm to de-ossify communication networks. Indeed, by offering a clear separation of network concerns
between the management, control, and data planes, SDN allows each of these planes to evolve independently, breaking the rigidity of traditional networks. However, while well
spread in the control and management planes, this de-ossification has only recently reached the data plane with the advent of packet processing languages, e.g. P4, and novel programmable switch architectures, e.g. Protocol Independent Switch Architecture (PISA). In this work, we focus on leveraging the PISA architecture by mainly exploiting the FPGA capabilities for efficient packet processing. In this way, we address this issue at different
abstraction levels: i) microarchitectural; ii) programming; and, iii) architectural. At the microarchitectural level, we have proposed an efficient FPGA-based packet parser
architecture, which is a major PISA’s component. The proposed packet parser follows a feedforward
pipeline architecture in which the internal microarchitectural has been meticulously optimized for FPGA implementation. The architecture is automatically generated by a P4- to-C++ compiler after several rounds of graph optimizations. The proposed solution achieves 100 Gb/s line rate with latency comparable to hand-written packet parsers. The throughput scales from 10 Gb/s to 160 Gb/s with moderate increase in resource consumption. Both the compiler and the packet parser codebase have been open-sourced to permit reproducibility. At the programming level, we have proposed a novel High-Level Synthesis (HLS) design methodology aiming at improving software and hardware quality. We have employed this novel methodology when designing the packet parser. In our work, we have exploited features of modern C++ that improves at the same time code modularity and readability while keeping (or improving) the results of the generated hardware. Design examples using our methodology have been publicly released
- …