20 research outputs found

    Efficient hardware architecture for fast IP address lookup

    Get PDF
    A multigigabit IP router may receive several millions packets per second from each input link. For each packet, the router needs to find the longest matching prefix in the forwarding table in order to determine the packet's next-hop. In this paper, we present an efficient hardware solution for the IP address lookup problem. We model the address lookup problem as a searching problem on a binary-trie. The binary-trie is partitioned into four levels of fixed size 255-node subtrees. We employ a hierarchical indexing structure to facilitate direct access to subtrees in a given level. It is estimated that a forwarding table with 40K prefixes will consume 2.5Mbytes of memory. The searching is implemented using a hardware pipeline with a minimum cycle of 12.5ns if the memory modules are implemented using SRAM. A distinguishing feature of our design is that forwarding table entries are not replicated in the data structure. Hence, table updates can be done in constant time with only a few memory accesses.published_or_final_versio

    MLET: A Power Efficient Approach for TCAM Based, IP Lookup Engines in Internet Routers

    Full text link
    Routers are one of the important entities in computer networks specially the Internet. Forwarding IP packets is a valuable and vital function in Internet routers. Routers extract destination IP address from packets and lookup those addresses in their own routing table. This task is called IP lookup. Internet address lookup is a challenging problem due to the increasing routing table sizes. Ternary Content-Addressable Memories (TCAMs) are becoming very popular for designing high-throughput address lookup-engines on routers: they are fast, cost-effective and simple to manage. Despite the TCAMs speed, their high power consumption is their major drawback. In this paper, Multilevel Enabling Technique (MLET), a power efficient TCAM based hardware architecture has been proposed. This scheme is employed after an Espresso-II minimization algorithm to achieve lower power consumption. The performance evaluation of the proposed approach shows that it can save considerable amount of routing table's power consumption.Comment: 14 Pages, IJCNC 201

    Flexible LDPC Decoder Architectures

    Get PDF
    Flexible channel decoding is getting significance with the increase in number of wireless standards and modes within a standard. A flexible channel decoder is a solution providing interstandard and intrastandard support without change in hardware. However, the design of efficient implementation of flexible low-density parity-check (LDPC) code decoders satisfying area, speed, and power constraints is a challenging task and still requires considerable research effort. This paper provides an overview of state-of-the-art in the design of flexible LDPC decoders. The published solutions are evaluated at two levels of architectural design: the processing element (PE) and the interconnection structure. A qualitative and quantitative analysis of different design choices is carried out, and comparison is provided in terms of achieved flexibility, throughput, decoding efficiency, and area (power) consumption

    A DRAM/SRAM memory scheme for fast packet buffers

    Get PDF
    We address the design of high-speed packet buffers for Internet routers. We use a general DRAM/SRAM architecture for which previous proposals can be seen as particular cases. For this architecture, large SRAMs are needed to sustain high line rates and a large number of interfaces. A novel algorithm for DRAM bank allocation is presented that reduces the SRAM size requirements of previously proposed schemes by almost an order of magnitude, without having memory fragmentation problems. A technological evaluation shows that our design can support thousands of queues for line rates up to 160 Gbps.Peer ReviewedPostprint (published version

    High performance communication on reconfigurable clusters

    Get PDF
    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the us/day range with a commodity cloud

    Mapping applications onto FPGA-centric clusters

    Full text link
    High Performance Computing (HPC) is becoming increasingly important throughout science and engineering as ever more complex problems must be solved through computational simulations. In these large computational applications, the latency of communication between processing nodes is often the key factor that limits performance. An emerging alternative computer architecture that addresses the latency problem is the FPGA-centric cluster (FCC); in these systems, the devices (FPGAs) are directly interconnected and thus many layers of hardware and software are avoided. The result can be scalability not currently achievable with other technologies. In FCCs, FPGAs serve multiple functions: accelerator, network interface card (NIC), and router. Moreover, because FPGAs are configurable, there is substantial opportunity to tailor the router hardware to the application; previous work has demonstrated that such application-aware configuration can effect a substantial improvement in hardware efficiency. One constraint of FCCs is that it is convenient for their interconnect to be static, direct, and have a two or three dimensional mesh topology. Thus, applications that are naturally of a different dimensionality (have a different logical topology) from that of the FCC must be remapped to obtain optimal performance. In this thesis we study various aspects of the mapping problem for FCCs. There are two major research thrusts. The first is finding the optimal mapping of logical to physical topology. This problem has received substantial attention by both the theory community, where topology mapping is referred to as graph embedding, and by the High Performance Computing (HPC) community, where it is a question of process placement. We explore the implications of the different mapping strategies on communication behavior in FCCs, especially on resulting load imbalance. The second major research thrust is built around the hypothesis that applications that need to be remapped (due to differing logical and physical topologies) will have different optimal router configurations from those applications that do not. For example, due to remapping, some virtual or physical communication links may have little occupancy; therefore fewer resources should be allocated to them. Critical here is the creation of a new set of parameterized hardware features that can be configured to best handle load imbalances caused by remapping. These two thrusts form a codesign loop: certain mapping algorithms may be differentially optimal due to application-aware router reconfiguration that accounts for this mapping. This thesis has four parts. The first part introduces the background and previous work related to communication in general and, in particular, how it is implemented in FCCs. We build on previous work on application-aware router configuration. The second part introduces topology mapping mechanisms including those derived from graph embeddings and a greedy algorithm commonly used in HPC. In the third part, topology mappings are evaluated for performance and imbalance; we note that different mapping strategies lead to different imbalances both in the overall network and in each node. The final part introduces reconfigure router design that allocates resources based on different imbalance situations caused by different mapping behaviors

    On the Edge of Secure Connectivity via Software-Defined Networking

    Get PDF
    Securing communication in computer networks has been an essential feature ever since the Internet, as we know it today, was started. One of the best known and most common methods for secure communication is to use a Virtual Private Network (VPN) solution, mainly operating with an IP security (IPsec) protocol suite originally published in 1995 (RFC1825). It is clear that the Internet, and networks in general, have changed dramatically since then. In particular, the onset of the Cloud and the Internet-of-Things (IoT) have placed new demands on secure networking. Even though the IPsec suite has been updated over the years, it is starting to reach the limits of its capabilities in its present form. Recent advances in networking have thrown up Software-Defined Networking (SDN), which decouples the control and data planes, and thus centralizes the network control. SDN provides arbitrary network topologies and elastic packet forwarding that have enabled useful innovations at the network level. This thesis studies SDN-powered VPN networking and explains the benefits of this combination. Even though the main context is the Cloud, the approaches described here are also valid for non-Cloud operation and are thus suitable for a variety of other use cases for both SMEs and large corporations. In addition to IPsec, open source TLS-based VPN (e.g. OpenVPN) solutions are often used to establish secure tunnels. Research shows that a full-mesh VPN network between multiple sites can be provided using OpenVPN and it can be utilized by SDN to create a seamless, resilient layer-2 overlay for multiple purposes, including the Cloud. However, such a VPN tunnel suffers from resiliency problems and cannot meet the increasing availability requirements. The network setup proposed here is similar to Software-Defined WAN (SD-WAN) solutions and is extremely useful for applications with strict requirements for resiliency and security, even if best-effort ISP is used. IPsec is still preferred over OpenVPN for some use cases, especially by smaller enterprises. Therefore, this research also examines the possibilities for high availability, load balancing, and faster operational speeds for IPsec. We present a novel approach involving the separation of the Internet Key Exchange (IKE) and the Encapsulation Security Payload (ESP) in SDN fashion to operate from separate devices. This allows central management for the IKE while several separate ESP devices can concentrate on the heavy processing. Initially, our research relied on software solutions for ESP processing. Despite the ingenuity of the architectural concept, and although it provided high availability and good load balancing, there was no anti-replay protection. Since anti-replay protection is vital for secure communication, another approach was required. It thus became clear that the ideal solution for such large IPsec tunneling would be to have a pool of fast ESP devices, but to confine the IKE operation to a single centralized device. This would obviate the need for load balancing but still allow high availability via the device pool. The focus of this research thus turned to the study of pure hardware solutions on an FPGA, and their feasibility and production readiness for application in the Cloud context. Our research shows that FPGA works fluently in an SDN network as a standalone IPsec accelerator for ESP packets. The proposed architecture has 10 Gbps throughput, yet the latency is less than 10 µs, meaning that this architecture is especially efficient for data center use and offers increased performance and latency requirements. The high demands of the network packet processing can be met using several different approaches, so this approach is not just limited to the topics presented in this thesis. Global network traffic is growing all the time, so the development of more efficient methods and devices is inevitable. The increasing number of IoT devices will result in a lot of network traffic utilising the Cloud infrastructures in the near future. Based on the latest research, once SDN and hardware acceleration have become fully integrated into the Cloud, the future for secure networking looks promising. SDN technology will open up a wide range of new possibilities for data forwarding, while hardware acceleration will satisfy the increased performance requirements. Although it still remains to be seen whether SDN can answer all the requirements for performance, high availability and resiliency, this thesis shows that it is a very competent technology, even though we have explored only a minor fraction of its capabilities

    A Modular Approach to Adaptive Reactive Streaming Systems

    Get PDF
    The latest generations of FPGA devices offer large resource counts that provide the headroom to implement large-scale and complex systems. However, there are increasing challenges for the designer, not just because of pure size and complexity, but also in harnessing effectively the flexibility and programmability of the FPGA. A central issue is the need to integrate modules from diverse sources to promote modular design and reuse. Further, the capability to perform dynamic partial reconfiguration (DPR) of FPGA devices means that implemented systems can be made reconfigurable, allowing components to be changed during operation. However, use of DPR typically requires low-level planning of the system implementation, adding to the design challenge. This dissertation presents ReShape: a high-level approach for designing systems by interconnecting modules, which gives a ‘plug and play’ look and feel to the designer, is supported by tools that carry out implementation and verification functions, and is carried through to support system reconfiguration during operation. The emphasis is on the inter-module connections and abstracting the communication patterns that are typical between modules – for example, the streaming of data that is common in many FPGA-based systems, or the reading and writing of data to and from memory modules. ShapeUp is also presented as the static precursor to ReShape. In both, the details of wiring and signaling are hidden from view, via metadata associated with individual modules. ReShape allows system reconfiguration at the module level, by supporting type checking of replacement modules and by managing the overall system implementation, via metadata associated with its FPGA floorplan. The methodology and tools have been implemented in a prototype for a broad domain-specific setting – networking systems – and have been validated on real telecommunications design projects

    Paketverarbeitende Systeme – Algorithmen und Architekturen für hohe Verarbeitungsgeschwindigkeiten

    Get PDF
    Paketverarbeitende Systeme basieren auf zwei wichtigen Bereichen: der eigentlichen Paketverarbeitung und der Paketklassifizierung. Zur Optimierung der Paketverarbeitung wurde eine FPGA-basierte Architektur für funktionale Module entwickelt, auf deren Basis verschiedene Funktionen für den Einsatz im Teilnehmerzugangsnetzwerk realisiert wurden. Hash-basierte Klassifizierungsalgorithmen und –architekturen sind grundsätzlich für die Paketklassifizierung geeignet. Zur Optimierung dieser Klasse von Algorithmen wurde eine evolvierbare Hashfunktionsarchitektur entwickelt.Packet processing systems base on two important areas: ultimate packet and packet classification. For optimization of packet processing an FPGA-based architecture for functional modules has been developed. For application in Access Network, different functional modules have been developed. Hash-based classification algorithms and architectures are appropriate for fast packet classification. For optimization of this class of algorithms, an evolvable architecture for hash functions has been developed
    corecore