On the Exploration of FPGAs and High-Level Synthesis Capabilities on Multi-Gigabit-per-Second Networks
Unpublished doctoral thesis read at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Defense date: 24-01-2020.
Traffic on computer networks has grown exponentially in recent years.
Both links and communication equipment have had to adapt in order to provide the minimum quality of service required for current needs. However, in recent years, several factors have prevented commercial off-the-shelf hardware from keeping pace with this growth rate; consequently, some software tools are struggling to fulfill their tasks, especially at speeds higher than 10 Gbit/s. For this reason, Field Programmable Gate Arrays (FPGAs) have arisen as an alternative for addressing the most demanding tasks without the need to design an application-specific integrated circuit, thanks in part to their flexibility and in-field programmability. Needless to say, developing for FPGAs is notoriously complex. Therefore, in this thesis we tackle
the use of FPGAs and High-Level Synthesis (HLS) languages in the context of computer
networks. We focus on the use of FPGAs both in computer network monitoring applications and in reliable data transmission at very high speed. Furthermore, we intend to shed light on the use of high-level synthesis languages and to boost FPGA applicability in the context of computer networks so as to reduce development time and design complexity.
The first part of the thesis is devoted to computer network monitoring. We take advantage of FPGA determinism to implement active monitoring probes, which consist of sending a train of packets that is later used to obtain network parameters. In this case, determinism is key to reducing the uncertainty of the measurements.
The results of our experiments show that the FPGA implementations are considerably more accurate and more precise than their software counterparts. At the same time, the FPGA implementation is scalable in terms of network speed: 1, 10 and 100 Gbit/s. In the context of passive monitoring, we leverage the FPGA architecture to implement algorithms able to thin encrypted traffic as well as to remove duplicate packets. These two algorithms are straightforward in principle, but very useful for helping traditional network analysis tools cope with their task at higher network speeds: on the one hand, processing encrypted traffic brings little benefit; on the other hand, processing duplicate traffic negatively impacts the performance of the software tools.
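The packet-train idea behind the active probes can be sketched as a simple dispersion-based capacity estimate. This is a pure-Python illustration of the principle only, not the thesis's FPGA implementation; the function name and interface are hypothetical:

```python
def estimate_capacity(timestamps, packet_size_bits):
    """Estimate bottleneck capacity from a received packet train.

    Uses the classic packet-dispersion idea: packets sent back to back
    arrive spaced by the bottleneck link's serialization time, so
    capacity is approximately packet size divided by the arrival gap.
    """
    # Inter-arrival gaps between consecutive packets of the train.
    gaps = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    avg_gap = sum(gaps) / len(gaps)  # seconds per packet
    return packet_size_bits / avg_gap  # estimated bits per second
```

The measurement uncertainty the thesis refers to enters through the timestamps: software timestamping adds jitter to each gap, which is exactly what a deterministic FPGA probe avoids.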
The second part of the thesis is devoted to the TCP/IP stack. We explore the current limitations of reliable data transmission using standard software at very high speed. Nowadays, the network is becoming an important bottleneck in fulfilling current needs, particularly in data centers. What is more, in recent years the deployment of 100 Gbit/s network links has started. Consequently, there has been increased scrutiny of how networking functionality is deployed, and a wide range of approaches are currently being explored to increase the efficiency of networks and tailor their functionality to the actual needs of the application at hand. FPGAs arise as the perfect alternative to
deal with this problem. For this reason, in this thesis we develop Limago, an FPGA-based open-source implementation of a TCP/IP stack operating at 100 Gbit/s for Xilinx's FPGAs. Limago not only provides unprecedented throughput, but also a latency at least fifteen times lower than that of software implementations. Limago is a key contribution to some of the hottest topics at the moment, such as network-attached FPGAs and in-network data processing.
TCP/IP acceleration in a cloud-based mobile network
Mobile traffic rates are in constant growth. The currently used technology, long-term evolution (LTE), is already in a mature state and receives only small incremental improvements. However, a new major paradigm shift is needed to support future development. Together with the transition to the fifth generation of mobile telecommunications, companies are moving towards network function virtualization (NFV). By decoupling network functions from the hardware it is possible to achieve lower development and management costs as well as better scalability.
A major change from dedicated hardware to the cloud does not take place without issues. One key challenge is building telecommunications-grade, ultra-low-latency, low-jitter data storage for call session data. Once this challenge is overcome, it enables new ways to build much simpler stateless radio applications.
There are many technologies that can be used to achieve lower latencies in cloud infrastructure. In the future, technologies such as memory-centric computing may revolutionize the whole infrastructure and provide nanosecond latencies. In the short term, however, viable solutions are purely software-based. Examples of these are databases and transport layer protocols optimized for latency. Traffic processing can also be accelerated by using libraries and drivers such as the Data Plane Development Kit (DPDK). However, DPDK does not have transport layer support, so additional frameworks are needed to unleash the potential of Transmission Control Protocol/Internet Protocol (TCP/IP) acceleration.
In this thesis, TCP/IP acceleration is studied as a method for providing ultra-low-latency and low-jitter communications for call session data storage. Two major frameworks, namely VPP and F-Stack, were selected for evaluation. The major finding is that the frameworks are not as mature as expected, and thus they failed to deliver production-ready performance. Building a robust interface for applications to use was recognized as a common problem in the market.
Multilayer Environment and Toolchain for Holistic NetwOrk Design and Analysis
The recent developments and research in distributed ledger technologies and
blockchain have contributed to the increasing adoption of distributed systems.
To collect relevant insights into systems' behavior, many evaluation frameworks focus mainly on the throughput of the system under test. However, these frameworks often lack comprehensiveness and generality, particularly in adopting a cross-layer approach to distributed applications. This work analyses in detail the requirements for distributed systems assessment. We summarize
these findings into a structured methodology and experimentation framework
called METHODA. Our approach emphasizes setting up and assessing a broader
spectrum of distributed systems and addresses a notable research gap. We
showcase the effectiveness of the framework by evaluating four distinct systems
and their interaction, leveraging a diverse set of eight carefully selected
metrics and 12 essential parameters. Through experimentation and analysis, we demonstrate the framework's capability to provide valuable insights across various use cases. For instance, we identify that combining Trusted Execution Environments with the threshold signature scheme FROST introduces minimal performance overhead, with an average latency of around 40 ms. We also showcase that emulating realistic system behavior, e.g., Maximal Extractable Value, is possible and could be used to further model such dynamics. The METHODA framework enables a deeper understanding of distributed systems and is a powerful tool for researchers and practitioners navigating the complex landscape of modern computing infrastructures.
Survey on System I/O Hardware Transactions and Impact on Latency, Throughput, and Other Factors
Computer system I/O has evolved with processor and memory technologies in terms of reducing latency, increasing bandwidth, and other factors. As requirements increase for I/O, such as networking, storage, and video, descriptor-based DMA transactions have become more important in high-performance systems to move data between I/O adapters and system memory buffers. DMA transactions are done with hardware engines below the software protocol abstraction layers in all systems other than rudimentary embedded controllers. CPUs can switch to other tasks by offloading hardware DMA transfers to the I/O adapters. Each I/O interface has one or more separately instantiated descriptor-based DMA engines optimized for a given I/O port. I/O transactions are optimized by accelerator functions to reduce latency, improve throughput, and reduce CPU overhead. This chapter surveys the current state of high-performance I/O architecture advances and explores benefits and limitations. With the proliferation of CPU multi-cores within a system, multi-GB/s ports, and on-die integration of system functions, changes beyond the techniques surveyed may be needed for optimal I/O architecture performance.
This is an author's peer-reviewed final manuscript, as accepted by the publisher. The published article/chapter is copyrighted by Elsevier and can be found at: http://www.elsevier.com/books/advances-in-computers/hurson/978-0-12-420232-0.
Keywords: memory, controllers, processors, DMA, input/output, latency, power, throughput
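The descriptor-ring mechanism the survey describes can be illustrated with a toy software model: software posts descriptors at a head index, and the (here simulated) DMA engine completes them from the tail. All names are hypothetical; real engines use device registers, physical addresses, and doorbell writes rather than Python objects:

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    addr: int = 0       # buffer address (placeholder value)
    length: int = 0     # bytes to transfer
    done: bool = False  # completion flag written back by the engine

class DescriptorRing:
    """Toy model of a descriptor-based DMA ring with head/tail indices."""
    def __init__(self, size=8):
        self.ring = [Descriptor() for _ in range(size)]
        self.head = 0  # next slot software fills
        self.tail = 0  # next slot the engine completes

    def post(self, addr, length):
        # Software produces a descriptor; one slot is kept free so
        # head == tail unambiguously means "empty".
        nxt = (self.head + 1) % len(self.ring)
        if nxt == self.tail:
            raise BufferError("ring full")
        self.ring[self.head] = Descriptor(addr, length)
        self.head = nxt

    def complete_one(self):
        # Simulates the engine finishing the oldest posted descriptor.
        if self.tail == self.head:
            return None  # nothing outstanding
        d = self.ring[self.tail]
        d.done = True
        self.tail = (self.tail + 1) % len(self.ring)
        return d
```

The CPU-offload benefit in the chapter corresponds to the gap between `post` and `complete_one`: software returns immediately after posting, and only inspects completions later (typically on an interrupt or by polling the `done` flag).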
ACCL+: an FPGA-Based Collective Engine for Distributed Applications
FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs
or network-attached accelerators. Despite their potential, developing
distributed FPGA-accelerated applications remains cumbersome due to the lack of
appropriate infrastructure and communication abstractions. To facilitate the
development of distributed applications with FPGAs, in this paper we propose
ACCL+, an open-source versatile FPGA-based collective communication library.
Portable across different platforms and supporting UDP, TCP, as well as RDMA,
ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective
communication. Additionally, it can serve as a collective offload engine for
CPU applications, freeing the CPU from networking tasks. It is user-extensible,
allowing new collectives to be implemented and deployed without having to
re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100
Gb/s networking, comparing its performance against software MPI over RDMA. The
results demonstrate ACCL+'s significant advantages for FPGA-based distributed
applications and highly competitive performance for CPU applications. We
showcase ACCL+'s dual role with two use cases: seamlessly integrating as a
collective offload engine to distribute CPU-based vector-matrix multiplication,
and serving as a crucial and efficient component in designing fully FPGA-based
distributed deep-learning recommendation inference.
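To illustrate the kind of collective primitive a library like this exposes, here is a toy all-reduce across simulated ranks. This is a pure-Python stand-in for the semantics only, not the ACCL+ API; the function name is hypothetical:

```python
def allreduce_sum(buffers):
    """Naive all-reduce: every rank ends with the elementwise sum.

    `buffers` holds one equal-length list per rank. A real collective
    engine pipelines this over the network (e.g., a ring schedule);
    here the reduce and broadcast phases are done in plain loops.
    """
    n = len(buffers[0])
    total = [0] * n
    for buf in buffers:              # reduce phase: sum across ranks
        for i, v in enumerate(buf):
            total[i] += v
    return [list(total) for _ in buffers]  # broadcast phase: copy to all
```

Offloading exactly this pattern to the FPGA is what frees the CPU from networking tasks in the paper's collective-offload use case: the host only supplies its local buffer and later reads back the reduced result.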
Enabling the use of embedded and mobile technologies for high-performance computing
In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing (HPC). This transformation has been so effective that the November 2016 TOP500 list is still dominated by the x86 architecture.
In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based Systems on Chips (SoCs). This suggests that once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC.
This thesis addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. Through the development of real system prototypes and their performance analysis, we assess the feasibility of building an HPC system based on mobile SoCs. Through simulation of future mobile SoCs, we identify the missing features and suggest improvements that would enable the use of future mobile SoCs in an HPC environment.
Thus, we present design guidelines for future generations of mobile SoCs, and for HPC systems built around them, enabling a new class of cheap supercomputers.
NetFPGA: status, uses, developments, challenges, and evaluation
The constant growth of the Internet, driven by the demand for timely access to data center networks, has meant that the technological platforms necessary to achieve this purpose are beyond current budgets. In order to make and validate relevant and timely contributions, a wider community needs access to evaluation, experimentation, and demonstration environments with specifications that can be compared with existing networking solutions. This article introduces the NetFPGA, a platform for developing reconfigurable network hardware and for rapid prototyping. It presents the platform's application areas in high-performance networks and its advantages for traffic analysis, packet flow, hardware acceleration, power consumption, and real-time parallel processing. Likewise, it presents the advantages of the platform for research, education, and innovation, as well as its future trends. Finally, we present a performance evaluation of the tool called OSNT (Open-Source Network Tester), showing that OSNT achieves 95% timestamp accuracy with a resolution of 10 ns for the generation of TCP traffic, and 90% efficiency when capturing packets at 10 Gbit/s full line rate.
A cross-stack, network-centric architectural design for next-generation datacenters
This thesis proposes a full-stack, cross-layer datacenter architecture based on in-network computing and near-memory processing paradigms. The proposed datacenter architecture is built atop two principles: (1) utilizing commodity, off-the-shelf hardware (i.e., processor, DRAM, and network devices) with minimal changes to their architecture, and (2) providing a standard interface to the programmers for using the novel hardware. More specifically, the proposed datacenter architecture enables a smart network adapter to collectively compress/decompress data exchange between distributed DNN training nodes and assist the operating system in performing aggressive processor power management. It also deploys specialized memory modules in the servers, capable of performing general-purpose computation and network connectivity.
This thesis unlocks the potential of hardware and operating system co-design in architecting application-transparent, near-data processing hardware for improving datacenter performance, energy efficiency, and scalability. We evaluate the proposed datacenter architecture using a combination of full-system simulation, FPGA prototyping, and real-system experiments.