3 research outputs found

    Virtualización mediante tecnología SR-IOV de tarjetas de red de altas prestaciones basadas en lógica programable [SR-IOV virtualization of high-performance network cards based on programmable logic]

    The IT industry has traditionally taken the most demanding and rigorous standards as its reference in order to achieve stability, high fidelity to protocols, and proper quality in the final product. While this model served well in the past, time to market inevitably becomes a crucial bottleneck when developing custom hardware for network appliances. At this point, Network Function Virtualization (NFV) makes it possible to build specialized solutions out of general-purpose equipment: broadly speaking, computation is shifted from dedicated hardware to CPU-based software. The main objective of this work is to explore the use of FPGAs, and their connectivity with the host system, as a feasible replacement for traditional hardware (switches, routers, etc.) in multigigabit networking environments. The authors' own developments are released under a free license, and the underlying technologies are tracked in detail. The work covers everything from a DMA engine capable of sustaining transfer rates above 40 gigabits per second (with measured peaks of over 50 Gbps) to the device drivers needed to interact with the system. The final reference platform is a network interface card (NIC) that exposes as many virtual functions (VFs) as instantiated interfaces. The traffic transmitted or received by each abstract device is processed individually, transparently to the developer, with the computer network as its destination or source. The key technology is SR-IOV, which, paired with a single FPGA board, eases the emulation of multiple dedicated peripherals. Independently, several virtual machine instances can each gain exclusive access to a VF thanks to PCI passthrough, giving each the illusion of a dedicated, individually owned resource. The independence from the host station's hardware configuration, and the flexibility of the proposed framework, offer the user a notable trade-off between performance and time to market. The popular belief that high-performance computing is at odds with virtualization leads to a wrong conclusion in many scenarios: in particular, an environment in which data is processed at 40 Gbps has been released. However, the underlying virtualization support in the hardware platform (IOMMU) is limited: in the system-to-card direction, transfers suffer a pronounced bottleneck (10% of the performance of the native experiments), whereas this effect is mitigated in the card-to-system direction (over 90% of the native results).
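The host-side mechanics that the thesis builds on are standard Linux SR-IOV plumbing. As a minimal sketch (using the generic sysfs interface, not the thesis's own tooling; the PCI address is a placeholder and root privileges are required), virtual functions can be instantiated like this:

```python
#!/usr/bin/env python3
"""Minimal sketch: enable SR-IOV virtual functions on a Linux PCI device.

Illustrative only -- the device address below is a placeholder, and this
uses the generic sysfs interface rather than any thesis-specific tooling.
Must run as root.
"""
from pathlib import Path

PF = Path("/sys/bus/pci/devices/0000:01:00.0")  # physical function (placeholder BDF)

def enable_vfs(requested: int) -> int:
    """Instantiate up to `requested` VFs; return the number actually enabled."""
    total = int((PF / "sriov_totalvfs").read_text())  # hardware limit advertised by the PF
    count = min(requested, total)
    # Writing 0 first is required if some VFs are already instantiated.
    (PF / "sriov_numvfs").write_text("0")
    (PF / "sriov_numvfs").write_text(str(count))
    return count

if __name__ == "__main__":
    n = enable_vfs(4)
    print(f"{n} virtual functions enabled")
```

Each enabled VF then shows up as its own PCI device, which can be bound to vfio-pci and handed to a guest via PCI passthrough; that is the exclusive-access property the abstract describes.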

    Network-Compute Co-Design for Distributed In-Memory Computing

    The booming popularity of online services is rapidly raising the demands placed on modern datacenters. To cope with the data deluge, growing user bases, and tight quality-of-service constraints, service providers deploy massive datacenters with tens to hundreds of thousands of servers, keeping petabytes of latency-critical data memory resident. Such data distribution and the multi-tiered nature of the software used by feature-rich services result in frequent inter-server communication and remote memory access over the network. Hence, networking takes center stage in datacenters. In response to growing internal datacenter network traffic, networking technology is rapidly evolving. Lean user-level protocols, like RDMA, and high-performance fabrics have started making their appearance, dramatically reducing datacenter-wide network latency and offering unprecedented per-server bandwidth. At the same time, the end of Dennard scaling is grinding processor performance improvements to a halt. The net result is a growing mismatch between per-server network and compute capabilities: it will soon be difficult for a server processor to utilize all of its available network bandwidth. Restoring balance between network and compute capabilities requires tighter co-design of the two. The network interface (NI) is of particular interest, as it lies on the boundary of network and compute. In this thesis, we focus on the design of an NI for a lightweight RDMA-like protocol and its full integration with modern manycore server processors. The NI's capabilities scale with both the increasing network bandwidth and the growing number of cores on modern server processors. Leveraging our architecture's integrated NI logic, we introduce new functionality at the network endpoints that yields performance improvements for distributed systems. Such additions include new network operations with stronger semantics tailored to common application requirements, and integrated logic for balancing network load across a modern processor's multiple cores. We make the case that exposing richer, end-to-end semantics to the NI is a unique enabler for optimizations that can reduce software complexity and remove significant load from the processor, contributing towards maintaining balance between the two valuable resources of network and compute. Overall, network-compute co-design addresses the emerging technological mismatch between compute and networking capabilities, yielding significant performance improvements for distributed memory systems.
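The "integrated logic for balancing network load across a modern processor's multiple cores" is hardware in the thesis; the toy model below (invented names, software only) merely illustrates the kind of dispatch policy such logic can implement, contrasting conventional static flow hashing with load-aware assignment to the shortest core queue:

```python
"""Toy model of an NI dispatching incoming messages to server cores.

The thesis integrates such logic into the network interface itself; this
sketch only contrasts the two policies, it does not model the hardware.
"""
import random
from collections import deque

class ToyNI:
    def __init__(self, num_cores: int):
        self.queues = [deque() for _ in range(num_cores)]

    def dispatch_hashed(self, flow_id: int, msg) -> int:
        """Conventional NIC: pin each flow to one core by hashing (RSS-style)."""
        core = hash(flow_id) % len(self.queues)
        self.queues[core].append(msg)
        return core

    def dispatch_balanced(self, msg) -> int:
        """Load-aware NI: enqueue at the currently shortest core queue."""
        core = min(range(len(self.queues)), key=lambda c: len(self.queues[c]))
        self.queues[core].append(msg)
        return core

if __name__ == "__main__":
    random.seed(0)
    # A few "hot" flows make hashing pile messages onto a couple of cores,
    # while load-aware dispatch keeps queue depths nearly uniform.
    hot_flows = [random.randrange(1 << 32) for _ in range(3)]
    ni_a, ni_b = ToyNI(16), ToyNI(16)
    for _ in range(10_000):
        ni_a.dispatch_hashed(random.choice(hot_flows), "req")
        ni_b.dispatch_balanced("req")
    print("max queue depth, hashed:  ", max(len(q) for q in ni_a.queues))
    print("max queue depth, balanced:", max(len(q) for q in ni_b.queues))
```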

    Squeezing the most benefit from network parallelism in datacenters

    One big non-blocking switch is one of the most powerful and pervasive abstractions in datacenter networking. As Moore's law begins to wane, using parallelism to scale processing units out, rather than up, is becoming exceedingly popular. The one-big-switch abstraction, for example, is typically implemented by leveraging massive degrees of parallelism behind the scenes. In particular, in today's datacenters, which exhibit a high degree of multi-pathing, each logical path between a communicating pair in the one-big-switch abstraction is mapped to a set of paths that can carry traffic in parallel. Similarly, each one-big-switch function, such as a firewall, is mapped to a set of distributed hardware and software switches. Efficiently using this pool of connectivity and preserving the functional correctness of network functions, in spite of the parallelism, are both challenging. Balancing load efficiently among multiple paths is hard because microbursts, responsible for the majority of packet loss in datacenters today, usually last only a few microseconds; even the fastest traffic-engineering schemes have control loops that are several orders of magnitude slower (a few milliseconds to a few seconds) and are therefore ineffective in controlling microbursts. Implementing network functions correctly in the face of parallelism is hard because the distributed set of elements that implement a one-big-switch abstraction in parallel can inevitably hold inconsistent state that makes them behave differently than one physical switch.
    The first part of this thesis presents DRILL, a datacenter fabric for Clos networks which performs micro load balancing to distribute load as evenly as possible on microsecond timescales. To achieve this, DRILL employs per-packet decisions at each switch, based on local queue occupancies and randomized algorithms, to distribute load. Despite making per-packet forwarding decisions, DRILL keeps the degree of packet reordering low by enforcing tight control on queue occupancies. DRILL adapts to topological asymmetry (e.g., failures) in Clos networks by decomposing the network into symmetric components. Using a detailed switch hardware model, we simulate DRILL and show it outperforms recent edge-based load balancers, particularly in tail latency under heavy load; e.g., at 80% load it reduces the 99.99th percentile of flow completion times of Presto and CONGA by 32% and 35%, respectively. Finally, we analyze DRILL's stability and throughput-efficiency.
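Concretely, DRILL's per-packet decision is a power-of-d-choices rule over local queue occupancies with a short memory. The sketch below renders that policy in software; the parameter values d=2 and m=1 are illustrative assumptions, and real DRILL runs this per packet in switch hardware:

```python
"""Toy per-packet DRILL(d, m) forwarding decision at one switch.

Sketch of the policy only: compare d randomly sampled output queues plus
m queues remembered from the previous decision, pick the least occupied,
and remember it. Queue draining is left out of this toy.
"""
import random

class DrillSwitch:
    def __init__(self, num_ports: int, d: int = 2, m: int = 1):
        self.occupancy = [0] * num_ports  # local per-output-queue depth
        self.d, self.m = d, m
        self.remembered = random.sample(range(num_ports), m)

    def forward(self) -> int:
        """Pick an output queue for one packet."""
        candidates = random.sample(range(len(self.occupancy)), self.d) + self.remembered
        best = min(candidates, key=lambda q: self.occupancy[q])
        self.remembered = [best]   # memory: carry the winner into the next decision
        self.occupancy[best] += 1  # packet enqueued (drained elsewhere in reality)
        return best
```

Because every decision uses only counters local to the switch, the control loop is as fast as the data path itself, which is what lets DRILL react within a microburst's lifetime.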
In the second part, we focus on the correctness of the one-big-switch abstraction's implementation. We first show that naively using parallelism to scale networking elements can cause incorrect behavior; for example, an IDS that operates correctly as a single network element can erroneously and permanently block hosts once it is replicated. We then provide a system, COCONUT, for seamless scale-out of network forwarding elements; that is, an SDN application programmer can program against what functionally appears to be a single forwarding element, which may nevertheless be replicated behind the scenes. To do this, we identify the key property for seamless scale-out, weak causality, and guarantee it through a practical and scalable implementation of vector clocks in the data plane. We build a prototype of COCONUT and experimentally demonstrate its correct behavior. We also show that its abstraction enables a more efficient implementation of seamless scale-out than a naive baseline. Finally, reasoning about network behavior requires a model that distinguishes observable from unobservable events, so in the last part we use the Input/Output Automaton (IOA) model to formalize network behavior. Using this framework, we prove that COCONUT enables seamless scale-out of networking elements, i.e., the user-perceived behavior of any COCONUT element implemented with a distributed set of concurrent replicas is provably indistinguishable from that of its singleton implementation.
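The weak-causality machinery the abstract mentions is built on classic vector clocks. A minimal sketch of the bookkeeping follows; COCONUT's contribution is making this practical at data-plane speeds, which a software toy like this does not capture:

```python
"""Toy vector clocks, the mechanism COCONUT embeds in the data plane.

Each replica keeps one counter per replica; comparing two clocks reveals
whether one event causally precedes another or they are concurrent.
"""

class Replica:
    def __init__(self, rid: int, n_replicas: int):
        self.rid = rid
        self.clock = [0] * n_replicas

    def local_event(self) -> list[int]:
        """A rule update or packet-processing step at this replica."""
        self.clock[self.rid] += 1
        return list(self.clock)

    def merge(self, other_clock: list[int]) -> None:
        """On receiving state from another replica, take the element-wise max."""
        self.clock = [max(a, b) for a, b in zip(self.clock, other_clock)]
        self.clock[self.rid] += 1

def happened_before(c1: list[int], c2: list[int]) -> bool:
    """True iff the event stamped c1 causally precedes the event stamped c2."""
    return all(a <= b for a, b in zip(c1, c2)) and c1 != c2
```

A replica that respects `happened_before` when applying updates can never expose a state ordering that a single physical switch could not have produced, which is the indistinguishability property the IOA proof formalizes.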