209 research outputs found
NIC-assisted cache-efficient receive stack for message passing over Ethernet
International audienceHigh-speed networking in clusters usually relies on advanced hardware features in the NICs, such as zero-copy capability. Open-MX is a high-performance message passing stack tailored for regular Ethernet hardware without such capabilities. We present the addition of a multiqueue support in the Open-MX receive stack so that all incoming packets for the same process are handled on the same core. We then introduce the idea of binding the target end process near its dedicated receive queue. This model leads to a more cache-efficient receive stack for Open-MX. It also proves that very simple and stateless hardware features may have a significant impact on message passing performance over Ethernet. The implementation of this model in a firmware reveals that it may not be as efficient as some manually tuned micro-benchmarks. But our multiqueue receive stack generally performs better than the original single queue stack, especially on large communication patterns where multiple processes are involved and manual binding is difficult
NIC-assisted Cache-Efficient Receive Stack for Message Passing over Ethernet
International audienceHigh-speed networking in clusters usually relies on advanced hardware features in the NICs, such as zero-copy. Open-MX is a high-performance message passing stack designed for regular Ethernet hardware without such capabilities. We present the addition of multiqueue support in the Open-MX receive stack so that all incoming packets for the same process are treated on the same core. We then introduce the idea of binding the target end process near its dedicated receive queue. This model leads to a more cache-efficient receive stack for Open-MX. It also proves that very simple and stateless hardware features may have a significant impact on message passing performance over Ethernet. The implementation of this model in a firmware reveals that it may not be as efficient as some manually tuned micro-benchmarks. But our multiqueue receive stack generally performs better than the original single queue stack, especially on large communication patterns where multiple processes are involved and manual binding is difficult
High-Performance Message Passing over generic Ethernet Hardware with Open-MX
International audienceIn the last decade, cluster computing has become the most popular high-performance computing architecture. Although numerous technological innovations have been proposed to improve the interconnection of nodes, many clusters still rely on commodity Ethernet hardware to implement message passing within parallel applications. We present Open-MX, an open-source message passing stack over generic Ethernet. It offers the same abilities as the specialized Myrinet Express stack, without requiring dedicated support from the networking hardware. Open-MX works transparently in the most popular MPI implementations through its MX interface compatibility. It also enables interoperability between hosts running the specialized MX stack and generic Ethernet hosts. We detail how Open-MX copes with the inherent limitations of the Ethernet hardware to satisfy the requirements of message passing by applying an innovative copy offload model. Combined with a careful tuning of the fabric and of the MX wire protocol, Open-MX achieves better performance than TCP implementations, especially on 10 gigabit/s hardware
Datacenter Traffic Control: Understanding Techniques and Trade-offs
Datacenters provide cost-effective and flexible access to scalable compute
and storage resources necessary for today's cloud computing needs. A typical
datacenter is made up of thousands of servers connected with a large network
and usually managed by one operator. To provide quality access to the variety
of applications and services hosted on datacenters and maximize performance, it
deems necessary to use datacenter networks effectively and efficiently.
Datacenter traffic is often a mix of several classes with different priorities
and requirements. This includes user-generated interactive traffic, traffic
with deadlines, and long-running traffic. To this end, custom transport
protocols and traffic management techniques have been developed to improve
datacenter network performance.
In this tutorial paper, we review the general architecture of datacenter
networks, various topologies proposed for them, their traffic properties,
general traffic control challenges in datacenters and general traffic control
objectives. The purpose of this paper is to bring out the important
characteristics of traffic control in datacenters and not to survey all
existing solutions (as it is virtually impossible due to massive body of
existing research). We hope to provide readers with a wide range of options and
factors while considering a variety of traffic control mechanisms. We discuss
various characteristics of datacenter traffic control including management
schemes, transmission control, traffic shaping, prioritization, load balancing,
multipathing, and traffic scheduling. Next, we point to several open challenges
as well as new and interesting networking paradigms. At the end of this paper,
we briefly review inter-datacenter networks that connect geographically
dispersed datacenters which have been receiving increasing attention recently
and pose interesting and novel research problems.Comment: Accepted for Publication in IEEE Communications Surveys and Tutorial
Acceleration of the hardware-software interface of a communication device for parallel systems
During the last decades the ever growing need for computational power fostered the development of parallel computer architectures. Applications need to be parallelized and optimized to be able to exploit modern system architectures. Today, scalability of applications is more and more limited both by development resources, as programming of complex parallel applications becomes increasingly demanding, and by the fundamental scalability issues introduced by the cost of communication in distributed memory systems. Lowering the latency of communication is mandatory to increase scalability and serves as an enabling technology for programming of distributed memory systems at a higher abstraction layer using higher degrees of compiler driven automation. At the same time it can increase performance of such systems in general. In this work, the software/hardware interface and the network interface controller functions of the EXTOLL network architecture, which is specifically designed to satisfy the needs of low-latency networking for high-performance computing, is presented. Several new architectural contributions are made in this thesis, namely a new efficient method for virtual-tophysical address-translation named ATU and a novel method to issue operations to a virtual device in an optimal way which has been termed Transactional I/O. This new method needs changes in the architecture of the host CPU the device is connected to. Two additional methods that emulate most of the characteristics of Transactional I/O are developed and employed in the development of the EXTOLL hardware to facilitate usage together with contemporary CPUs. These new methods heavily leverage properties of the HyperTransport interface used to connect the device to the CPU. Finally, this thesis also introduces an optimized remote-memory-access architecture for efficient split-phase transactions and atomic operations. The complete architecture has been prototyped using FPGA technology enabling a more precise analysis and verification than is possible using simulation alone. The resulting design utilizes 95 % of a 90 nm FPGA device and reaches speeds of 200 MHz and 156 MHz in the different clock domains of the design. The EXTOLL software stack is developed and a performance evaluation of the software using the EXTOLL hardware is performed. The performance evaluation shows an excellent start-up latency value of 1.3 μs, which competes with the most advanced networks available, in spite of the technological performance handicap encountered by FPGA technology. The resulting network is, to the best of the knowledge of the author, the fastest FPGA-based interconnection network for commodity processors ever built
Virtualization services: scalable methods for virtualizing multicore systems
Multi-core technology is bringing parallel processing capabilities
from servers to laptops and even handheld devices. At the same time,
platform support for system virtualization is making it easier to
consolidate server and client resources, when and as needed by
applications. This consolidation is achieved by dynamically mapping
the virtual machines on which applications run to underlying
physical machines and their processing cores. Low cost processor and
I/O virtualization methods efficiently scaled to different numbers of
processing cores and I/O devices are key enablers of such consolidation.
This dissertation develops and evaluates new methods for scaling
virtualization functionality to multi-core and future many-core systems.
Specifically, it re-architects virtualization functionality to improve
scalability and better exploit multi-core system resources. Results
from this work include a self-virtualized I/O abstraction, which
virtualizes I/O so as to flexibly use different platforms' processing
and I/O resources. Flexibility affords improved performance and resource
usage and most importantly, better scalability than that offered by
current I/O virtualization solutions. Further, by describing system virtualization as a
service provided to virtual machines and the underlying computing platform,
this service can be enhanced to provide new and innovative functionality.
For example, a virtual device may provide obfuscated data to guest operating
systems to maintain data privacy; it could mask differences in device
APIs or properties to deal with heterogeneous underlying resources; or it
could control access to data based on the ``trust' properties of the
guest VM.
This thesis demonstrates that extended virtualization services are
superior to existing operating system or user-level implementations
of such functionality, for multiple reasons. First, this solution
technique makes more efficient use of key performance-limiting resource in
multi-core systems, which are memory and I/O bandwidth. Second, this
solution technique better exploits the parallelism inherent in multi-core
architectures and exhibits good scalability properties, in
part because at the hypervisor level, there is greater control in precisely
which and how resources are used to realize extended virtualization services.
Improved control over resource usage makes it possible to provide
value-added functionalities for both guest VMs and the platform.
Specific instances of virtualization services described in this thesis are the
network virtualization service that exploits heterogeneous processing cores,
a storage virtualization service that provides location transparent access
to block devices by extending
the functionality provided by network virtualization service, a multimedia
virtualization service that allows efficient media device sharing based on semantic
information, and an object-based storage service with enhanced access
control.Ph.D.Committee Chair: Schwan, Karsten; Committee Member: Ahamad, Mustaq; Committee Member: Fujimoto, Richard; Committee Member: Gavrilovska, Ada; Committee Member: Owen, Henry; Committee Member: Xenidis, Jim
Cloud-efficient modelling and simulation of magnetic nano materials
Scientific simulations are rarely attempted in a cloud due to the substantial
performance costs of virtualization. Considerable communication overheads,
intolerable latencies, and inefficient hardware emulation are the main reasons why
this emerging technology has not been fully exploited. On the other hand, the
progress of computing infrastructure nowadays is strongly dependent on
perspective storage medium development, where efficient micromagnetic
simulations play a vital role in future memory design.
This thesis addresses both these topics by merging micromagnetic simulations
with the latest OpenStack cloud implementation while providing a time and costeffective alternative to expensive computing centers.
However, many challenges have to be addressed before a high-performance cloud
platform emerges as a solution for problems in micromagnetic research
communities. First, the best solver candidate has to be selected and further
improved, particularly in the parallelization and process communication domain.
Second, a 3-level cloud communication hierarchy needs to be recognized and
each segment adequately addressed. The required steps include breaking the VMisolation for the host’s shared memory activation, cloud network-stack tuning,
optimization, and efficient communication hardware integration.
The project work concludes with practical measurements and confirmation of
successfully implemented simulation into an open-source cloud environment. It is
achieved that the renewed Magpar solver runs for the first time in the OpenStack
cloud by using ivshmem for shared memory communication. Also, extensive
measurements proved the effectiveness of our solutions, yielding from sixty
percent to over ten times better results than those achieved in the standard cloud.Aufgrund der erheblichen Leistungskosten der Virtualisierung werden
wissenschaftliche Simulationen in einer Cloud selten versucht. Beträchtlicher
Kommunikationsaufwand, erhebliche Latenzen und ineffiziente
Hardwareemulation sind die Hauptgründe, warum diese aufkommende
Technologie nicht vollständig genutzt wurde. Andererseits hängt der Fortschritt der
Computertechnologie heutzutage stark von der Entwicklung perspektivischer
Speichermedien ab, bei denen effiziente mikromagnetische Simulationen eine
wichtige Rolle für die zukünftige Speichertechnologie spielen.
Diese Arbeit befasst sich mit diesen beiden Themen, indem mikromagnetische
Simulationen mit der neuesten OpenStack Cloud-Implementierung
zusammengeführt werden, um eine zeit- und kostengünstige Alternative zu teuren
Rechenzentren bereitzustellen.
Viele Herausforderungen müssen jedoch angegangen werden, bevor eine
leistungsstarke Cloud-Plattform als Lösung für Probleme in mikromagnetischen
Forschungsgemeinschaften entsteht. Zunächst muss der beste Kandidat für die
Lösung ausgewählt und weiter verbessert werden, insbesondere im Bereich der
Parallelisierung und Prozesskommunikation. Zweitens muss eine 3-stufige CloudKommunikationshierarchie erkannt und jedes Segment angemessen adressiert
werden. Die erforderlichen Schritte umfassen das Aufheben der VM-Isolation, um
den gemeinsam genutzten Speicher zwischen Cloud-Instanzen zu aktivieren, die
Optimierung des Cloud-Netzwerkstapels und die effiziente Integration von
Kommunikationshardware.
Die praktische Arbeit endet mit Messungen und der Bestätigung einer erfolgreich
implementierten Simulation in einer Open-Source Cloud-Umgebung. Als Ergebnis
haben wir erreicht, dass der neu erstellte Magpar-Solver zum ersten Mal in der
OpenStack Cloud ausgeführt wird, indem ivshmem für die Shared-Memory
Kommunikation verwendet wird. Umfangreiche Messungen haben auch die
Wirksamkeit unserer Lösungen bewiesen und von sechzig Prozent bis zu zehnmal
besseren Ergebnissen als in der Standard Cloud geführt
Leveraging virtualization technologies for resource partitioning in mixed criticality systems
Multi- and many-core processors are becoming increasingly popular in embedded systems. Many of these processors now feature hardware virtualization capabilities, such as the ARM Cortex A15, and x86 processors with Intel VT-x or AMD-V support. Hardware virtualization offers opportunities to partition physical resources, including processor cores, memory and I/O devices amongst guest virtual machines. Mixed criticality systems and services can then co-exist on the same platform in separate virtual machines. However, traditional virtual machine systems are too expensive because of the costs of trapping into hypervisors to multiplex and manage machine physical resources on behalf of separate guests. For example, hypervisors are needed to schedule separate VMs on physical processor cores. Additionally, traditional hypervisors have memory footprints that are often too large for many embedded computing systems. This dissertation presents the design of the Quest-V separation kernel, which partitions services of different criticality levels across separate virtual machines, or sandboxes. Each sandbox encapsulates a subset of machine physical resources that it manages without requiring intervention of a hypervisor. In Quest-V, a hypervisor is not needed for normal operation, except to bootstrap the system and establish communication channels between sandboxes. This approach not only reduces the memory footprint of the most privileged protection domain, it removes it from the control path during normal system operation, thereby heightening security
Scale-out NUMA
Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of the accesses is a poor match to commodity networking technologies, including RDMA, which incur delays of 10-1000x over local DRAM operations. We introduce Scale-Out NUMA (soNUMA) – an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. soNUMA layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA relies on the remote memory controller – a new architecturally-exposed hardware block integrated into the node’s local coherence hierarchy. Our results based on cycle-accurate full-system simulation show that soNUMA performs remote reads at latencies that are within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core
- …