572 research outputs found
Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations
This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. 
This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered
Enhancing HPC on Virtual Systems in Clouds through Optimizing Virtual Overlay Networks
Virtual Ethernet overlay provides a powerful model for realizing virtual distributed and parallel computing systems with strong isolation, portability, and recoverability properties. However, in extremely high throughput and low latency networks, such overlays can suffer from bandwidth and latency limitations, which is of particular concern in HPC environments. Through a careful and quantitative analysis, I identify three core issues limiting performance: delayed and excessive virtual interrupt delivery into guests, copies between host and guest data buffers during encapsulation, and the semantic gap between virtual Ethernet features and underlying physical network features. I propose three novel optimizations in response: optimistic timer-free virtual interrupt injection, zero-copy cut-through data forwarding, and virtual TCP offload. These optimizations improve the latency and bandwidth of the overlay network on 10 Gbps Ethernet and InfiniBand interconnects, resulting in near-native performance for a wide range of microbenchmarks and MPI application benchmarks
CoRD: Converged RDMA Dataplane for High-Performance Clouds
High-performance networking is often characterized by kernel bypass which is
considered mandatory in high-performance parallel and distributed applications.
But kernel bypass comes at a price because it breaks the traditional OS
architecture, requiring applications to use special APIs and limiting the OS
control over existing network connections. We make the case that kernel bypass
is not mandatory. Rather, high-performance networking relies on multiple
performance-improving techniques, with kernel bypass being the least effective.
CoRD removes kernel bypass from RDMA networks, enabling efficient OS-level
control over the RDMA dataplane. Comment: 11 page
Enhancing speed and scalability of the ParFlow simulation code
Regional hydrology studies are often supported by high resolution simulations
of subsurface flow that require expensive and extensive computations. Efficient
usage of the latest high performance parallel computing systems becomes a
necessity. The simulation software ParFlow has been demonstrated to meet this
requirement and shown to have excellent solver scalability for up to 16,384
processes. In the present work we show that the code requires further
enhancements in order to fully take advantage of current petascale machines. We
identify ParFlow's way of parallelization of the computational mesh as a
central bottleneck. We propose to reorganize this subsystem using fast mesh
partition algorithms provided by the parallel adaptive mesh refinement library
p4est. We realize this in a minimally invasive manner by modifying selected
parts of the code to reinterpret the existing mesh data structures. We evaluate
the scaling performance of the modified version of ParFlow, demonstrating good
weak and strong scaling up to 458k cores of the Juqueen supercomputer, and test
an example application at large scale. Comment: The final publication is available at link.springer.co
Adolescent brain cognitive development (ABCD) study: Overview of substance use assessment methods.
One of the objectives of the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org/) is to establish a national longitudinal cohort of 9 and 10 year olds that will be followed for 10 years in order to prospectively study the risk and protective factors influencing substance use and its consequences, examine the impact of substance use on neurocognitive, health and psychosocial outcomes, and to understand the relationship between substance use and psychopathology. This article provides an overview of the ABCD Study Substance Use Workgroup, provides the goals for the workgroup, rationale for the substance use battery, and includes details on the substance use module methods and measurement tools used during baseline, 6-month and 1-year follow-up assessment time-points. Prospective, longitudinal assessment of these substance use domains over a period of ten years in a nationwide sample of youth presents an unprecedented opportunity to further understand the timing and interactive relationships between substance use and neurocognitive, health, and psychopathology outcomes in youth living in the United States
Cloud-efficient modelling and simulation of magnetic nano materials
Scientific simulations are rarely attempted in a cloud due to the substantial
performance costs of virtualization. Considerable communication overheads,
intolerable latencies, and inefficient hardware emulation are the main reasons why
this emerging technology has not been fully exploited. On the other hand, the
progress of computing infrastructure today depends strongly on the
development of prospective storage media, where efficient micromagnetic
simulations play a vital role in future memory design.
This thesis addresses both these topics by merging micromagnetic simulations
with the latest OpenStack cloud implementation while providing a time- and cost-effective alternative to expensive computing centers.
However, many challenges have to be addressed before a high-performance cloud
platform emerges as a solution for problems in micromagnetic research
communities. First, the best solver candidate has to be selected and further
improved, particularly in the parallelization and process communication domain.
Second, a 3-level cloud communication hierarchy needs to be recognized and
each segment adequately addressed. The required steps include breaking VM isolation to enable the host’s shared memory, tuning and optimizing the cloud network stack,
and integrating efficient communication hardware.
The project work concludes with practical measurements confirming that the
simulation was successfully deployed in an open-source cloud environment. The
revised Magpar solver runs for the first time in the OpenStack
cloud, using ivshmem for shared memory communication. Also, extensive
measurements proved the effectiveness of our solutions, yielding from sixty
percent to over ten times better results than those achieved in the standard cloud.
ATCOM: Automatically tuned collective communication system for SMP clusters.
Conventional implementations of collective communications are based on point-to-point communications, and their optimizations have focused on the efficiency of those communication algorithms. However, point-to-point communications are not the optimal choice for modern computing clusters of SMPs due to their two-level communication structure. In recent years, a few research efforts have investigated efficient collective communications for SMP clusters. This dissertation is focused on platform-independent algorithms and implementations in this area. There are two main approaches to implementing efficient collective communications for clusters of SMPs: using shared memory operations for intra-node communications, and overlapping inter-node/intra-node communications. The former fully utilizes the hardware-based shared memory of an SMP, and the latter takes advantage of the inherent hierarchy of the communications within a cluster of SMPs. Previous studies focused on clusters of SMPs from certain vendors. However, the previously proposed methods are not portable to other systems. Because the performance optimization issue is very complicated and the development process is very time consuming, self-tuning, platform-independent implementations are highly desirable. As proven in this dissertation, such an implementation can significantly outperform the other point-to-point based portable implementations and some platform-specific implementations. The dissertation describes in detail the architecture of the platform-independent implementation. There are four system components: shared memory-based collective communications, overlapping mechanisms for inter-node and intra-node communications, a prediction-based tuning module and a micro-benchmark based tuning module. Each component is carefully designed with the goal of automatic tuning in mind