Search CORE

20 research outputs found

Nonblocking collectives for scalable Java communications

Author: Brightwell
Buntinas
Ekman
Expósito
Hoefler
Kandalla
Nishtala
Ramos
Shafi
Taboada
Taboada
Taboada
Publication venue: 'Wiley'
Publication date: 01/01/2015
Field of study

This is the peer reviewed version of the following article: Ramos, S., Taboada, G. L., Expósito, R. R., & Touriño, J. (2015). Nonblocking collectives for scalable Java communications. Concurrency and Computation: Practice and Experience, 27(5), 1169-1187, which has been published in final form at https://doi.org/10.1002/cpe.3279. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.[Abstract] This paper presents a Java implementation of the recently published MPI 3.0 nonblocking message passing collectives in order to analyze and assess the feasibility of taking advantage of these operations in shared memory systems using Java. Nonblocking collectives aim to exploit the overlapping between computation and communication for collective operations to increase scalability of message passing codes, as it has been carried out for nonblocking point‐to‐point primitives. This scalability has become crucial not only for clusters but also for shared memory systems because of the current trend of increasing the number of cores per chip, which is leading to the generalization of multi‐core and many‐core processors. Message passing libraries based on remote direct memory access, thread‐based progression, or implementing pure multi‐threading shared memory support could potentially benefit from the lack of imposed synchronization by nonblocking collectives. But, although the distributed memory scenario has been well studied, the shared memory one has not been tackled yet. Hence, nonblocking collectives support has been included in FastMPJ, a Message Passing in Java (MPJ) implementation, and evaluated on a representative shared memory system, obtaining significant improvements because of overlapping and lack of implicit synchronization, and with barely any overhead imposed over common blocking operations.Ministerio de Ciencia e Innovación; TIN2010-16735Xunta de Galicia; CN2012/211Xunta de Galicia; GRC2013/05

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

High performance Java for multi-core systems

Author: Ramos Garea Sabela
Publication venue
Publication date: 01/01/2013
Field of study

[Abstract] The interest in Java within the High Performance Computing (HPC) community has been rising during the last years thanks to its noticeable performance improvements and its productivity features. In a context where the trend to increase the number of cores per processor is leading to the generalization of many-core processors and accelerators, multithreading as an inherent feature of the language makes Java extremely interesting to exploit the performance provided by multi- and manycore architectures. This PhD Thesis presents a thorough analysis of the current state of the art regarding multi- and many-core programming in Java and provides the design, implementation and evaluation of several solutions to enable Java for the many-core era. To achieve this, a shared memory message-passing solution has been implemented to provide shared memory programming with the scalability of distributed memory paradigms, also with the benefits of a portable programming model that allows the developed codes to be run on distributed memory systems. Moreover, representative collective operations, involving computation and communication among different processes or threads, have been optimized, also introducing in Java new features for scalability from the MPI 3.0 specification, namely nonblocking collectives. Regarding the exploitation of many-core architectures, the lack of direct Java support forces to resort to wrappers or higher-level solutions to translate Java code into CUDA or OpenCL. The most relevant among these solutions have been evaluated and thoroughly analyzed in terms of performance and productivity. Guidelines for taking advantage of shared memory environments have been derived during the analysis and development of the proposed solutions, and the main conclusion is that the use of Java for shared memory programming on multi- and many-core systems is not only productive but also can provide high performance competitive results. However, in order to effectively take advantage of the underlying multi- and many-core architectures, the key is the availability of optimized middleware that abstracts multithreading details from the user, like the one proposed in this Thesis, and the optimization of common operations like collective communications

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Runtime MPI Correctness Checking with a Scalable Tools Infrastructure

Author: Hilbrich Tobias
Publication venue
Publication date: 08/06/2015
Field of study

Increasing computational demand of simulations motivates the use of parallel computing systems. At the same time, this parallelism poses challenges to application developers. The Message Passing Interface (MPI) is a de-facto standard for distributed memory programming in high performance computing. However, its use also enables complex parallel programing errors such as races, communication errors, and deadlocks. Automatic tools can assist application developers in the detection and removal of such errors. This thesis considers tools that detect such errors during an application run and advances them towards a combination of both precise checks (neither false positives nor false negatives) and scalability. This includes novel hierarchical checks that provide scalability, as well as a formal basis for a distributed deadlock detection approach. At the same time, the development of parallel runtime tools is challenging and time consuming, especially if scalability and portability are key design goals. Current tool development projects often create similar tool components, while component reuse remains low. To provide a perspective towards more efficient tool development, which simplifies scalable implementations, component reuse, and tool integration, this thesis proposes an abstraction for a parallel tools infrastructure along with a prototype implementation. This abstraction overcomes the use of multiple interfaces for different types of tool functionality, which limit flexible component reuse. Thus, this thesis advances runtime error detection tools and uses their redesign and their increased scalability requirements to apply and evaluate a novel tool infrastructure abstraction. The new abstraction ultimately allows developers to focus on their tool functionality, rather than on developing or integrating common tool components. The use of such an abstraction in wide ranges of parallel runtime tool development projects could greatly increase component reuse. Thus, decreasing tool development time and cost. An application study with up to 16,384 application processes demonstrates the applicability of both the proposed runtime correctness concepts and of the proposed tools infrastructure

Technische Universität Dresden: Qucosa

Design and Evaluation of Low-Latency Communication Middleware on High Performance Computing Systems

Author: Expósito Roberto R.
Publication venue
Publication date: 01/01/2014
Field of study

[Resumen]El interés en Java para computación paralela está motivado por sus interesantes características, tales como su soporte multithread, portabilidad, facilidad de aprendizaje,alta productividad y el aumento significativo en su rendimiento omputacional. No obstante, las aplicaciones paralelas en Java carecen generalmente de mecanismos de comunicación eficientes, los cuales utilizan a menudo protocolos basados en sockets incapaces de obtener el máximo provecho de las redes de baja latencia, obstaculizando la adopción de Java en computación de altas prestaciones (High Per- formance Computing, HPC). Esta Tesis Doctoral presenta el diseño, implementación y evaluación de soluciones de comunicación en Java que superan esta limitación. En consecuencia, se desarrollaron múltiples dispositivos de comunicación a bajo nivel para paso de mensajes en Java (Message-Passing in Java, MPJ) que aprovechan al máximo el hardware de red subyacente mediante operaciones de acceso directo a memoria remota que proporcionan comunicaciones de baja latencia. También se incluye una biblioteca de paso de mensajes en Java totalmente funcional, FastMPJ, en la cual se integraron los dispositivos de comunicación. La evaluación experimental ha mostrado que las primitivas de comunicación de FastMPJ son competitivas en comparación con bibliotecas nativas, aumentando significativamente la escalabilidad de aplicaciones MPJ. Por otro lado, esta Tesis analiza el potencial de la computación en la nube (cloud computing) para HPC, donde el modelo de distribución de infraestructura como servicio (Infrastructure as a Service, IaaS) emerge como una alternativa viable a los sistemas HPC tradicionales. La evaluación del rendimiento de recursos cloud específicos para HPC del proveedor líder, Amazon EC2, ha puesto de manifiesto el impacto significativo que la virtualización impone en la red, impidiendo mover las aplicaciones intensivas en comunicaciones a la nube. La clave reside en un soporte de virtualización apropiado, como el acceso directo al hardware de red, junto con las directrices para la optimización del rendimiento sugeridas en esta Tesis.[Resumo]O interese en Java para computación paralela está motivado polas súas interesantes características, tales como o seu apoio multithread, portabilidade, facilidade de aprendizaxe, alta produtividade e o aumento signi cativo no seu rendemento computacional. No entanto, as aplicacións paralelas en Java carecen xeralmente de mecanismos de comunicación e cientes, os cales adoitan usar protocolos baseados en sockets que son incapaces de obter o máximo proveito das redes de baixa latencia, obstaculizando a adopción de Java na computación de altas prestacións (High Performance Computing, HPC). Esta Tese de Doutoramento presenta o deseño, implementaci ón e avaliación de solucións de comunicación en Java que superan esta limitación. En consecuencia, desenvolvéronse múltiples dispositivos de comunicación a baixo nivel para paso de mensaxes en Java (Message-Passing in Java, MPJ) que aproveitan ao máaximo o hardware de rede subxacente mediante operacións de acceso directo a memoria remota que proporcionan comunicacións de baixa latencia. Tamén se inclúe unha biblioteca de paso de mensaxes en Java totalmente funcional, FastMPJ, na cal foron integrados os dispositivos de comunicación. A avaliación experimental amosou que as primitivas de comunicación de FastMPJ son competitivas en comparación con bibliotecas nativas, aumentando signi cativamente a escalabilidade de aplicacións MPJ. Por outra banda, esta Tese analiza o potencial da computación na nube (cloud computing) para HPC, onde o modelo de distribución de infraestrutura como servizo (Infrastructure as a Service, IaaS) xorde como unha alternativa viable aos sistemas HPC tradicionais. A ampla avaliación do rendemento de recursos cloud específi cos para HPC do proveedor líder, Amazon EC2, puxo de manifesto o impacto signi ficativo que a virtualización impón na rede, impedindo mover as aplicacións intensivas en comunicacións á nube. A clave atópase no soporte de virtualización apropiado, como o acceso directo ao hardware de rede, xunto coas directrices para a optimización do rendemento suxeridas nesta Tese.[Abstract]The use of Java for parallel computing is becoming more promising owing to its appealing features, particularly its multithreading support, portability, easy-tolearn properties, high programming productivity and the noticeable improvement in its computational performance. However, parallel Java applications generally su er from inefficient communication middleware, most of which use socket-based protocols that are unable to take full advantage of high-speed networks, hindering the adoption of Java in the High Performance Computing (HPC) area. This PhD Thesis presents the design, development and evaluation of scalable Java communication solutions that overcome these constraints. Hence, we have implemented several lowlevel message-passing devices that fully exploit the underlying network hardware while taking advantage of Remote Direct Memory Access (RDMA) operations to provide low-latency communications. Moreover, we have developed a productionquality Java message-passing middleware, FastMPJ, in which the devices have been integrated seamlessly, thus allowing the productive development of Message-Passing in Java (MPJ) applications. The performance evaluation has shown that FastMPJ communication primitives are competitive with native message-passing libraries, improving signi cantly the scalability of MPJ applications. Furthermore, this Thesis has analyzed the potential of cloud computing towards spreading the outreach of HPC, where Infrastructure as a Service (IaaS) o erings have emerged as a feasible alternative to traditional HPC systems. Several cloud resources from the leading IaaS provider, Amazon EC2, which speci cally target HPC workloads, have been thoroughly assessed. The experimental results have shown the signi cant impact that virtualized environments still have on network performance, which hampers porting communication-intensive codes to the cloud. The key is the availability of the proper virtualization support, such as the direct access to the network hardware, along with the guidelines for performance optimization suggested in this Thesis

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Analyse statique/dynamique pour la validation et l'amélioration des applications parallèles multi-modèles

Author: Saillard Emmanuelle
Publication venue: HAL CCSD
Publication date: 24/09/2015
Field of study

Supercomputing plays an important role in several innovative fields, speeding up prototyping or validating scientific theories. However, supercomputers are evolving rapidly with now millions of processing units, posing the questions of their programmability. Despite the emergence of more widespread and functional parallel programming models, developing correct and effective parallel applications still remains a complex task. Although debugging solutions have emerged to address this issue, they often come with restrictions. However programming model evolutions stress the requirement for a convenient validation tool able to handle hybrid applications. Indeed as current scientific applications mainly rely on the Message Passing Interface (MPI) parallel programming model, new hardwares designed for Exascale with higher node-level parallelism clearly advocate for an MPI+X solutions with X a thread-based model such as OpenMP. But integrating two different programming models inside the same application can be error-prone leading to complex bugs - mostly detected unfortunately at runtime. In an MPI+X program not only the correctness of MPI should be ensured but also its interactions with the multi-threaded model, for example identical MPI collective operations cannot be performed by multiple nonsynchronized threads. This thesis aims at developing a combination of static and dynamic analysis to enable an early verification of hybrid HPC applications. The first pass statically verifies the thread level required by an MPI+OpenMP application and outlines execution paths leading to potential deadlocks. Thanks to this analysis, the code is selectively instrumented, displaying an error and synchronously interrupting all processes if the actual scheduling leads to a deadlock situation.L’utilisation du parallélisme des architectures actuelles dans le domaine du calcul hautes performances, oblige à recourir à différents langages parallèles. Ainsi, l’utilisation conjointe de MPI pour le parallélisme gros grain, à mémoire distribuée et OpenMP pour du parallélisme de thread, fait partie des pratiques de développement d’applications pour supercalculateurs. Des erreurs, liées à l’utilisation conjointe de ces langages de parallélisme, sont actuellement difficiles à détecter et cela limite l’écriture de codes, permettant des interactions plus poussées entre ces niveaux de parallélisme. Des outils ont été proposés afin de palier ce problème. Cependant, ces outils sont généralement focalisés sur un type de modèle et permettent une vérification dite statique (à la compilation) ou dynamique (à l’exécution). Pourtant une combinaison statique/- dynamique donnerait des informations plus pertinentes. En effet, le compilateur est en mesure de donner des informations relatives au comportement général du code, indépendamment du jeu d’entrée. C’est par exemple le cas des problèmes liés aux communications collectives du modèle MPI. Cette thèse a pour objectif de développer des analyses statiques/dynamiques permettant la vérification d’une application parallèle mélangeant plusieurs modèles de programmation, afin de diriger les développeurs vers un code parallèle multi-modèles correct et performant. La vérification se fait en deux étapes. Premièrement, de potentielles erreurs sont détectées lors de la phase de compilation. Ensuite, un test au runtime est ajouté pour savoir si le problème va réellement se produire. Grâce à ces analyses combinées, nous renvoyons des messages précis aux utilisateurs et évitons les situations de blocage

Thèses en Ligne

Theses.fr

Partial aggregation for collective communication in distributed memory machines

Author: Kowalewski Roger
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 03/08/2021
Field of study

High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges on algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract parallelism which limits the integration of latency hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes, or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expertise knowledge about HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) where a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on mathematical rules of binary operations and homomorphism, we expose data parallelism in a respective DAG to overlap computation with communication. The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates, how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The vast amount of asynchronous programming models, which are actively studied in the HPC community, benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with acclerator architectures, and to design more efficient communication algorithms

Digitale Hochschulschriften der LMU

MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory

Author: Balaji Pavan
Barrett Brian
Brightwell Ron
Buntinas Darius
Dinan James
Gropp William
Hoefler Torsten
Kale Vivek
Thakur Rajeev
Publication venue
Publication date: 18/06/2018
Field of study

Hybrid parallel programming with the message passing interface (MPI) for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizing two parallel programming systems in the same application. We introduce an MPI-integrated shared-memory programming model that is incorporated into MPI through a small extension to the one-sided communication interface. We discuss the integration of this interface with the MPI 3.0 one-sided semantics and describe solutions for providing portable and efficient data sharing, atomic operations, and memory consistency. We describe an implementation of the new interface in the MPICH2 and Open MPI implementations and demonstrate an average performance improvement of 40% to the communication component of a five-point stencil solve

Repository for Publications and Research Data

RERO DOC Digital Library

Proceedings of the 7th International Conference on PGAS Programming Models

Author
Publication venue: University of Edinburgh
Publication date: 29/10/2013
Field of study

Edinburgh Research Explorer

Toward Message Passing Failure Management

Author: Bland Wesley B.
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/01/2013
Field of study

As machine sizes have increased and application runtimes have lengthened, research into fault tolerance has evolved alongside. Moving from result checking, to rollback recovery, and to algorithm based fault tolerance, the type of recovery being performed has changed, but the programming model in which it executes has remained virtually static since the publication of the original Message Passing Interface (MPI) Standard in 1992. Since that time, applications have used a message passing paradigm to communicate between processes, but they could not perform process recovery within an MPI implementation due to limitations of the MPI Standard. This dissertation describes a new protocol using the exiting MPI Standard called Checkpoint-on-Failure to perform limited fault tolerance within the current framework of MPI, and proposes a new platform titled User Level Failure Mitigation (ULFM) to build more complete and complex fault tolerance solutions with a true fault tolerant MPI implementation. We will demonstrate the overhead involved in using these fault tolerant solutions and give examples of applications and libraries which construct other fault tolerance mechanisms based on the constructs provided in ULFM

University of Tennessee, Knoxville: Trace

CiteSeerX

Robust Scalable Sorting

Author: Axtmann Michael
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 23/08/2021
Field of study

Sortieren ist eines der wichtigsten algorithmischen Grundlagenprobleme. Es ist daher nicht verwunderlich, dass Sortieralgorithmen in einer Vielzahl von Anwendungen benötigt werden. Diese Anwendungen werden auf den unterschiedlichsten Geräten ausgeführt -- angefangen bei Smartphones mit leistungseffizienten Multi-Core-Prozessoren bis hin zu Supercomputern mit Tausenden von Maschinen, die über ein Hochleistungsnetzwerk miteinander verbunden sind. Spätestens seitdem die Single-Core-Leistung nicht mehr signifikant steigt, sind parallele Anwendungen in unserem Alltag nicht mehr wegzudenken. Daher sind effiziente und skalierbare Algorithmen essentiell, um diese immense Verfügbarkeit von (paralleler) Rechenleistung auszunutzen. Diese Arbeit befasst sich damit, wie sequentielle und parallele Sortieralgorithmen auf möglichst robuste Art maximale Leistung erzielen können. Dabei betrachten wir einen großen Parameterbereich von Eingabegrößen, Eingabeverteilungen, Maschinen sowie Datentypen. Im ersten Teil dieser Arbeit untersuchen wir sowohl sequentielles Sortieren als auch paralleles Sortieren auf Shared-Memory-Maschinen. Wir präsentieren In-place Parallel Super Scalar Samplesort (IPS⁴o), einen neuen vergleichsbasierten Algorithmus, der mit beschränkt viel Zusatzspeicher auskommt (die sogenannte „in-place” Eigenschaft). Eine wesentliche Erkenntnis ist, dass unsere in-place-Technik die Sortiergeschwindigkeit von IPS⁴o im Vergleich zu ähnlichen Algorithmen ohne in-place-Eigenschaft verbessert. Bisher wurde die Eigenschaft, mit beschränkt viel Zusatzspeicher auszukommen, eher mit Leistungseinbußen verbunden. IPS⁴o ist außerdem cache-effizient und führt

O(n/t\log n)

Arbeitsschritte pro Thread aus, um ein Array der Größe

n

mit

t

Threads zu sortieren. Zusätzlich berücksichtigt IPS⁴o Speicherlokalität, nutzt einen Entscheidungsbaum ohne Sprungvorhersagen und verwendet spezielle Partitionen für Elemente mit gleichem Schlüssel. Für den Spezialfall, dass ausschließlich ganzzahlige Schlüssel sortiert werden sollen, haben wir das algorithmische Konzept von IPS⁴o wiederverwendet, um In-place Parallel Super Scalar Radix Sort (IPS²Ra) zu implementieren. Wir bestätigen die Performance unserer Algorithmen in einer umfangreichen experimentellen Studie mit 21 State-of-the-Art-Sortieralgorithmen, sechs Datentypen, zehn Eingabeverteilungen, vier Maschinen, vier Speicherzuordnungsstrategien und Eingabegrößen, die über sieben Größenordnungen variieren. Einerseits zeigt die Studie die robuste Leistungsfähigkeit unserer Algorithmen. Andererseits deckt sie auf, dass viele konkurrierende Algorithmen Performance-Probleme haben: Mit IPS⁴o erhalten wir einen robusten vergleichsbasierten Sortieralgorithmus, der andere parallele in-place vergleichsbasierte Sortieralgorithmen fast um den Faktor drei übertrifft. In der überwiegenden Mehrheit der Fälle ist IPS⁴o der schnellste vergleichsbasierte Algorithmus. Dabei ist es nicht von Bedeutung, ob wir IPS⁴o mit Algorithmen vergleichen, die mit beschränkt viel Zusatzspeicher auskommen, Zusatzspeicher in der Größenordnung der Eingabe benötigen, und parallel oder sequentiell ausgeführt werden. IPS⁴o übertrifft in vielen Fällen sogar konkurrierende Implementierungen von Integer-Sortieralgorithmen. Die verbleibenden Fälle umfassen hauptsächlich gleichmäßig verteilte Eingaben und Eingaben mit Schlüsseln, die nur wenige Bits enthalten. Diese Eingaben sind in der Regel „einfach” für Integer-Sortieralgorithmen. Unser Integer-Sorter IPS²Ra übertrifft andere Integer-Sortieralgorithmen für diese Eingaben in der überwiegenden Mehrheit der Fälle. Ausnahmen sind einige sehr kleine Eingaben, für die die meisten Algorithmen sehr ineffizient sind. Allerdings sind Algorithmen, die auf diese Eingabegrößen abzielen, in der Regel für alle anderen Eingaben deutlich langsamer. Im zweiten Teil dieser Arbeit untersuchen wir skalierbare Sortieralgorithmen für verteilte Systeme, welche robust in Hinblick auf die Eingabegröße, häufig vorkommende Sortierschlüssel, die Verteilung der Sortierschlüssel auf die Prozessoren und die Anzahl an Prozessoren sind. Das Resultat unserer Arbeit sind im Wesentlichen vier robuste skalierbare Sortieralgorithmen, mit denen wir den gesamten Bereich an Eingabegrößen abdecken können. Drei dieser vier Algorithmen sind neue, schnelle Algorithmen, welche so implementiert sind, dass sie nur einen geringen Zusatzaufwand benötigen und gleichzeitig unabhängig von „schwierigen” Eingaben robust skalieren. Es handelt sich z.B. um „schwierige” Eingaben, wenn viele gleiche Elemente vorkommen oder die Eingabeelemente in Hinblick auf ihre Sortierschlüssel ungünstig auf die Prozessoren verteilt sind. Bisherige Algorithmen für mittlere und größere Eingabegrößen weisen ein unzumutbar großes Kommunikationsvolumen auf oder tauschen unverhältnismäßig oft Nachrichten aus. Für diese Eingabegrößen beschreiben wir eine robuste, mehrstufige Verallgemeinerung von Samplesort, die einen brauchbaren Kompromiss zwischen dem Kommunikationsvolumen und der Anzahl ausgetauschter Nachrichten darstellt. Wir überwinden diese bisher unvereinbaren Ziele mittels einer skalierbaren approximativen Splitterauswahl sowie eines neuen Datenumverteilungsalgorithmus. Als eine Alternative stellen wir eine Verallgemeinerung von Mergesort vor, welche den Vorteil von perfekt ausbalancierter Ausgabe hat. Für kleine Eingaben entwerfen wir eine Variante von Quicksort. Mit wenig Zusatzaufwand vermeidet sie das Problem ungünstiger Elementverteilungen und häufig vorkommender Sortierschlüssel, indem sie schnell qualitativ hochwertige Splitter auswählt, die Elemente zufällig den Prozessoren zuweist und einer Duplikat-Behandlung unterzieht. Bisherige praktische Ansätze mit polylogarithmischer Latenz haben entweder einen logarithmischen Faktor mehr Kommunikationsvolumen oder berücksichtigen nur gleichverteilte Eingaben ohne mehrfach vorkommende Sortierschlüssel. Für sehr kleine Eingaben schlagen wir einen einfachen sowie schnellen, jedoch arbeitsineffizienten Algorithmus mit logarithmischer Latenzzeit vor. Für diese Eingaben sind bisherige effiziente Ansätze nur theoretische Algorithmen, die meist unverhältnismäßig große konstante Faktoren haben. Für die kleinsten Eingaben empfehlen wir die Daten zu sortieren, während sie an einen einzelnen Prozessor geschickt werden. Ein wichtiger Beitrag dieser Arbeit zu der praktischen Seite von Algorithm Engineering ist die Kommunikationsbibliothek RangeBasedComm (RBC). Mit RBC ermöglichen wir eine effiziente Umsetzung von rekursiven Algorithmen mit sublinearer Laufzeit, indem sie skalierbare und effiziente Kommunikationsfunktionen für Teilmengen von Prozessoren bereitstellt. Zuletzt präsentieren wir eine umfangreiche experimentelle Studie auf zwei Supercomputern mit bis zu 262144 Prozessorkernen, elf Algorithmen, zehn Eingabeverteilungen und Eingabegrößen variierend über neun Größenordnungen. Mit Ausnahme von den größten Eingabegrößen ist diese Arbeit die einzige, die überhaupt Sortierexperimente auf Maschinen dieser Größe durchführt. Die RBC-Bibliothek beschleunigt die Algorithmen teilweise drastisch – einen konkurrierenden Algorithmus sogar um mehr als zwei Größenordnungen. Die Studie legt dar, dass unsere Algorithmen robust sind und gleichzeitig konkurrierende Implementierungen leistungsmäßig deutlich übertreffen. Die Konkurrenten, die man normalerweise betrachtet hätte, stürzen bei „schwierigen” Eingaben sogar ab

KITopen