45 research outputs found

    Collective-Optimized FFTs

    Full text link
    This paper measures the impact of the various alltoallv methods. Results are analyzed within Beatnik, a Z-model solver that is bottlenecked by HeFFTe and representative of applications that rely on FFTs

    Optimizing computation-communication overlap in asynchronous task-based programs

    Get PDF
    Asynchronous task-based programming models are gaining popularity to address the programmability and performance challenges in high performance computing. One of the main attractions of these models and runtimes is their potential to automatically expose and exploit overlap of computation with communication. However, we find that inefficient interactions between these programming models and the underlying messaging layer (in most cases, MPI) limit the achievable computation-communication overlap and negatively impact the performance of parallel programs. We address this challenge by exposing and exploiting information about MPI internals in a task-based runtime system to make better task-creation and scheduling decisions. In particular, we present two mechanisms for exchanging information between MPI and a task-based runtime, and analyze their trade-offs. Further, we present a detailed evaluation of the proposed mechanisms implemented in MPI and a task-based runtime. We show performance improvements of up to 16.3% and 34.5% for proxy applications with point-to-point and collective communication, respectively.Peer ReviewedPostprint (author's final draft

    Optimizing Communication for Massively Parallel Processing

    Get PDF
    The current trends in high performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Bluegene-L machine with 128k processors (which is currently being deployed) is an important step in this direction. In this scenario, it is going to be a significant burden for the programmer to manually scale his applications. This task of scaling involves addressing issues like load-imbalance and communication overhead. In this thesis, we explore several communication optimizations to help parallel applications to easily scale on a large number of processors. We also present automatic runtime techniques to relieve the programmer from the burden of optimizing communication in his applications. This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computation and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces having co-processors, this happens to be true and processor virtualization has a natural advantage on such interconnects. The communication optimizations we present in this thesis, are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers. So, improving their performance would be of great value. We have successfully scaled NAMD to 1TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis. We study both point-to-point communication and collective communication (specifically all-to-all communication). On a large number of processors all-to-all communication can take several milli-seconds to finish. With synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. Therefore, we demonstrate an asynchronous collective communication framework, to let the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, number of processors and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication. The communication optimization framework presented in this thesis, has been designed to optimize communication in the context of processor virtualization and dynamic migrating objects. We present the streaming strategy to optimize fine grained object-to-object communication. In this thesis, we motivate the need for hardware collectives, as processor based collectives can be delayed by intermediate that processors busy with computation. We explore a next generation interconnect that supports collectives in the switching hardware. We show the performance gains of hardware collectives through synthetic benchmarks

    Implémentation et évaluation d’algorithmes parallèles de FFTs 3D à base de modèles de composants logiciels

    Get PDF
    International audienceThe Fast Fourier Transform (FFT) is a widely-used building block for many high-performance scientific applications. Efficient computing of FFT is paramount for the performance of these applications. This has led to many efforts to implement machine and computation specific optimizations. However, no existing FFT library is capable of easily integrating and automating the selection of new and/or unique optimizations.To ease FFT specialization, this study evaluates the use of component-based software engineering, a programming paradigm which consists in building applications by assembling small software units. Component models are known to have many software engineering benefits but usually have insufficient performance for high-performance scientific applications.This talk uses the L²C model, a general purpose high-performance component model, and studies its performance and adaptation capabilities on 3D FFTs. Experiments show that L²C, and components in general, enables easy handling of 3D FFT specializations while obtaining performance comparable to that of well-known libraries.La transformée de Fourier rapide (FFT) est un élément fondamentale fréquemment utilisé dans de nombreuses applications scientifiques de haute performance. Calculer efficacement des FFT est ainsi primordial pour la performance de ces applications. Cela a conduit à de nombreux efforts pour implémenter des optimisations spécifiques à un matériel ou à une classe d'algorithmes donnée. Cependant, aucune bibliothèque de FFT existante permet facilement d'intégrer et d'automatiser la sélection de nouvelles optimisations et / ou d'optimisations uniques.Cette étude vise à évaluer l'utilisation de techniques de génie logicielle à base de composants, un paradigme de programmation qui consiste à construire des applications en assemblant de petites briques logiciels. Les modèles de composants sont connus pour avoir de nombreux avantages de génie logiciel, mais ont généralement des performances insuffisantes pour les applications scientifiques de haute performance.Cette étude s'intéresse à l'utilisation du modèle L²C, un modèle de composants de haute performance, et étudie ses performances et sa capacités à pouvoir adapter les applications de FFT 3D. Les expériences montrent que L²C et les composants en général, permettent de manipuler facilement la structure des applications de FFT 3D via une spécialisations d'assemblage, tout en obtenant des performances comparables à celle des bibliothèques bien connues

    High performance communication on reconfigurable clusters

    Get PDF
    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the us/day range with a commodity cloud

    Nonblocking collectives for scalable Java communications

    Get PDF
    This is the peer reviewed version of the following article: Ramos, S., Taboada, G. L., Expósito, R. R., & Touriño, J. (2015). Nonblocking collectives for scalable Java communications. Concurrency and Computation: Practice and Experience, 27(5), 1169-1187, which has been published in final form at https://doi.org/10.1002/cpe.3279. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.[Abstract] This paper presents a Java implementation of the recently published MPI 3.0 nonblocking message passing collectives in order to analyze and assess the feasibility of taking advantage of these operations in shared memory systems using Java. Nonblocking collectives aim to exploit the overlapping between computation and communication for collective operations to increase scalability of message passing codes, as it has been carried out for nonblocking point‐to‐point primitives. This scalability has become crucial not only for clusters but also for shared memory systems because of the current trend of increasing the number of cores per chip, which is leading to the generalization of multi‐core and many‐core processors. Message passing libraries based on remote direct memory access, thread‐based progression, or implementing pure multi‐threading shared memory support could potentially benefit from the lack of imposed synchronization by nonblocking collectives. But, although the distributed memory scenario has been well studied, the shared memory one has not been tackled yet. Hence, nonblocking collectives support has been included in FastMPJ, a Message Passing in Java (MPJ) implementation, and evaluated on a representative shared memory system, obtaining significant improvements because of overlapping and lack of implicit synchronization, and with barely any overhead imposed over common blocking operations.Ministerio de Ciencia e Innovación; TIN2010-16735Xunta de Galicia; CN2012/211Xunta de Galicia; GRC2013/05

    Optimization of communication intensive applications on HPC networks

    Get PDF
    Communication is a necessary but overhead inducing component of parallel programming. Its impact on application design and performance is due to several related aspects of a parallel job execution: network topology, routing protocol, suitability of algorithm being used to the network, job placement, etc. This thesis is aimed at developing an understanding of how communication plays out on networks of high performance computing systems and exploring methods that can be used to improve communication performance of large scale applications. Broadly speaking, three topics have been studied in detail in this thesis. The first of these topics is task mapping and job placement on practical installations of torus and dragonfly networks. Next, use of supervised learning algorithms for conducting diagnostic studies of how communication evolves on networks is explored. Finally, efficacy of packet-level simulations for prediction-based studies of communication performance on different networks using different network parameters is analyzed. The primary contribution of this thesis is development of scalable diagnostic and prediction methods that can assist in the process of network designing, adapting applications to future systems, and optimizing execution of applications on existing systems. These meth- ods include a supervised learning approach, a functional modeling tool (called Damselfly), and a PDES-based packet level simulator (called TraceR), all of which are described in this thesis
    corecore