34 research outputs found

    Scalability of Incompressible Flow Computations on Multi-GPU Clusters Using Dual-Level and Tri-Level Parallelism

    High-performance computing using graphics processing units (GPUs) is gaining popularity in scientific computing, with many large compute clusters being augmented with multiple GPUs in each node. We investigate hybrid tri-level (MPI-OpenMP-CUDA) parallel implementations to explore the efficiency and scalability of incompressible flow computations on GPU clusters of up to 128 GPUs. This work details some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism using OpenMP for intra-node and MPI for inter-node communication. Comparisons between the tri-level MPI-OpenMP-CUDA and dual-level MPI-CUDA implementations are shown using large computational fluid dynamics (CFD) simulations. Our results demonstrate that a tri-level parallel implementation does not provide a significant performance advantage over the dual-level implementation; however, further research is needed to establish whether this conclusion holds for clusters with a high GPU-per-node density or for software that can use OpenMP's fine-grain parallelism more effectively.
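
The coarse-grain levels named in this abstract (MPI between nodes, OpenMP within a node) can be pictured as a nested domain split. The following is a minimal sketch, not the paper's code: it assumes a 1D slab decomposition with one MPI rank per node and one OpenMP thread per GPU, and the names `split_range` and `tri_level_layout` are hypothetical.

```python
def split_range(n, parts):
    """Split range(n) into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def tri_level_layout(n_planes, n_nodes, gpus_per_node):
    """Map (node, gpu) -> half-open range of grid planes owned by that GPU.

    Assumed arrangement (an illustration only):
      level 1: MPI splits the grid across nodes,
      level 2: OpenMP threads split a node's slab across its GPUs,
      level 3: CUDA threads work on the cells inside each GPU's sub-slab.
    """
    layout = {}
    for node, (lo, hi) in enumerate(split_range(n_planes, n_nodes)):
        for gpu, (glo, ghi) in enumerate(split_range(hi - lo, gpus_per_node)):
            layout[(node, gpu)] = (lo + glo, lo + ghi)
    return layout


if __name__ == "__main__":
    for owner, planes in tri_level_layout(10, 2, 2).items():
        print(owner, planes)
```

In a dual-level MPI-CUDA layout the same sub-slabs would exist, but each GPU would be driven by its own MPI rank instead of by a thread inside a per-node rank.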

    ATCOM: Automatically tuned collective communication system for SMP clusters.

    Conventional implementations of collective communications are based on point-to-point communications, and their optimizations have focused on the efficiency of those communication algorithms. However, point-to-point communications are not the optimal choice for modern computing clusters of SMPs because of their two-level communication structure. In recent years, a few research efforts have investigated efficient collective communications for SMP clusters. This dissertation focuses on platform-independent algorithms and implementations in this area. There are two main approaches to implementing efficient collective communications for clusters of SMPs: using shared memory operations for intra-node communications, and overlapping inter-node and intra-node communications. The former fully utilizes the hardware-based shared memory of an SMP, and the latter takes advantage of the inherent hierarchy of the communications within a cluster of SMPs. Previous studies focused on SMP clusters from certain vendors, and the methods they proposed are not portable to other systems. Because performance optimization is very complicated and the development process is very time-consuming, self-tuning, platform-independent implementations are highly desirable. As proven in this dissertation, such an implementation can significantly outperform other portable point-to-point-based implementations and some platform-specific implementations. The dissertation describes in detail the architecture of the platform-independent implementation. There are four system components: shared-memory-based collective communications, overlapping mechanisms for inter-node and intra-node communications, a prediction-based tuning module, and a micro-benchmark-based tuning module. Each component is carefully designed with the goal of automatic tuning in mind.
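
The two-level communication structure mentioned in this abstract can be illustrated with a toy reduction. This is a minimal sketch under stated assumptions, not ATCOM itself: processes are modeled as plain lists grouped by node, the flat variant is assumed to be a simple gather to a root on node 0, and `two_level_reduce` and `flat_reduce` are hypothetical names.

```python
def two_level_reduce(values_by_node):
    """Sum all values using an intra-node step followed by an inter-node step.

    values_by_node: one inner list per SMP node, one value per process.
    Returns (total, number of messages that cross node boundaries).
    """
    # Intra-node step: each node combines its local values. On a real SMP this
    # is where shared memory would be used, so no network traffic is counted.
    partials = [sum(local) for local in values_by_node]
    # Inter-node step: every node leader except the root (node 0) sends one partial.
    inter_node_messages = len(partials) - 1
    return sum(partials), inter_node_messages


def flat_reduce(values_by_node):
    """Sum all values when every process sends directly to a root on node 0."""
    total = sum(v for local in values_by_node for v in local)
    # Every process outside node 0 sends its own message across the network.
    inter_node_messages = sum(len(local) for local in values_by_node[1:])
    return total, inter_node_messages


if __name__ == "__main__":
    cluster = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # 3 nodes x 4 processes
    print(two_level_reduce(cluster))
    print(flat_reduce(cluster))
```

Both variants compute the same sum; the difference is only in how many messages leave a node, which is the cost the hierarchy is meant to reduce.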

    Optimizing message-passing performance within symmetric multiprocessor systems

    The Message Passing Interface (MPI) has been widely used in parallel computing because of its portability, scalability, and ease of use. Message passing within Symmetric Multiprocessor (SMP) systems is an important part of any MPI library, since it enables parallel programs to run efficiently on SMP systems, or on clusters of SMP systems when combined with other means of communication such as TCP/IP. Most message-passing implementations use a shared memory pool as an intermediate buffer to hold messages, a lock mechanism to protect the pool, and a synchronization mechanism to coordinate the processes. However, performance varies significantly depending on how these are implemented. This work implements two SMP message-passing modules, using lock-based and lock-free approaches, for MPLite, a compact library that implements a subset of the most commonly used MPI functions. Various optimization techniques have been used to improve performance. The two modules are evaluated using a communication performance analysis tool called NetPIPE and compared with the implementations in other MPI libraries such as MPICH, MPICH2, LAM/MPI, and MPI/PRO. Performance tools such as PAPI and VTune are used to gather runtime information at the hardware level. This information, together with some cache theory and the hardware configuration, is used to explain various performance phenomena. Tests using a real application show the performance of the different implementations in practice. These results all show the improvements of the new techniques over existing implementations.
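
The lock-based design outlined in this abstract (a shared buffer, a lock protecting it, and a synchronization mechanism) can be sketched with threads standing in for processes. This is an illustration only, not MPLite code: `SharedMailbox` and `run_demo` are hypothetical names, and Python threads are assumed as a stand-in for processes sharing a memory segment.

```python
import threading
from collections import deque


class SharedMailbox:
    """Bounded message buffer guarded by one lock (inside the Condition)."""

    def __init__(self, capacity):
        self._slots = deque()                 # the shared intermediate buffer
        self._capacity = capacity
        self._cond = threading.Condition()    # lock plus wait/notify synchronization

    def send(self, msg):
        with self._cond:                      # acquire the lock protecting the pool
            while len(self._slots) >= self._capacity:
                self._cond.wait()             # buffer full: wait for the receiver
            self._slots.append(msg)
            self._cond.notify_all()

    def recv(self):
        with self._cond:
            while not self._slots:
                self._cond.wait()             # buffer empty: wait for the sender
            msg = self._slots.popleft()
            self._cond.notify_all()
            return msg


def run_demo(n_messages=100, capacity=4):
    """Send n_messages integers through the mailbox and return what was received."""
    box = SharedMailbox(capacity)
    received = []

    def consumer():
        for _ in range(n_messages):
            received.append(box.recv())

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(n_messages):
        box.send(i)
    worker.join()
    return received


if __name__ == "__main__":
    print(run_demo(10, 2))
```

A lock-free variant would replace the lock with atomic updates of head and tail indices; that is not shown here because Python offers no portable atomic primitives for it.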

    Methods for Multilevel Parallelism on GPU Clusters: Application to a Multigrid Accelerated Navier-Stokes Solver

    Computational Fluid Dynamics (CFD) is an important field in high-performance computing with numerous applications. Solving problems in thermal and fluid sciences demands enormous computing resources and has been one of the primary uses of supercomputers and large clusters. Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications substantially. While significant speedups have been obtained with single and multiple GPUs on a single workstation, large problems require more resources. Conventional clusters of central processing units (CPUs) are now being augmented with GPUs in each compute node to tackle large problems. The present research investigates methods of taking advantage of the multilevel parallelism in multi-node, multi-GPU systems to develop scalable simulation science software. The primary application developed is a cluster-ready, GPU-accelerated Navier-Stokes incompressible flow solver that includes advanced numerical methods, among them a geometric multigrid pressure Poisson solver. The research investigates multiple implementations to explore methods for overlapping computation and communication. It also explores methods for coarse-grain parallelism, including POSIX threads, MPI, and a hybrid OpenMP-MPI model. The application includes a number of usability features, including periodic VTK (Visualization Toolkit) output, a run-time configuration file, and flexible setup of obstacles to represent urban areas and complex terrain. Numerical features include a variety of time-stepping methods, buoyancy-driven flow, adaptive time-stepping, various iterative pressure solvers, and a new parallel 3D geometric multigrid solver. At each step, the project examines performance and scalability measures using the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA) and the Longhorn cluster at the Texas Advanced Computing Center (TACC). The results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics simulations.

    Performance engineering of hybrid message passing + shared memory programming on multi-core clusters

    The hybrid message passing + shared memory programming model combines two parallel programming styles within the same application in an effort to improve the performance and efficiency of parallel codes on modern multi-core clusters. This thesis presents a performance study of this model as it applies to two Molecular Dynamics (MD) applications. Both a large-scale production MD code and a smaller-scale example MD code have been adapted from existing message passing versions by adding shared memory parallelism to create hybrid message passing + shared memory applications. The performance of these hybrid applications has been investigated on different multi-core clusters and compared with the original pure message passing codes. This performance analysis reveals that the hybrid message passing + shared memory model provides performance improvements under some conditions, while the pure message passing model provides better performance in others. Typically, when running on small numbers of cores the pure message passing model performs better than the hybrid model, as hybrid performance suffers from the increased overheads of shared memory constructs. However, when running on large numbers of cores the hybrid model performs better, as these shared memory overheads are minimised while the pure message passing code suffers from increased communication overhead. These results depend on the interconnect used. Hybrid message passing + shared memory molecular dynamics codes are shown to exhibit different communication profiles from their pure message passing versions, and this is revealed to be a large factor in the performance difference between the two. An extension of this result shows that the choice of interconnection fabric used in a multi-core cluster has a large impact on the performance difference between the pure message passing and the hybrid code. The factors affecting the performance of the applications have been analytically examined in an effort to describe, generalise and predict the performance of both the pure message passing and hybrid message passing + shared memory codes.

    ATCOM: Automatically Tuned Collective Communication System for SMP Clusters

    Técnicas de optimización dinámicas de aplicaciones paralelas basadas en MPI

    Parallel computation on cluster architectures has become the most common solution for developing high-performance scientific applications. Message Passing Interface (MPI) [Mes94] is the message-passing library most widely used to provide communications in clusters. MPI provides a standard interface for operations such as point-to-point communication, collective communication, synchronization, and I/O operations. During the I/O phase, the processes frequently access a common data set by issuing a large number of small non-contiguous I/O requests [NKP+96a, SR98], which might create bottlenecks in the I/O subsystem. These bottlenecks are even more severe in commodity clusters, where commercial networks are usually installed. Many of those networks, such as Fast Ethernet or Gigabit, have high latency and low bandwidth, which introduce performance penalties during program execution. Scalability is also an important issue in cluster systems when many processors are used, which may cause network saturation and still higher latencies. As communication-intensive parallel applications spend a significant amount of their total execution time exchanging data between processes, these problems may lead to poor performance not only in the I/O subsystem but also in the communication phase. Therefore, we can conclude that it is necessary to develop techniques for improving the performance of both the communication and I/O subsystems. The main goal of this Ph.D. thesis is to improve the scalability and performance of MPI-based applications executed on clusters by reducing the overhead of the I/O and communication subsystems. In summary, this work proposes two techniques that solve these problems efficiently while managing the high complexity of a heterogeneous environment:
    • Reduction in the number of communications in collective I/O operations: This thesis targets the reduction of the bottleneck in the I/O subsystem.
Many applications use collective I/O operations to read/write data from/to disk. One of the most widely used is the Two-Phase I/O technique, extended by Thakur and Choudhary in ROMIO. In this technique, many communications among the processes are performed, which could create a bottleneck. This bottleneck is even more severe in commodity clusters, where commercial networks are usually installed, and in CMP clusters, where the I/O bus is shared by the cores of a single node. Therefore, we propose improving locality in order to reduce the number of communications performed in Two-Phase I/O.
• Reduction of transferred data volume: This thesis attempts to reduce the cost of interchanged messages by reducing the data volume through lossless compression among processes. Furthermore, we propose turning compression on and off and selecting at run time the most appropriate compression algorithm depending on the characteristics of each message, network performance, and the behavior of the compression algorithms.

Today, applications used in high-performance computing environments, such as scientific simulations or data-mining applications, need not only enormous compute and memory resources but also the ability to handle huge volumes of information. Cluster architectures have become the most common solution for running this kind of application. The MPI (Message Passing Interface) library [Mes94] is the most widely used in these environments, since it offers a standard interface for point-to-point communication, collective communication, synchronization, and I/O operations. During the I/O phase of an application, the processes access a large data set through small non-contiguous data requests, which can cause bottlenecks in the I/O system. These bottlenecks can be even more severe in clusters, which commonly use commercial networks such as Fast Ethernet or Gigabit that have high latency and low bandwidth. Scalability is also a significant problem in clusters when a large number of processes run at the same time, since they can saturate the network and increase latency. As a consequence of intensive communication, applications spend a great deal of time exchanging information between processes, causing problems in both the communication system and the I/O system. We can therefore conclude that, in a cluster, the I/O and communication subsystems are among the main elements whose performance should be improved. The main goal of this Ph.D. thesis is to improve the scalability and performance of MPI applications executed on cluster architectures by reducing the overhead of the communication and I/O systems. In summary, this work proposes two techniques to solve these problems efficiently:
• Reduction in the number of communications in collective I/O operations: One of the goals of this thesis is to reduce the bottlenecks produced in the I/O system. Many scientific applications use collective I/O operations to read/write data from/to disk. One of the most widely used techniques is Two-Phase I/O, extended by Thakur and Choudhary in ROMIO. In this technique, many communications are performed among the processes, which can create a bottleneck. This bottleneck is even more severe in clusters with commercial networks installed, and in multicore clusters where the I/O bus is shared by all the cores of a node. We therefore propose increasing locality while decreasing the number of communications that take place in Two-Phase I/O, in order to reduce I/O problems in cluster architectures.
• Reduction of data volume in communications: This thesis proposes reducing the cost of communications by using lossless compression techniques. Specifically, we propose turning compression on and off and choosing the compression algorithm at run time, depending on the characteristics of each message, the network, and the behavior of the compression algorithms.
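
The idea of switching compression on and off per message can be sketched as follows. This is a simplified illustration, not the thesis's method: it assumes `zlib` as the only lossless algorithm, decides using only message size and the achieved compression result (network performance is ignored), and the one-byte tags and the names `encode_message` and `decode_message` are hypothetical.

```python
import zlib

RAW, ZLIB = b"R", b"Z"   # hypothetical one-byte tags marking how the body is encoded


def encode_message(payload: bytes, min_size: int = 256) -> bytes:
    """Return a tagged message, compressed only when that actually saves bytes."""
    if len(payload) >= min_size:                # small messages: skip compression
        packed = zlib.compress(payload, 1)      # level 1: fast, modest ratio
        if len(packed) < len(payload):          # keep it only if it is smaller
            return ZLIB + packed
    return RAW + payload


def decode_message(wire: bytes) -> bytes:
    """Undo encode_message by inspecting the leading tag byte."""
    tag, body = wire[:1], wire[1:]
    return zlib.decompress(body) if tag == ZLIB else body


if __name__ == "__main__":
    small = b"halo"
    big = b"0.0 " * 1000                        # highly repetitive, compresses well
    for msg in (small, big):
        wire = encode_message(msg)
        print(wire[:1], len(msg), "->", len(wire))
        assert decode_message(wire) == msg
```

A fuller version would also weigh measured network bandwidth against compression time before choosing, and could select among several algorithms instead of one.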

    Analyzing Metadata Performance in Distributed File Systems

    Distributed file systems are important building blocks in modern computing environments. The challenge of increasing I/O bandwidth to files has been largely resolved by the use of parallel file systems and sufficient hardware. However, determining the best means of managing large amounts of metadata, which contains information about the files and directories stored in a distributed file system, has proved a more difficult challenge. The objective of this thesis is to analyze the role of metadata and to present past and current implementations and access semantics. Understanding the development of current file system interfaces and functionality is key to understanding their performance limitations. Based on this analysis, a distributed metadata benchmark termed DMetabench is presented. DMetabench significantly improves on existing benchmarks and allows metadata operations in a distributed file system to be stressed in a parallelized manner. Both intra-node and inter-node parallelism, current trends in computer architecture, can be explicitly tested with DMetabench. This matters because a distributed file system can have different semantics inside a client node than between multiple nodes. As measurements in larger distributed environments may exhibit performance artifacts that are difficult to explain by reference to average numbers, DMetabench uses a time-logging technique to record time-related changes in the performance of metadata operations and also logs additional details of the runtime environment for post-benchmark analysis. Using the large production file systems at the Leibniz Supercomputing Center (LRZ) in Munich, the functionality of DMetabench is evaluated by means of measurements on different distributed file systems. The results not only demonstrate the effectiveness of the methods proposed but also provide unique insight into the current state of metadata performance in modern file systems.
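
The time-logging idea in this abstract, recording progress over time instead of a single average, can be sketched for one metadata operation. This is a minimal single-process illustration and not DMetabench itself: it assumes file creation in a local temporary directory, and the name `timed_create_benchmark` and its parameters are hypothetical.

```python
import os
import tempfile
import time


def timed_create_benchmark(n_files=200, interval=0.01):
    """Create n_files empty files and log (elapsed_seconds, files_created) samples.

    A sample is taken roughly every `interval` seconds, so stalls show up as
    flat stretches in the log instead of being hidden in one average rate.
    """
    samples = []
    with tempfile.TemporaryDirectory() as workdir:   # removed automatically on exit
        start = time.perf_counter()
        next_tick = interval
        for i in range(n_files):
            # The metadata operation under test: create an empty file.
            with open(os.path.join(workdir, f"f{i}"), "w"):
                pass
            elapsed = time.perf_counter() - start
            if elapsed >= next_tick:
                samples.append((elapsed, i + 1))
                next_tick = elapsed + interval
        # Always record the final state so the log ends at n_files.
        samples.append((time.perf_counter() - start, n_files))
    return samples


if __name__ == "__main__":
    for elapsed, done in timed_create_benchmark():
        print(f"{elapsed:.4f}s  {done} files")
```

Running several such loops in parallel, within one node and across nodes, would be the natural next step toward the intra-node and inter-node tests the abstract mentions.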