375 research outputs found

    Scalable and Reliable Sparse Data Computation on Emergent High Performance Computing Systems

    Get PDF
    Heterogeneous systems with both CPUs and GPUs have become important system architectures in emergent High Performance Computing (HPC) systems. Heterogeneous systems must address both performance-scalability and power-scalability in the presence of failures. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but incurs significant time and energy overhead. The future exascale systems are expected to have higher power consumption with higher fault rates. Sparse data computation is the fundamental kernel in many scientific applications. It is suitable for the studies of scalability and resilience on heterogeneous systems due to its computational characteristics. To deliver the promised performance within the given power budget, heterogeneous computing mandates a deep understanding of the interplay between scalability and resilience. Managing scalability and resilience is challenging in heterogeneous systems, due to the heterogeneous compute capability, power consumption, and varying failure rates between CPUs and GPUs. Scalability and resilience have been traditionally studied in isolation, and optimizing one typically detrimentally impacts the other. While prior works have been proved successful in optimizing scalability and resilience on CPU-based homogeneous systems, simply extending current approaches to heterogeneous systems results in suboptimal performance-scalability and/or power-scalability. To address the above multiple research challenges, we propose novel resilience and energy-efficiency technologies to optimize scalability and resilience for sparse data computation on heterogeneous systems with CPUs and GPUs. First, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes, and develop and prototype performance optimization and power management strategies to improve scalability for sparse linear solvers. Our results quantitatively reveal that each resilience scheme has its own advantages depending on the fault rate, system size, and power budget, and the forward recovery can further benefit from our performance and power optimizations for large-scale computing. Second, we design a novel resilience technique that relaxes the requirement of synchronization and identicalness for processes, and allows them to run in heterogeneous resources with power reduction. Our results show a significant reduction in energy for unmodified programs in various fault situations compared to exact replication techniques. Third, we propose a novel distributed sparse tensor decomposition that utilizes an asynchronous RDMA-based approach with OpenSHMEM to improve scalability on large-scale systems and prove that our method works well in heterogeneous systems. Our results show our irregularity-aware workload partition and balanced-asynchronous algorithms are scalable and outperform the state-of-the-art distributed implementations. We demonstrate that understanding different bottlenecks for various types of tensors plays critical roles in improving scalability

    Scheduling Massively Parallel Multigrid for Multilevel Monte Carlo Methods

    Get PDF
    The computational complexity of naive, sampling-based uncertainty quantification for 3D partial differential equations is extremely high. Multilevel approaches, such as multilevel Monte Carlo (MLMC), can reduce the complexity significantly when they are combined with a fast multigrid solver, but to exploit them fully in a parallel environment, sophisticated scheduling strategies are needed. We optimize the concurrent execution across the three layers of the MLMC method: parallelization across levels, across samples, and across the spatial grid. In a series of numerical tests, the influence on the overall performance of the “scalability window” of the multigrid solver (i.e., the range of processor numbers over which good parallel efficiency can be maintained) is illustrated. Different homogeneous and heterogeneous scheduling strategies are proposed and discussed. Finally, large 3D scaling experiments are carried out, including adaptivity

    Coordinated power management in heterogeneous processors

    Get PDF
    Coordinated Power Management in Heterogeneous Processors Indrani Paul 164 pages Directed by Dr. Sudhakar Yalamanchili With the end of Dennard scaling, the scaling of device feature size by itself no longer guarantees sustaining the performance improvement predicted by Moore’s Law. As industry moves to increasingly small feature sizes, performance scaling will become dominated by the physics of the computing environment and in particular by the transient behavior of interactions between power delivery, power management and thermal fields. Consequently, performance scaling must be improved by managing interactions between physical properties, which we refer to as processor physics, and system level performance metrics, thereby improving the overall efficiency of the system. The industry shift towards heterogeneous computing is in large part motivated by energy efficiency. While such tightly coupled systems benefit from reduced latency and improved performance, they also give rise to new management challenges due to phenomena such as physical asymmetry in thermal and power signatures between the diverse elements and functional asymmetry in performance. Power-performance tradeoffs in heterogeneous processors are determined by coupled behaviors between major components due to the i) on-die integration, ii) programming model and the iii) processor physics. Towards this end, this thesis demonstrates the needs for coordinated management of functional and physical resources of a heterogeneous system across all major compute and memory elements. It shows that the interactions among performance, power delivery and different types of coupling phenomena are not an artifact of an architecture instance, but is fundamental to the operation of many core and heterogeneous architectures. Managing such coupling effects is a central focus of this dissertation. This awareness has the potential to exert significant influence over the design of future power and performance management algorithms. The high-level contributions of this thesis are i) in-depth examination of characteristics and performance demands of emerging applications using hardware measurements and analysis from state-of-the-art heterogeneous processors and high-performance GPUs, ii) analysis of the effects of processor physics such as power and thermals on system level performance, iii) identification of a key set of run-time metrics that can be used to manage these effects, and iv) development and detailed evaluation of online coordinated power management techniques to optimize system level global metrics in heterogeneous CPU-GPU-memory processors.Ph.D

    Scheduling Massively Parallel Multigrid for Multilevel Monte Carlo Methods

    Get PDF
    The computational complexity of naive, sampling-based uncertainty quantification for 3D partial differential equations is extremely high. Multilevel approaches, such as multilevel Monte Carlo (MLMC), can reduce the complexity significantly when they are combined with a fast multigrid solver, but to exploit them fully in a parallel environment, sophisticated scheduling strategies are needed. We optimize the concurrent execution across the three layers of the MLMC method: parallelization across levels, across samples, and across the spatial grid. In a series of numerical tests, the influence on the overall performance of the “scalability window” of the multigrid solver (i.e., the range of processor numbers over which good parallel efficiency can be maintained) is illustrated. Different homogeneous and heterogeneous scheduling strategies are proposed and discussed. Finally, large 3D scaling experiments are carried out, including adaptivity

    A framework for evaluating the impact of communication on performance in large-scale distributed urban simulations

    Get PDF
    A primary motivation for employing distributed simulation is to enable the execution of large-scale simulation workloads that cannot be handled by the resources of a single stand-alone computing node. To make execution possible, the workload is distributed among multiple computing nodes connected to one another via a communication network. The execution of a distributed simulation involves alternating phases of computation and communication to coordinate the co-operating nodes and ensure correctness of the resulting simulation outputs. Reliably estimating the execution performance of a distributed simulation can be difficult due to non-deterministic execution paths involved in alternating computation and communication operations. However, performance estimates are useful as a guide for the simulation time that can be expected when using a given set of computing resources. Performance estimates can support decisions to commit time and resources to running distributed simulations, especially where significant amounts of funds or computing resources are necessary. Various performance estimation approaches are employed in the distributed computing literature, including the influential Bulk Synchronous Parallel (BSP) and LogP models. Different approaches make various assumptions that render them more suitable for some applications than for others. Actual performance depends on characteristics inherent to each distributed simulation application. An important aspect of these individual characteristics is the dynamic relationship between the communication and computation phases of the distributed simulation application. This work develops a framework for estimating the performance of distributed simulation applications, focusing mainly on aspects relevant to the dynamic relationship between communication and computation during distributed simulation execution. The framework proposes a meta-simulation approach based on the Multi-Agent Simulation (MAS) paradigm. Using the approach proposed by the framework, meta-simulations can be developed to investigate the performance of specific distributed simulation applications. The proposed approach enables the ability to compare various what-if scenarios. This ability is useful for comparing the effects of various parameters and strategies such as the number of computing nodes, the communication strategy, and the workload-distribution strategy. The proposed meta-simulation approach can also aid a search for optimal parameters and strategies for specific distributed simulation applications. The framework is demonstrated by implementing a meta-simulation which is based on case studies from the Urban Simulation domain

    Evaluation of Distributed Programming Models and Extensions to Task-based Runtime Systems

    Get PDF
    High Performance Computing (HPC) has always been a key foundation for scientific simulation and discovery. And more recently, deep learning models\u27 training have further accelerated the demand of computational power and lower precision arithmetic. In this era following the end of Dennard\u27s Scaling and when Moore\u27s Law seemingly still holds true to a lesser extent, it is not a coincidence that HPC systems are equipped with multi-cores CPUs and a variety of hardware accelerators that are all massively parallel. Coupling this with interconnect networks\u27 speed improvements lagging behind those of computational power increases, the current state of HPC systems is heterogeneous and extremely complex. This was heralded as a great challenge to the software stacks and their ability to extract performance from these systems, but also as a great opportunity to innovate at the programming model level to explore the different approaches and propose new solutions. With usability, portability, and performance as the main factors to consider, this dissertation first evaluates some of the widely used parallel programming models (MPI, MPI+OpenMP, and task-based runtime systems) ability to manage the load imbalance among the processes computing the LU factorization of a large dense matrix stored in the Block Low-Rank (BLR) format. Next I proposed a number of optimizations and implemented them in PaRSEC\u27s Dynamic Task Discovery (DTD) model, including user-level graph trimming and direct Application Programming Interface (API) calls to perform data broadcast operation to further extend the limit of STF model. On the other hand, the Parameterized Task Graph (PTG) approach in PaRSEC is the most scalable approach for many different applications, which I then explored the possibility of combining both the algorithmic approach of Communication-Avoiding (CA) and the communication-computation overlapping benefits provided by runtime systems using 2D five-point stencil as the test case. This broad programming models evaluation and extension work highlighted the abilities of task-based runtime system in achieving scalable performance and portability on contemporary heterogeneous HPC systems. Finally, I summarized the profiling capability of PaRSEC runtime system, and demonstrated with a use case its important role in the performance bottleneck identification leading to optimizations

    Optimizing Collective Communication for Scalable Scientific Computing and Deep Learning

    Get PDF
    In the realm of distributed computing, collective operations involve coordinated communication and synchronization among multiple processing units, enabling efficient data exchange and collaboration. Scientific applications, such as simulations, computational fluid dynamics, and scalable deep learning, require complex computations that can be parallelized across multiple nodes in a distributed system. These applications often involve data-dependent communication patterns, where collective operations are critical for achieving high performance in data exchange. Optimizing collective operations for scientific applications and deep learning involves improving the algorithms, communication patterns, and data distribution strategies to minimize communication overhead and maximize computational efficiency. Within the context of this dissertation, the specific focus is on optimizing the alltoall operation in 3D Fast Fourier Transform (FFT) applications and the allreduce operation in parallel deep learning, particularly on High-Performance Computing (HPC) systems. Advanced communication algorithms and methods are explored and implemented to improve communication efficiency, consequently enhancing the overall performance of 3D FFT applications. Furthermore, this dissertation investigates the identification of performance bottlenecks during collective communication over Horovod on distributed systems. These bottlenecks are addressed by proposing an optimized parallel communication pattern specifically tailored to alleviate the aforementioned limitations during the training phase in distributed deep learning. The objective is to achieve faster convergence and improve the overall training efficiency. Moreover, this dissertation proposes fault tolerance and elastic scaling features for distributed deep learning by leveraging the User-Level Failure Mitigation (ULFM) from Message Passing Interface (MPI). By incorporating ULFM MPI, the dissertation aims to enhance the elastic capabilities of distributed deep learning systems. This approach enables graceful and lightweight handling of failures while facilitating seamless scaling in dynamic computing environments

    Runtime support for load balancing of parallel adaptive and irregular applications

    Get PDF
    Applications critical to today\u27s engineering research often must make use of the increased memory and processing power of a parallel machine. While advances in architecture design are leading to more and more powerful parallel systems, the software tools needed to realize their full potential are in a much less advanced state. In particular, efficient, robust, and high-performance runtime support software is critical in the area of dynamic load balancing. While the load balancing of loosely synchronous codes, such as field solvers, has been studied extensively for the past 15 years, there exists a class of problems, known as asynchronous and highly adaptive , for which the dynamic load balancing problem remains open. as we discuss, characteristics of this class of problems render compile-time or static analysis of little benefit, and complicate the dynamic load balancing task immensely.;We make two contributions to this area of research. The first is the design and development of a runtime software toolkit, known as the Parallel Runtime Environment for Multi-computer Applications, or PREMA, which provides interprocessor communication, a global namespace, a framework for the implementation of customized scheduling policies, and several such policies which are prevalent in the load balancing literature. The PREMA system is designed to support coarse-grained domain decompositions with the goals of portability, flexibility, and maintainability in mind, so that developers will quickly feel comfortable incorporating it into existing codes and developing new codes which make use of its functionality. We demonstrate that the programming model and implementation are efficient and lead to the development of robust and high-performance applications.;Our second contribution is in the area of performance modeling. In order to make the most effective use of the PREMA runtime software, certain parameters governing its execution must be set off-line. Optimal values for these parameters may be determined through repeated executions of the target application; however, this is not always possible, particularly in large-scale environments and long-running applications. We present an analytic model that allows the user to quickly and inexpensively predict application performance and fine-tune applications built on the PREMA platform

    A framework for evaluating the impact of communication on performance in large-scale distributed urban simulations

    Get PDF
    A primary motivation for employing distributed simulation is to enable the execution of large-scale simulation workloads that cannot be handled by the resources of a single stand-alone computing node. To make execution possible, the workload is distributed among multiple computing nodes connected to one another via a communication network. The execution of a distributed simulation involves alternating phases of computation and communication to coordinate the co-operating nodes and ensure correctness of the resulting simulation outputs. Reliably estimating the execution performance of a distributed simulation can be difficult due to non-deterministic execution paths involved in alternating computation and communication operations. However, performance estimates are useful as a guide for the simulation time that can be expected when using a given set of computing resources. Performance estimates can support decisions to commit time and resources to running distributed simulations, especially where significant amounts of funds or computing resources are necessary. Various performance estimation approaches are employed in the distributed computing literature, including the influential Bulk Synchronous Parallel (BSP) and LogP models. Different approaches make various assumptions that render them more suitable for some applications than for others. Actual performance depends on characteristics inherent to each distributed simulation application. An important aspect of these individual characteristics is the dynamic relationship between the communication and computation phases of the distributed simulation application. This work develops a framework for estimating the performance of distributed simulation applications, focusing mainly on aspects relevant to the dynamic relationship between communication and computation during distributed simulation execution. The framework proposes a meta-simulation approach based on the Multi-Agent Simulation (MAS) paradigm. Using the approach proposed by the framework, meta-simulations can be developed to investigate the performance of specific distributed simulation applications. The proposed approach enables the ability to compare various what-if scenarios. This ability is useful for comparing the effects of various parameters and strategies such as the number of computing nodes, the communication strategy, and the workload-distribution strategy. The proposed meta-simulation approach can also aid a search for optimal parameters and strategies for specific distributed simulation applications. The framework is demonstrated by implementing a meta-simulation which is based on case studies from the Urban Simulation domain

    Source Code Transformations Strategies to Load-Balance Grid Applications

    Full text link
    corecore