10 research outputs found

    Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension

    Get PDF
    As the scale of High-performance Computing (HPC) systems continues to grow, researchers are devoted themselves to achieve the best performance of running long computing jobs on these systems. My research focus on reliability and efficiency study for HPC software. First, as systems become larger, mean-time-to-failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. Handling system failures becomes a prime challenge. My research aims to present a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. Using multiple overlapping topologies to optimize the detection and propagation, minimizing the incurred overhead sand guaranteeing the scalability of the entire framework. Results from different machines and benchmarks compared to related works shows that my design and implementation outperforms non-HPC solutions significantly, and is competitive with specialized HPC solutions that can manage only MPI applications. Second, I endeavor to implore instruction level parallelization to achieve optimal performance. Novel processors support long vector extensions, which enables researchers to exploit the potential peak performance of target architectures. Intel introduced Advanced Vector Extension (AVX512 and AVX2) instructions for x86 Instruction Set Architecture (ISA). Arm introduced Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelisms. My research utilizes long vector reduction instructions to improve the performance of MPI reduction operations. Also, I use gather and scatter feature to speed up the packing and unpacking operation in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architecture and efficient

    Productive Programming Systems for Heterogeneous Supercomputers

    Get PDF
    The majority of today's scientific and data analytics workloads are still run on relatively energy inefficient, heavyweight, general-purpose processing cores, often referred to in the literature as latency-oriented architectures. The flexibility of these architectures and the programmer aids included (e.g. large and deep cache hierarchies, branch prediction logic, pre-fetch logic) makes them flexible enough to run a wide range of applications fast. However, we have started to see growth in the use of lightweight, simpler, energy-efficient, and functionally constrained cores. These architectures are commonly referred to as throughput-oriented. Within each shared memory node, the computational backbone of future throughput-oriented HPC machines will consist of large pools of lightweight cores. The first wave of throughput-oriented computing came in the mid 2000's with the use of GPUs for general-purpose and scientific computing. Today we are entering the second wave of throughput-oriented computing, with the introduction of NVIDIA Pascal GPUs, Intel Knights Landing Xeon Phi processors, the Epiphany Co-Processor, the Sunway MPP, and other throughput-oriented architectures that enable pre-exascale computing. However, while the majority of the FLOPS in designs for future HPC systems come from throughput-oriented architectures, they are still commonly paired with latency-oriented cores which handle management functions and lightweight/un-parallelizable computational kernels. Hence, most future HPC machines will be heterogeneous in their processing cores. However, the heterogeneity of future machines will not be limited to the processing elements. Indeed, heterogeneity will also exist in the storage, networking, memory, and software stacks of future supercomputers. As a result, it will be necessary to combine many different programming models and libraries in a single application. How to do so in a programmable and well-performing manner is an open research question. This thesis addresses this question using two approaches. First, we explore using managed runtimes on HPC platforms. As a result of their high-level programming models, these managed runtimes have a long history of supporting data analytics workloads on commodity hardware, but often come with overheads which make them less common in the HPC domain. Managed runtimes are also not supported natively on throughput-oriented architectures. Second, we explore the use of a modular programming model and work-stealing runtime to compose the programming and scheduling of multiple third-party HPC libraries. This approach leverages existing investment in HPC libraries, unifies the scheduling of work on a platform, and is designed to quickly support new programming model and runtime extensions. In support of these two approaches, this thesis also makes novel contributions in tooling for future supercomputers. We demonstrate the value of checkpoints as a software development tool on current and future HPC machines, and present novel techniques in performance prediction across heterogeneous cores

    Supercomputing Frontiers

    Get PDF
    This open access book constitutes the refereed proceedings of the 7th Asian Conference Supercomputing Conference, SCFA 2022, which took place in Singapore in March 2022. The 8 full papers presented in this book were carefully reviewed and selected from 21 submissions. They cover a range of topics including file systems, memory hierarchy, HPC cloud platform, container image configuration workflow, large-scale applications, and scheduling

    Parallel Asynchronous Matrix Multiplication for a Distributed Pipelined Neural Network

    Get PDF
    Machine learning is an approach to devise algorithms that compute an output without a given rule set but based on a self-learning concept. This approach is of great importance for several fields of applications in science and industry where traditional programming methods are not sufficient. In neural networks, a popular subclass of machine learning algorithms, commonly previous experience is used to train the network and produce good outputs for newly introduced inputs. By increasing the size of the network more complex problems can be solved which again rely on a huge amount of training data. Increasing the complexity also leads to higher computational demand and storage requirements and to the need for parallelization. Several parallelization approaches of neural networks have already been considered. Most approaches use special purpose hardware whilst other work focuses on using standard hardware. Often these approaches target the problem by parallelizing the training data. In this work a new parallelization method named poadSGD is proposed for the parallelization of fully-connected, largescale feedforward networks on a compute cluster with standard hardware. poadSGD is based on the stochastic gradient descent algorithm. A block-wise distribution of the network's layers to groups of processes and a pipelining scheme for batches of the training samples are used. The network is updated asynchronously without interrupting ongoing computations of subsequent batches. For this task a one-sided communication scheme is used. A main algorithmic part of the batch-wise pipelined version consists of matrix multiplications which occur for a special distributed setup, where each matrix is held by a different process group. GASPI, a parallel programming model from the field of "Partitioned Global Address Spaces" (PGAS) models is introduced and compared to other models from this class. As it mainly relies on one-sided and asynchronous communication it is a perfect candidate for the asynchronous update task in the poadSGD algorithm. Therefore, the matrix multiplication is also implemented based GASPI. In order to efficiently handle upcoming synchronizations within the process groups and achieve a good workload distribution, a two-dimensional block-cyclic data distribution is applied for the matrices. Based on this distribution, the multiplication algorithm is computed by diagonally iterating over the sub blocks of the resulting matrix and computing the sub blocks in subgroups of the processes. The sub blocks are computed by sharing the workload between the process groups and communicating mostly in pairs or in subgroups. The communication in pairs is set up to be overlapped by other ongoing computations. The implementations provide a special challenge, since the asynchronous communication routines must be handled with care as to which processor is working at what point in time with which data in order to prevent an unintentional dual use of data. The theoretical analysis shows the matrix multiplication to be superior to a naive implementation when the dimension of the sub blocks of the matrices exceeds 382. The performance achieved in the test runs did not withstand the expectations the theoretical analysis predicted. The algorithm is executed on up to 512 cores and for matrices up to a size of 131,072 x 131,072. The implementation using the GASPI API was found not be straightforward but to provide a good potential for overlapping communication with computations whenever the data dependencies of an application allow for it. The matrix multiplication was successfully implemented and can be used within an implementation of the poadSGD method that is yet to come. The poadSGD method seems to be very promising, especially as nowadays, with the larger amount of data and the increased complexity of the applications, the approaches to parallelization of neural networks are increasingly of interest

    Software for Exascale Computing - SPPEXA 2016-2019

    Get PDF
    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's sumpercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest

    Accelerating Network Communication and I/O in Scientific High Performance Computing Environments

    Get PDF
    High performance computing has become one of the major drivers behind technology inventions and science discoveries. Originally driven through the increase of operating frequencies and technology scaling, a recent slowdown in this evolution has led to the development of multi-core architectures, which are supported by accelerator devices such as graphics processing units (GPUs). With the upcoming exascale era, the overall power consumption and the gap between compute capabilities and I/O bandwidth have become major challenges. Nowadays, the system performance is dominated by the time spent in communication and I/O, which highly depends on the capabilities of the network interface. In order to cope with the extreme concurrency and heterogeneity of future systems, the software ecosystem of the interconnect needs to be carefully tuned to excel in reliability, programmability, and usability. This work identifies and addresses three major gaps in today's interconnect software systems. The I/O gap describes the disparity in operating speeds between the computing capabilities and second storage tiers. The communication gap is introduced through the communication overhead needed to synchronize distributed large-scale applications and the mixed workload. The last gap is the so called concurrency gap, which is introduced through the extreme concurrency and the inflicted learning curve posed to scientific application developers to exploit the hardware capabilities. The first contribution is the introduction of the network-attached accelerator approach, which moves accelerators into a "stand-alone" cluster connected through the Extoll interconnect. The novel communication architecture enables the direct accelerators communication without any host interactions and an optimal application-to-compute-resources mapping. The effectiveness of this approach is evaluated for two classes of accelerators: Intel Xeon Phi coprocessors and NVIDIA GPUs. The next contribution comprises the design, implementation, and evaluation of the support of legacy codes and protocols over the Extoll interconnect technology. By providing TCP/IP protocol support over Extoll, it is shown that the performance benefits of the interconnect can be fully leveraged by a broader range of applications, including the seamless support of legacy codes. The third contribution is twofold. First, a comprehensive analysis of the Lustre networking protocol semantics and interfaces is presented. Afterwards, these insights are utilized to map the LNET protocol semantics onto the Extoll networking technology. The result is a fully functional Lustre network driver for Extoll. An initial performance evaluation demonstrates promising bandwidth and message rate results. The last contribution comprises the design, implementation, and evaluation of two easy-to-use load balancing frameworks, which transparently distribute the I/O workload across all available storage system components. The solutions maximize the parallelization and throughput of file I/O. The frameworks are evaluated on the Titan supercomputing systems for three I/O interfaces. For example for large-scale application runs, POSIX I/O and MPI-IO can be improved by up to 50% on a per job basis, while HDF5 shows performance improvements of up to 32%
    corecore