
    Minimizing synchronizations in sparse iterative solvers for distributed supercomputers

    Eliminating synchronizations is an important technique for minimizing communication in modern high performance computing. This paper discusses principles for reducing the communication caused by global synchronizations in sparse iterative solvers on distributed supercomputers. We demonstrate how to minimize global synchronizations by rescheduling a typical Krylov subspace method. The benefit of minimizing synchronizations is shown by theoretical analysis and verified by numerical experiments using up to 900 processors. The experiments also show that the communication complexity of some structured sparse matrix-vector multiplications and of the global communications on the underlying supercomputer are on the order of P^(1/2.5) and P^(4/5) respectively, where P is the number of processors; the experiments were carried out on a Dawning 5000A.
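    The paper's rescheduled Krylov method is not reproduced here, but the flavor of the saving can be sketched: when two inner products of an iteration are scheduled back to back, their local partial sums can share a single global reduction. The sketch below is only illustrative (it assumes mpi4py and NumPy, and the fused_dots helper is ours, not the paper's):

```python
# Minimal sketch: fusing two inner products into one allreduce so that each
# Krylov iteration pays for one global synchronization instead of two.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def fused_dots(r_local, w_local):
    """Return global (r.r, r.w) with a single reduction across all ranks."""
    local = np.array([r_local @ r_local, r_local @ w_local])
    result = np.empty_like(local)
    comm.Allreduce(local, result, op=MPI.SUM)  # the only synchronization point
    return result[0], result[1]

# Inside an iteration, with r_local and w_local holding this rank's slices:
# rr, rw = fused_dots(r_local, w_local)
```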

    Inner product computation for sparse iterative solvers on distributed supercomputer

    Recent years have shown that iterative Krylov methods, without redesign, are not suitable for distributed supercomputers because of their intensive global communications. It is well accepted that re-engineering Krylov methods for the target computer architecture is necessary to achieve higher performance and scalability. This paper focuses on simple and practical ways to reorganize Krylov methods and improve their performance on current heterogeneous distributed supercomputers. In contrast with most current software development for Krylov methods, which usually focuses on efficient matrix-vector multiplications, the paper focuses on how inner products are computed and explains why inner product computation on current heterogeneous distributed supercomputers is crucial for scalable Krylov methods. Communication complexity analysis shows how inner product computation can become the performance bottleneck of (inner) product-type iterative solvers on distributed supercomputers due to global communications. Principles for reducing such global communications are discussed. The importance of minimizing communications is demonstrated by experiments using up to 900 processors, carried out on a Dawning 5000A, one of the earliest and fastest heterogeneous supercomputers in the world. Both the analysis and the experiments indicate that inner product computation is very likely to be the most challenging kernel for inner product-based iterative solvers to reach exascale.
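    A rough alpha-beta latency model (the constants below are hypothetical, not measurements from the Dawning 5000A) illustrates why merging several scalar reductions into one helps: the log(P) latency term is paid once rather than once per inner product.

```python
# Illustrative cost model: one allreduce of m words costs roughly
# alpha * log2(P) + beta * m, so batching k scalar dot products into a single
# length-k reduction amortizes the latency term.
import math

def allreduce_cost(P, m, alpha=5e-6, beta=2e-9):  # alpha/beta are assumed
    return alpha * math.log2(P) + beta * m

P = 900
separate = 2 * allreduce_cost(P, 1)   # two single-scalar reductions
batched = allreduce_cost(P, 2)        # one reduction of a length-2 vector
print(f"separate: {separate:.2e} s, batched: {batched:.2e} s")
```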

    Characterizing Synchronous Writes in Stable Memory Devices

    Distributed algorithms that operate in the fail-recovery model rely on the state stored in stable memory to guarantee the irreversibility of operations even in the presence of failures. The performance of these algorithms leans heavily on the performance of stable memory. Current storage technologies have a defined performance profile: data is accessed in blocks of hundreds or thousands of bytes, random access to these blocks is expensive, and sequential access is somewhat better. File system implementations hide some of the performance limitations of the underlying storage devices using buffers and caches. However, fail-recovery distributed algorithms bypass some of these techniques and perform synchronous writes to be able to tolerate a failure during the write itself. Assuming the distributed system designer is able to buffer the algorithm's writes, we ask how buffer size and latency complement each other. In this paper we start to answer this question by characterizing the performance (throughput and latency) of typical stable memory devices using a representative set of current file systems.
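    A minimal sketch of the kind of measurement such a characterization implies, assuming POSIX fsync semantics and a scratch file path chosen purely for illustration (this is not the authors' benchmark harness):

```python
# Rough timing sketch: synchronous writes of varying buffer sizes, forcing
# each write to stable storage before the next one starts.
import os
import time

def timed_sync_write(path, buf_size, count=100):
    data = os.urandom(buf_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.perf_counter()
    for _ in range(count):
        os.write(fd, data)
        os.fsync(fd)                 # force the block to stable storage
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed / count           # mean latency per synchronous write

for size in (512, 4096, 65536):
    print(size, timed_sync_write("/tmp/syncbench.dat", size))
```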

    LBM and SPH Scalability Using Task-based Programming

    Computational Fluid Dynamics encompasses a great variety of numerical approaches that approximate solutions to the Navier-Stokes equations, which generally describe the movements of viscous fluid substances. While the objectives of these approaches are to capture related physical phenomena, the details of different methods lend them to particular classes of problems, and scalable solutions are important to a large range of scientific and engineering applications. In this paper, we investigate the practical scalability of two proxy applications that are made to recreate the essential performance characteristics of Lattice-Boltzmann Methods (LBM) and Smoothed Particle Hydrodynamics (SPH), using the former to simulate the formation of vortices resulting from sustained, laminar flow, and the latter to simulate violent free surface flows without a mesh. The differing scalability properties of these methods suggest different designs and programming methods in order to exploit extreme scale computing platforms. In particular, we investigate implementations that enable the use of task-based programming constructs, which have received attention in recent years as a means of enabling improved parallel scalability by relaxing the synchronization requirements of the classical, bulk-synchronous execution that both LBM and SPH simulations exemplify. We find that suitable adaptations of the central data structures suggest that scalable LBM performance can be improved by tasking constructs in situations that are determined by an appropriate match between the input problem and the platform's performance characteristics. This suggests an adaptive scheme to identify and select the highest performing implementation at program initialization. The SPH implementation admits a substantial performance gain by partitioning the physical domain into a greater number of independent tasks than the number of participating processors, but its performance remains dependent on a powerful node architecture to support conventional SMP workloads, suggesting that further algorithmic improvements beyond the benefits of task programming are required to make it a strong candidate for exascale computing.
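    The over-decomposition idea can be sketched generically: split the domain into many more independent tasks than workers so the runtime can schedule around load imbalance instead of waiting at a bulk-synchronous barrier. The sketch below uses a Python process pool as a stand-in for a task-based runtime and is not the proxy applications' implementation; step_tile is a placeholder kernel.

```python
# Over-decomposition sketch: many more tiles than workers, each tile updated
# as an independent task within one time step.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def step_tile(tile):
    # Placeholder per-tile update standing in for an LBM collide/stream
    # or an SPH particle sweep.
    return tile * 0.99

def run_step(domain, tiles_per_worker=8, workers=4):
    tiles = np.array_split(domain, workers * tiles_per_worker)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        updated = list(pool.map(step_tile, tiles))
    return np.concatenate(updated)

if __name__ == "__main__":
    domain = np.random.rand(1_000_000)
    domain = run_step(domain)
```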

    EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

    Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Secondly, the inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves the uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE, including all models from the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup of 2.7x-3.5x and doubled the throughput while maintaining the distribution of the generated text.
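    A heavily simplified draft-and-verify loop conveys the speculative decoding setting that EAGLE builds on. Note that EAGLE itself drafts at the feature (second-to-top-layer) level and uses the full speculative-sampling acceptance rule; this sketch only shows greedy acceptance, and draft_next and target_argmax are hypothetical stand-ins for the draft head and the target model.

```python
# Simplified speculative decoding step: draft k tokens cheaply, verify them
# with one target forward pass, keep the longest agreeing prefix.
from typing import Callable, List

def speculative_step(tokens: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_argmax: Callable[[List[int]], List[int]],
                     k: int = 4) -> List[int]:
    # 1) Draft k tokens one at a time with the cheap model.
    draft = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) One target pass scores all drafted positions: target_argmax returns
    #    the target's greedy token after each prefix tokens + draft[:i],
    #    for i = 0..k (length k + 1).
    preferred = target_argmax(tokens + draft)
    # 3) Accept the longest prefix the target agrees with; at the first
    #    disagreement, take the target's own token instead.
    accepted = []
    for i, t in enumerate(draft):
        if t == preferred[i]:
            accepted.append(t)
        else:
            accepted.append(preferred[i])
            break
    else:
        accepted.append(preferred[k])  # bonus token when everything matched
    return tokens + accepted
```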