949 research outputs found

    Memory performance of and-parallel prolog on shared-memory architectures

    Get PDF
    The goal of the RAP-WAM AND-parallel Prolog abstract architecture is to provide inference speeds significantly beyond those of sequential systems, while supporting Prolog semantics and preserving sequential performance and storage efficiency. This paper presents simulation results supporting these claims with special emphasis on memory performance on a two-level sharedmemory multiprocessor organization. Several solutions to the cache coherency problem are analyzed. It is shown that RAP-WAM offers good locality and storage efficiency and that it can effectively take advantage of broadcast caches. It is argued that speeds in excess of 2 ML IPS on real applications exhibiting medium parallelism can be attained with current technology

    Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation

    Full text link
    The increasing number of processing elements and decreas- ing memory to core ratio in modern high-performance platforms makes efficient strong scaling a key requirement for numerical algorithms. In order to achieve efficient scalability on massively parallel systems scientific software must evolve across the entire stack to exploit the multiple levels of parallelism exposed in modern architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP parallelisation to optimise parallel sparse matrix-vector multiplication in PETSc, a widely used scientific library for the scalable solution of partial differential equations. Using large matrices generated by Fluidity, an open source CFD application code which uses PETSc as its linear solver engine, we evaluate the effect of explicit communication overlap using task-based parallelism and show how to further improve performance by explicitly load balancing threads within MPI processes. We demonstrate a significant speedup over the pure-MPI mode and efficient strong scaling of sparse matrix-vector multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems

    Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining

    Full text link
    Extensive system scales (i.e. thousands of GPU/TPUs) and prolonged training periods (i.e. months of pretraining) significantly escalate the probability of failures when training large language models (LLMs). Thus, efficient and reliable fault-tolerance methods are in urgent need. Checkpointing is the primary fault-tolerance method to periodically save parameter snapshots from GPU memory to disks via CPU memory. In this paper, we identify the frequency of existing checkpoint-based fault-tolerance being significantly limited by the storage I/O overheads, which results in hefty re-training costs on restarting from the nearest checkpoint. In response to this gap, we introduce an in-memory fault-tolerance framework for large-scale LLM pretraining. The framework boosts the efficiency and reliability of fault tolerance from three aspects: (1) Reduced Data Transfer and I/O: By asynchronously caching parameters, i.e., sharded model parameters, optimizer states, and RNG states, to CPU volatile memory, Our framework significantly reduces communication costs and bypasses checkpoint I/O. (2) Enhanced System Reliability: Our framework enhances parameter protection with a two-layer hierarchy: snapshot management processes (SMPs) safeguard against software failures, together with Erasure Coding (EC) protecting against node failures. This double-layered protection greatly improves the survival probability of the parameters compared to existing checkpointing methods. (3) Improved Snapshotting Frequency: Our framework achieves more frequent snapshotting compared with asynchronous checkpointing optimizations under the same saving time budget, which improves the fault tolerance efficiency. Empirical results demonstrate that Our framework minimizes the overhead of fault tolerance of LLM pretraining by effectively leveraging redundant CPU resources.Comment: Fault Tolerance, Checkpoint Optimization, Large Language Model, 3D parallelis

    Deterministic Consistency: A Programming Model for Shared Memory Parallelism

    Full text link
    The difficulty of developing reliable parallel software is generating interest in deterministic environments, where a given program and input can yield only one possible result. Languages or type systems can enforce determinism in new code, and runtime systems can impose synthetic schedules on legacy parallel code. To parallelize existing serial code, however, we would like a programming model that is naturally deterministic without language restrictions or artificial scheduling. We propose "deterministic consistency", a parallel programming model as easy to understand as the "parallel assignment" construct in sequential languages such as Perl and JavaScript, where concurrent threads always read their inputs before writing shared outputs. DC supports common data- and task-parallel synchronization abstractions such as fork/join and barriers, as well as non-hierarchical structures such as producer/consumer pipelines and futures. A preliminary prototype suggests that software-only implementations of DC can run applications written for popular parallel environments such as OpenMP with low (<10%) overhead for some applications.Comment: 7 pages, 3 figure

    Workload distribution for ray tracing in multi-core systems

    Get PDF
    One of the features that made interactive ray tracing possible over the last few years was the careful exploitation of the computational power and parallelism available on modern multicore processors. Multithreaded interactive ray tracing engines have to share the workload (rays to be processed) among rendering threads. This may be achieved by storing tasks on a shared FIFO-queue, accessed by all threads. Accessing this shared data structure requires a data access control mechanism, which ensures that the data structure is not corrupted. This access mechanism must incur minimal overheads such that performance is not penalized. This paper proposes a lock-free data access control mechanism to such queue, which avoids all locks by carefully reordering instructions. This technique is compared with a classical lock-based approach and with a conservative local technique, where each thread maintains its local queue of tasks and shares nothing with other threads. Although the local approach outperforms the other two due to very good load balancing conditions, we demonstrate that the lock-free approach outperforms the lock-based one for large processor counts. Efficient and reliable sharing of data structures within a shared memory system is becoming a very relevant problem with the advent of many core processors. Lock free approaches are a promising manner of achieving such goal

    A Flexible Framework For Implementing Multi-Nested Software Transaction Memory

    Get PDF
    Programming with locks is very difficult in multi-threaded programmes. Concurrency control of access to shared data limits scalable locking strategies otherwise provided for in software transaction memory. This work addresses the subject of creating dependable software in the face of eminent failures. In the past, programmers who used lock-based synchronization to implement concurrent access to shared data had to grapple with problems with conventional locking techniques such as deadlocks, convoying, and priority inversion. This paper proposes another advanced feature for Dynamic Software Transactional Memory intended to extend the concepts of transaction processing to provide a nesting mechanism and efficient lock-free synchronization, recoverability and restorability. In addition, the code for implementation has also been researched, coded, tested, and implemented to achieve the desired objectives
    • …
    corecore