949 research outputs found
Memory performance of AND-parallel Prolog on shared-memory architectures
The goal of the RAP-WAM AND-parallel Prolog abstract architecture is to provide inference speeds significantly
beyond those of sequential systems, while supporting Prolog semantics and preserving sequential performance and storage efficiency. This paper presents simulation results supporting these claims with special emphasis on memory performance on a two-level shared-memory multiprocessor organization. Several solutions to the cache coherency problem are analyzed. It is shown that RAP-WAM offers good locality and storage efficiency and that it can effectively take advantage of broadcast caches. It is argued that speeds in excess of 2 MLIPS on real applications exhibiting medium parallelism can be attained with current technology.
Computer-aided programming for multiprocessing systems
As both the number of processors and the complexity of problems to be solved increase, programming multiprocessing systems becomes more difficult and error-prone. This report discusses parallel models of computation and tools for computer-aided programming (CAP). Program development tools are necessary since programmers are not able to develop complex parallel programs efficiently. In particular, a CAP tool, named Hypertool, is described here. It performs scheduling and handles the communication primitive insertion automatically so that many errors are eliminated. It also generates performance estimates and other program quality measures to help programmers improve their algorithms and programs. Experiments have shown that up to a 300% performance improvement can be achieved by computer-aided programming.
Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation
The increasing number of processing elements and decreasing memory-to-core
ratio in modern high-performance platforms make efficient strong scaling a key
requirement for numerical algorithms. In order to achieve efficient scalability
on massively parallel systems scientific software must evolve across the entire
stack to exploit the multiple levels of parallelism exposed in modern
architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP
parallelisation to optimise parallel sparse matrix-vector multiplication in
PETSc, a widely used scientific library for the scalable solution of partial
differential equations. Using large matrices generated by Fluidity, an open
source CFD application code which uses PETSc as its linear solver engine, we
evaluate the effect of explicit communication overlap using task-based
parallelism and show how to further improve performance by explicitly load
balancing threads within MPI processes. We demonstrate a significant speedup
over the pure-MPI mode and efficient strong scaling of sparse matrix-vector
multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems.
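The thread-level load balancing mentioned in this abstract can be illustrated with a minimal sketch. It is not the PETSc implementation: the function names (`balance_rows_by_nnz`, `csr_matvec_rows`, `threaded_csr_matvec`) are hypothetical, the matrix is assumed to be stored in CSR form, and Python threads stand in for OpenMP threads purely to show the partitioning (they give no real speedup here). MPI communication and its overlap with computation are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def balance_rows_by_nnz(indptr, num_threads):
    """Split rows into contiguous chunks holding roughly equal nonzero counts."""
    n_rows = len(indptr) - 1
    total = indptr[-1]
    bounds = [0]
    for t in range(1, num_threads):
        target = total * t / num_threads
        row = bounds[-1]
        # Advance until the cumulative nonzero count reaches this thread's share.
        while row < n_rows and indptr[row] < target:
            row += 1
        bounds.append(row)
    bounds.append(n_rows)
    return list(zip(bounds[:-1], bounds[1:]))


def csr_matvec_rows(indptr, indices, data, x, y, start, end):
    """Compute y[start:end] for a CSR matrix; each thread writes only its own rows."""
    for i in range(start, end):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y[i] = acc


def threaded_csr_matvec(indptr, indices, data, x, num_threads=2):
    """Multiply a CSR matrix by x, giving each thread a nonzero-balanced row chunk."""
    y = [0.0] * (len(indptr) - 1)
    chunks = balance_rows_by_nnz(indptr, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(csr_matvec_rows, indptr, indices, data, x, y, s, e)
            for s, e in chunks
        ]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return y


# 3x3 example matrix [[1, 2, 0], [0, 3, 0], [4, 0, 5]] in CSR form.
indptr = [0, 2, 3, 5]
indices = [0, 1, 1, 0, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
y = threaded_csr_matvec(indptr, indices, data, [1.0, 1.0, 1.0])
```

Splitting by nonzero count rather than by row count keeps the work per thread similar when row densities differ, which is the motivation for explicit load balancing in this sketch.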
Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining
Extensive system scales (i.e., thousands of GPUs/TPUs) and prolonged training
periods (i.e., months of pretraining) significantly escalate the probability of
failures when training large language models (LLMs). Thus, efficient and
reliable fault-tolerance methods are urgently needed. Checkpointing is the
primary fault-tolerance method: it periodically saves parameter snapshots from
GPU memory to disks via CPU memory. In this paper, we find that the frequency of
existing checkpoint-based fault tolerance is significantly limited by
storage I/O overheads, which results in hefty re-training costs on restarting
from the nearest checkpoint. In response to this gap, we introduce an in-memory
fault-tolerance framework for large-scale LLM pretraining. The framework boosts
the efficiency and reliability of fault tolerance from three aspects: (1)
Reduced Data Transfer and I/O: By asynchronously caching parameters, i.e.,
sharded model parameters, optimizer states, and RNG states, to CPU volatile
memory, our framework significantly reduces communication costs and bypasses
checkpoint I/O. (2) Enhanced System Reliability: Our framework enhances
parameter protection with a two-layer hierarchy: snapshot management processes
(SMPs) safeguard against software failures, together with Erasure Coding (EC)
protecting against node failures. This double-layered protection greatly
improves the survival probability of the parameters compared to existing
checkpointing methods. (3) Improved Snapshotting Frequency: Our framework
achieves more frequent snapshotting compared with asynchronous checkpointing
optimizations under the same saving time budget, which improves the fault
tolerance efficiency. Empirical results demonstrate that our framework
minimizes the overhead of fault tolerance of LLM pretraining by effectively
leveraging redundant CPU resources.
Comment: Fault Tolerance, Checkpoint Optimization, Large Language Model, 3D
parallelis
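The erasure-coding layer described in this abstract can be illustrated with the simplest possible scheme. The abstract does not say which code the framework uses, so single-parity XOR is an assumption made here for illustration; it tolerates the loss of exactly one shard. The helper names (`xor_bytes`, `make_parity`, `recover_shard`) are hypothetical.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards):
    """Return one parity shard: the byte-wise XOR of all equal-length shards."""
    parity = bytes(len(shards[0]))  # all-zero bytes, the XOR identity
    for shard in shards:
        parity = xor_bytes(parity, shard)
    return parity


def recover_shard(surviving_shards, parity):
    """Rebuild the single missing shard from the survivors and the parity."""
    missing = parity
    for shard in surviving_shards:
        missing = xor_bytes(missing, shard)
    return missing


# Three in-memory snapshots, one per node (assumed to be the same length).
shards = [b"node0-params", b"node1-params", b"node2-params"]
parity = make_parity(shards)

# Node 1 fails; its snapshot is rebuilt from the other nodes plus the parity.
rebuilt = recover_shard([shards[0], shards[2]], parity)
```

Codes that survive more than one simultaneous node failure (for example Reed-Solomon) follow the same store-redundancy-then-reconstruct pattern but need more than one parity shard.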
Deterministic Consistency: A Programming Model for Shared Memory Parallelism
The difficulty of developing reliable parallel software is generating
interest in deterministic environments, where a given program and input can
yield only one possible result. Languages or type systems can enforce
determinism in new code, and runtime systems can impose synthetic schedules on
legacy parallel code. To parallelize existing serial code, however, we would
like a programming model that is naturally deterministic without language
restrictions or artificial scheduling. We propose "deterministic consistency",
a parallel programming model as easy to understand as the "parallel assignment"
construct in sequential languages such as Perl and JavaScript, where concurrent
threads always read their inputs before writing shared outputs. DC supports
common data- and task-parallel synchronization abstractions such as fork/join
and barriers, as well as non-hierarchical structures such as producer/consumer
pipelines and futures. A preliminary prototype suggests that software-only
implementations of DC can run applications written for popular parallel
environments such as OpenMP with low (<10%) overhead for some applications.
Comment: 7 pages, 3 figures
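The "read inputs before writing shared outputs" rule can be sketched as a fork/join helper in which every task works on a private snapshot and writes are merged only at the join. This is an illustration of the idea, not the paper's prototype: `fork_join` is a hypothetical name, and raising an error on conflicting writes is an assumption, since the abstract does not say how conflicts are handled.

```python
from concurrent.futures import ThreadPoolExecutor


def fork_join(state, tasks):
    """Run tasks on private snapshots, then merge their writes at the join."""
    snapshot = dict(state)  # every task reads the same pre-fork values
    with ThreadPoolExecutor() as pool:
        # Each task gets its own copy and returns a dict of the writes it wants.
        results = list(pool.map(lambda task: task(dict(snapshot)), tasks))
    merged = dict(state)
    written = set()
    for writes in results:  # merged in task order, independent of thread timing
        conflict = written.intersection(writes)
        if conflict:
            raise ValueError(f"conflicting writes to {sorted(conflict)}")
        merged.update(writes)
        written.update(writes)
    return merged


# Two concurrent tasks swap x and y, like the parallel assignment (x, y) = (y, x).
swapped = fork_join(
    {"x": 1, "y": 2},
    [lambda s: {"x": s["y"]}, lambda s: {"y": s["x"]}],
)
```

Because no task can observe another task's writes before the join, the result does not depend on how the threads are scheduled.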
Workload distribution for ray tracing in multi-core systems
One of the features that made interactive ray tracing possible over the last few years was the careful exploitation of the computational power and parallelism available on modern multicore processors. Multithreaded interactive ray tracing engines have to share the workload (rays to be processed) among rendering threads. This may be achieved by storing tasks on a shared FIFO queue, accessed by all threads. Accessing this shared data structure requires a data access control mechanism, which ensures that the data structure is not corrupted. This access mechanism must incur minimal overhead so that performance is not penalized. This paper proposes a lock-free data access control mechanism for such a queue, which avoids all locks by carefully reordering instructions. This technique
is compared with a classical lock-based approach and with a conservative local technique, where each thread
maintains its local queue of tasks and shares nothing with other threads. Although the local approach outperforms
the other two due to very good load balancing conditions, we demonstrate that the lock-free approach outperforms
the lock-based one for large processor counts. Efficient and reliable sharing of data structures within a shared
memory system is becoming a very relevant problem with the advent of many-core processors. Lock-free approaches are a promising way of achieving this goal.
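Two of the three strategies compared in this abstract, the lock-based shared queue and the share-nothing local queues, can be sketched as follows. The function names and the `work` callback (a stand-in for tracing one ray) are hypothetical. The lock-free variant is deliberately not shown: it relies on instruction ordering and atomic operations that Python does not expose.

```python
import threading
from collections import deque


def render_shared(tasks, num_threads, work):
    """Lock-based variant: all threads pop from one shared FIFO queue."""
    queue = deque(tasks)
    lock = threading.Lock()
    results = []

    def worker():
        while True:
            with lock:  # the lock guards the shared queue
                if not queue:
                    return
                task = queue.popleft()
            value = work(task)  # the actual work runs outside the lock
            with lock:  # the same lock guards the shared result list
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def render_local(tasks, num_threads, work):
    """Local variant: tasks are split up front and threads share nothing."""
    local_queues = [tasks[i::num_threads] for i in range(num_threads)]
    local_results = [[] for _ in range(num_threads)]

    def worker(i):
        for task in local_queues[i]:
            local_results[i].append(work(task))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [value for part in local_results for value in part]


rays = list(range(20))
shared_out = render_shared(rays, 4, lambda r: r * r)
local_out = render_local(rays, 4, lambda r: r * r)
```

The local variant needs no synchronization at all, but it only balances well when tasks cost about the same, which matches the "very good load balancing conditions" the abstract mentions.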
A Flexible Framework For Implementing Multi-Nested Software Transaction Memory
Programming with locks is very difficult in multi-threaded programs. Concurrency control of access to shared data limits scalable locking strategies otherwise provided for in software transaction memory. This work addresses the subject of creating dependable software in the face of imminent failures. In the past, programmers who used lock-based synchronization to implement concurrent access to shared data had to grapple with the problems of conventional locking techniques, such as deadlocks, convoying, and priority inversion. This paper proposes another advanced feature for Dynamic Software Transactional Memory, intended to extend the concepts of transaction processing to provide a nesting mechanism and efficient lock-free synchronization, recoverability, and restorability. In addition, the implementation code has been written and tested to achieve the desired objectives.