68,091 research outputs found
Synchronization of processes
The study of the synchronization of processes is a very interesting field. It-brings together concepts that have originated in the design of operating systems, and of high level programming languages. Also it is becoming clear that the design of algorithms for parallel execution is intimately connected with synchronization problems. Some specialized synchronization problems have arisen in the design of data base systems. Indeed, distributed data bases provide an example of distributed processing that has immense practical significance. To summarize, synchronization of processes is a universal activity whose importance is being felt throughout computer science. The time has therefore come for the synchronization of processes to be studied as a topic in its own right. In this course I am taking such a broad viewpoint, and am trying to integrate some aspects of operating systems, languages, and parallel algorithms. However, this being a first attempt, the integration is not as thorough as I would have wished. Also, in the short time at my disposal, I am not able to discuss several very important topics, such as reliability
Recommended from our members
A model of time dependent behavior in concurrent software systems
A great difficulty in building distributed systems lies in being able to predict what the systems behavior will be. A distributed or communicating system is defined here to be one in in which the hardware consists of a set of processors each with their own memory, connected by some communication medium (there is no shared memory), and the software is assumed to be of the CSP (Hoare's Communicating Sequential Processes) type.In the past few years some theories have been proposed to model features of communicating systems. Milner's Calculus of communicating Systems (CCS), Winskel's Synchronization Trees (ST), Hennessy's Acceptance Trees (AT), and Hoare and Brookes's theory of communicating processes are examples of formal models of such systems. All of these models concentrate on modelling observable properties of a system.Event Dependency Trees (EDT) is a new representation of communicating systems that models the time dependent nature of such systems. None of the representations mentioned above explicitly represent time but time is precisely the factor that introduces so much variability and complexity into such software and systems. EDT provides a representation based on trees and a set of operations over the EDT trees that can be used to produce deadlock-free software. The model supplies potentially important information for the design and construction of distributed, parallel software systems
Workstation Clusters for Parallel Computing
Workstation clusters have become an increasingly popular alternative to traditional parallel supercomputers for many workloads requiring high performance computing. The use of parallel computing for scientific simulations has increased tremendously in the last ten years, and parallel implementations of scientific simulation codes are now in widespread use. There are two dominant parallel hardware/software architectures in use today: distributed memory, and shared memory. Systems implementing shared memory provide cooperating processes with a shared memory address space that can be accessed by all processors. In shared memory systems, parallel processing occurs through the use of shared data structures, or through emulation of message passing semantics in software. Distributed memory systems are composed of a number of interconnected computational nodes, which do not share memory, but can communicate with each other through a high-performance network of some kind. Parallelism is achieved on distributed memory systems with multiple copies of the parallel program running on different nodes, sending messages to each other to coordinate computations. The messages used in a distributed memory parallel program typically contain application data, synchronization information, and other data that controls the execution of the parallel program
Task-aware LPF: integrating a model-compliant communication layer with task-based programming models
The rapid advancement of high-performance computing (HPC) systems has led to the emergence of exascale computing, characterized by distributed memory nodes and high parallel computing capabilities. To effectively utilize these systems, the HPC community has embraced programming models that harness both inter-node and intra-node parallelism. Inter-node parallelism is typically addressed using distributed-memory programming models like MPI and GASPI, while intra-node parallelism is exploited through shared-memory programming models such as OpenMP and OmpSs-2. However, the two-sided communication model used in MPI, which requires both the sender and receiver processes to post an operation, can impose performance limitations due to the inherent synchronization. In contrast, one-sided communication models like GASPI and Lightweight Parallel Foundations (LPF) leverage modern network fabric features and remote direct memory access (RDMA) to efficiently exchange data in distributed memory systems without the need for explicit receive operations. In this project, we combine the Bulk Synchronous Parallel (BSP) model of LPF with the data-flow model of OmpSs-2 to exploit parallelism at both intra-node and inter-node levels. This approach maintains the simplicity of the BSP model and the performance of the data-flow model. By enabling optimal overlap between computation, communication, and synchronization phases, we effectively utilize available resources. The flexibility of the data-flow model allows for adjusting computation tasks that are not tightly bound to BSP model phases, facilitating early or delayed execution based on resource availability. To optimize the BSP model, new zero-cost synchronization methods are designed, improving performance and flexibility. These methods offer localized synchronization but require a fixed communication pattern or user-defined criteria, limiting programmability. Additionally, bi-directional communication is often required, necessitating the inclusion of empty messages in applications without bi-directional communication. Our implementation is evaluated against Task-Aware MPI (TAMPI), demonstrating that with a single coarse-grained synchronization primitive, we can still hide synchronization overheads and reach competitive performance. The results show that the zero-cost synchronization methods perform similarly to TAMPI, indicating that coarse synchronization is sufficient for iterative applications. The evaluation highlights the effectiveness of the proposed approach in improving performance and programmability in HPC applications
A shared memory algorithm and proof for the alternative construct in CSP
technical reportCommunicating Sequential Processes (CSP) is a paradigm for communication and synchronization among distributed processes. The alternative construct is a key feature of CSP that allows nondeterministic selection of one among several possible communicants. Previous algorithms for this construct assume a message passing architecture and are not appropriate for multiprocessor systems that feature shared memory. This paper describes a distributed algorithm for the alternative construct that exploits the capabilities of a parallel computer with shared memory. The algorithm assumes a generalized version of Hoare's original alternative construct that allows output commands to be included in guards. A correctness proof of the proposed algorithm is presented to show that the algorithm conforms to some safety and liveness criteria. Extensions to allow termination of processes and to ensure fairness in guard selection are also given. Keywords: communicating sequential processes; alternative operation; shared memory multiprocessor; parallel processing
Garbage Collection for General Graphs
Garbage collection is moving from being a utility to a requirement of every modern programming language. With multi-core and distributed systems, most programs written recently are heavily multi-threaded and distributed. Distributed and multi-threaded programs are called concurrent programs. Manual memory management is cumbersome and difficult in concurrent programs. Concurrent programming is characterized by multiple independent processes/threads, communication between processes/threads, and uncertainty in the order of concurrent operations. The uncertainty in the order of operations makes manual memory management of concurrent programs difficult. A popular alternative to garbage collection in concurrent programs is to use smart pointers. Smart pointers can collect all garbage only if developer identifies cycles being created in the reference graph. Smart pointer usage does not guarantee protection from memory leaks unless cycle can be detected as process/thread create them. General garbage collectors, on the other hand, can avoid memory leaks, dangling pointers, and double deletion problems in any programming environment without help from the programmer. Concurrent programming is used in shared memory and distributed memory systems. State of the art shared memory systems use a single concurrent garbage collector thread that processes the reference graph. Distributed memory systems have very few complete garbage collection algorithms and those that exist use global barriers, are centralized and do not scale well. This thesis focuses on designing garbage collection algorithms for shared memory and distributed memory systems that satisfy the following properties: concurrent, parallel, scalable, localized (decentralized), low pause time, high promptness, no global synchronization, safe, complete, and operates in linear time
Fault-Tolerant Adaptive Parallel and Distributed Simulation
Discrete Event Simulation is a widely used technique that is used to model
and analyze complex systems in many fields of science and engineering. The
increasingly large size of simulation models poses a serious computational
challenge, since the time needed to run a simulation can be prohibitively
large. For this reason, Parallel and Distributes Simulation techniques have
been proposed to take advantage of multiple execution units which are found in
multicore processors, cluster of workstations or HPC systems. The current
generation of HPC systems includes hundreds of thousands of computing nodes and
a vast amount of ancillary components. Despite improvements in manufacturing
processes, failures of some components are frequent, and the situation will get
worse as larger systems are built. In this paper we describe FT-GAIA, a
software-based fault-tolerant extension of the GAIA/ART\`IS parallel simulation
middleware. FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some
protection against byzantine failures since synchronization messages are
replicated as well, so that the receiving entity can identify and discard
corrupted messages. We provide an experimental evaluation of FT-GAIA on a
running prototype. Results show that a high degree of fault tolerance can be
achieved, at the cost of a moderate increase in the computational load of the
execution units.Comment: Proceedings of the IEEE/ACM International Symposium on Distributed
Simulation and Real Time Applications (DS-RT 2016
Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Load imbalance pervasively exists in distributed deep learning training
systems, either caused by the inherent imbalance in learned tasks or by the
system itself. Traditional synchronous Stochastic Gradient Descent (SGD)
achieves good accuracy for a wide variety of tasks, but relies on global
synchronization to accumulate the gradients at every training step. In this
paper, we propose eager-SGD, which relaxes the global synchronization for
decentralized accumulation. To implement eager-SGD, we propose to use two
partial collectives: solo and majority. With solo allreduce, the faster
processes contribute their gradients eagerly without waiting for the slower
processes, whereas with majority allreduce, at least half of the participants
must contribute gradients before continuing, all without using a central
parameter server. We theoretically prove the convergence of the algorithms and
describe the partial collectives in detail. Experimental results on
load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show
that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous
SGD, without losing accuracy.Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 202
Lock-free Concurrent Data Structures
Concurrent data structures are the data sharing side of parallel programming.
Data structures give the means to the program to store data, but also provide
operations to the program to access and manipulate these data. These operations
are implemented through algorithms that have to be efficient. In the sequential
setting, data structures are crucially important for the performance of the
respective computation. In the parallel programming setting, their importance
becomes more crucial because of the increased use of data and resource sharing
for utilizing parallelism.
The first and main goal of this chapter is to provide a sufficient background
and intuition to help the interested reader to navigate in the complex research
area of lock-free data structures. The second goal is to offer the programmer
familiarity to the subject that will allow her to use truly concurrent methods.Comment: To appear in "Programming Multi-core and Many-core Computing
Systems", eds. S. Pllana and F. Xhafa, Wiley Series on Parallel and
Distributed Computin
- …