Extending Transactional Memory with Atomic Deferral
This paper introduces atomic deferral, an extension to TM that allows programmers to move long-running or irrevocable operations out of a transaction while maintaining serializability: the transaction and its deferred operation appear to execute atomically from the perspective of other transactions. Thus, programmers can adapt lock-based programs to exploit TM with relatively little effort and without sacrificing scalability by atomically deferring the problematic operations. We demonstrate this with several use cases for atomic deferral, as well as an in-depth analysis of its use on the PARSEC dedup benchmark, where we show that atomic deferral enables TM to be competitive with well-designed lock-based code.
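The mechanism above can be approximated in ordinary C++. The sketch below is our own simplification, not the paper's API: a mutex stands in for the TM runtime, and a second mutex, acquired before commit and released only after the deferred work finishes, keeps the transaction and its deferred operation ordered as one atomic unit with respect to later conflicting transactions, while letting the slow operation run outside the critical section.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <vector>

std::mutex tm_lock;         // stands in for the transaction itself
std::mutex defer_lock;      // serializes "commit + deferred operation" pairs
int shared_counter = 0;
std::vector<int> log_sink;  // stands in for a slow or irrevocable channel

void txn_increment_and_log() {
    std::vector<std::function<void()>> deferred;
    tm_lock.lock();                    // "transaction" begins
    ++shared_counter;
    int snapshot = shared_counter;
    deferred.push_back([snapshot] {    // deferred: runs only after commit
        log_sink.push_back(snapshot);  // the long-running operation
    });
    defer_lock.lock();                 // taken before commit...
    tm_lock.unlock();                  // commit: other transactions may run now
    for (auto& op : deferred) op();    // ...so deferred ops keep commit order
    defer_lock.unlock();               // atomic unit ends
}
```

Transactions that never defer work only contend on `tm_lock`, which is where the scalability benefit of moving the slow operation out of the transaction comes from in this toy model.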
ActiveMonitor: Asynchronous Monitor Framework for Scalability and Multi-Object Synchronization
Monitor objects are used extensively for thread-safety and synchronization in shared memory parallel programs. They provide ease of use, and enable straightforward correctness analysis. However, they inhibit parallelism by enforcing serial execution of critical sections, and thus the performance of parallel programs with monitors scales poorly with the number of processes. Their current design and implementation is also ill-suited for thread synchronization across multiple thread-safe objects. We present ActiveMonitor - a framework that allows multi-object synchronization without global locks, and improves parallelism by exploiting asynchronous execution of critical sections. We evaluate the performance of a Java-based implementation of ActiveMonitor on micro-benchmarks involving light and heavy critical sections, as well as on the single-source shortest-path problem in directed graphs. Our results show that on most of these problems, ActiveMonitor-based programs outperform programs implemented using Java's reentrant-lock and condition constructs.
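The core idea, asynchronous execution of critical sections, resembles the classic active-object pattern. The sketch below is our own minimal C++ rendering (names and structure are ours, not ActiveMonitor's): each monitor owns a worker thread; callers enqueue a critical section and immediately receive a future, so the caller never blocks inside the monitor.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

class ActiveMonitorSketch {
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread worker_;
public:
    ActiveMonitorSketch() : worker_([this] {
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
            if (tasks_.empty()) return;      // shut down once drained
            auto t = std::move(tasks_.front());
            tasks_.pop();
            lk.unlock();
            t();                             // critical section runs here,
        }                                    // serially, on the worker thread
    }) {}
    ~ActiveMonitorSketch() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    // Submit a critical section; the caller continues immediately and can
    // later wait on the future for the result.
    template <class F>
    auto submit(F f) -> std::future<decltype(f())> {
        auto task = std::make_shared<std::packaged_task<decltype(f())()>>(std::move(f));
        auto fut = task->get_future();
        { std::lock_guard<std::mutex> lk(m_); tasks_.push([task] { (*task)(); }); }
        cv_.notify_one();
        return fut;
    }
};
```

Because each monitor has its own queue and worker, a thread can submit work to several monitors without holding a global lock, which is a (simplified) analogue of the multi-object synchronization described above.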
Efficient Race Detection with Futures
This paper addresses the problem of provably efficient and practically good on-the-fly determinacy race detection in task parallel programs that use futures. Prior work on determinacy race detection has mostly focused on either task parallel programs that follow a series-parallel dependence structure or ones with unrestricted use of futures that generate arbitrary dependences. In this work, we consider a restricted use of futures and show that it can be race detected more efficiently than general use of futures.
Specifically, we present two algorithms: MultiBags and MultiBags+. MultiBags targets programs that use futures in a restricted fashion and runs in time O(T₁ α(m, n)), where T₁ is the sequential running time of the program, α is the inverse Ackermann's function, m is the total number of memory accesses, and n is the dynamic count of places at which parallelism is created. Since α is a very slowly growing function (upper bounded by 4 for all practical purposes), it can be treated as a close-to-constant overhead. MultiBags+ is an extension of MultiBags that targets programs with general use of futures. It runs in time O((T₁ + k²) α(m, n)), where T₁, α, m, and n are defined as before, and k is the number of future operations in the computation. We implemented both algorithms and empirically demonstrate their efficiency.
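The α(m, n) factor in bounds like these typically comes from a disjoint-set (union-find) structure with union by rank and path compression, whose amortized cost per operation is the inverse Ackermann function. The sketch below shows that textbook structure only; it is not the MultiBags algorithm itself, which builds additional bookkeeping on top of it.

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Textbook union-find: amortized O(alpha(m, n)) per operation, the source of
// the near-linear running times of bags-style race detectors.
struct DisjointSet {
    std::vector<int> parent, rank_;
    explicit DisjointSet(int n) : parent(n), rank_(n, 0) {
        std::iota(parent.begin(), parent.end(), 0);  // each node is its own set
    }
    int find(int x) {                                // with path compression
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];           // halve the path
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) {                       // union by rank
        a = find(a); b = find(b);
        if (a == b) return;
        if (rank_[a] < rank_[b]) std::swap(a, b);
        parent[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
    }
};
```

In an SP-bags-style detector, strands are merged into "bags" as the program runs, and a race check reduces to a `find` query: whether the strand that last accessed a location sits in a bag that is serial or parallel with respect to the current strand.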
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse by multiple
proprietary vendor implementations with different characteristics and, thus,
different required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
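The effect of parallel region formation can be illustrated in plain C++. The example below is illustrative only: it shows how an OpenCL-style kernel body `c[gid] = a[gid] + b[gid]` can be serialized into a per-work-group function by wrapping it in a loop over local work-item ids. pocl performs an analogous transformation on LLVM IR, where the loop is marked as parallel so later passes can map it to SIMD lanes, a vector datapath, or static multi-issue.

```cpp
#include <cassert>
#include <cstddef>

// Per-work-group form of a data-parallel vector-add kernel. Each iteration of
// the "work-item loop" corresponds to one work-item of the group; iterations
// are independent, so a backend may vectorize or unroll them freely.
void vec_add_workgroup(const float* a, const float* b, float* c,
                       std::size_t group_id, std::size_t local_size) {
    for (std::size_t lid = 0; lid < local_size; ++lid) {   // work-item loop
        std::size_t gid = group_id * local_size + lid;     // global id
        c[gid] = a[gid] + b[gid];                          // original kernel body
    }
}
```

Keeping the work-item loop explicit (rather than flattening it at the source level) is what lets the target-specific mapping stage choose the parallelization style per platform.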
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of the benchmarked applications, when compiled using pocl, were faster than or close to as fast as the best proprietary OpenCL implementation for the platform at hand.
Comment: This article was published in 2015; it is now openly accessible via arXiv.
Fast Nonblocking Persistence for Concurrent Data Structures
We present a fully lock-free variant of our recent Montage system for persistent data structures. The variant, nbMontage, adds persistence to almost any nonblocking concurrent structure without introducing significant overhead or blocking of any kind. Like its predecessor, nbMontage is buffered durably linearizable: it guarantees that the state recovered in the wake of a crash will represent a consistent prefix of pre-crash execution. Unlike its predecessor, nbMontage ensures wait-free progress of the persistence frontier, thereby bounding the number of recent updates that may be lost on a crash, and allowing a thread to force an update of the frontier (i.e., to perform a sync operation) without the risk of blocking. As an extra benefit, the helping mechanism employed by our wait-free sync significantly reduces its latency.
Performance results for nonblocking queues, skip lists, trees, and hash tables rival those of custom data structures in the literature - dramatically faster than those achieved with prior general-purpose systems, and generally within 50% of equivalent non-persistent structures placed in DRAM.
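Buffered durable linearizability can be modeled with epochs. The toy below is our simplification, not nbMontage's real mechanism (which is nonblocking and NVM-based): updates are tagged with the current epoch and buffered; `sync()` advances the epoch (the persistence frontier) and "persists" every update from closed epochs. After a crash, only whole persisted epochs survive, so the recovered state is a consistent prefix of the pre-crash execution.

```cpp
#include <atomic>
#include <cassert>
#include <utility>
#include <vector>

struct EpochPersister {
    std::atomic<unsigned> epoch{1};                   // current open epoch
    std::vector<std::pair<unsigned, int>> buffered;   // (epoch, payload)
    std::vector<int> persisted;                       // stands in for NVM

    void record(int payload) {                        // an update: buffered,
        buffered.push_back({epoch.load(), payload});  // not yet durable
    }
    void sync() {                                     // advance the frontier
        unsigned cutoff = epoch.fetch_add(1);         // close current epoch
        std::vector<std::pair<unsigned, int>> keep;
        for (auto& e : buffered) {
            if (e.first <= cutoff) persisted.push_back(e.second);  // to "NVM"
            else keep.push_back(e);
        }
        buffered.swap(keep);
    }
};
```

The frontier bounds loss: at most the updates of the still-open epoch can vanish on a crash, and a thread that needs durability now calls `sync()` rather than waiting for a background flush.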
A Comprehensive Survey on Distributed Training of Graph Neural Networks
Graph neural networks (GNNs) have been demonstrated to be a powerful
algorithmic model in broad application fields for their effectiveness in
learning over graphs. To scale GNN training up for large-scale and ever-growing
graphs, the most promising solution is distributed training which distributes
the workload of training across multiple computing nodes. At present, the
volume of related research on distributed GNN training is exceptionally vast,
accompanied by an extraordinarily rapid pace of publication. Moreover, the
approaches reported in these studies exhibit significant divergence. This
situation poses a considerable challenge for newcomers, hindering their ability
to grasp a comprehensive understanding of the workflows, computational
patterns, communication strategies, and optimization techniques employed in
distributed GNN training. As a result, there is a pressing need for a survey to
provide correct recognition, analysis, and comparisons in this field. In this
paper, we provide a comprehensive survey of distributed GNN training by
investigating various optimization techniques used in distributed GNN training.
First, distributed GNN training is classified into several categories according
to their workflows. In addition, their computational patterns and communication
patterns, as well as the optimization techniques proposed by recent work are
introduced. Second, the software frameworks and hardware platforms of
distributed GNN training are also introduced for a deeper understanding. Third,
distributed GNN training is compared with distributed training of deep neural
networks, emphasizing the uniqueness of distributed GNN training. Finally,
interesting issues and opportunities in this field are discussed.
Comment: To appear in Proceedings of the IEEE.
Tailoring Transactional Memory to Real-World Applications
Transactional Memory (TM) promises to provide a scalable mechanism for synchronization in concurrent programs, and to offer ease-of-use benefits to programmers. Since multiprocessor architectures have dominated CPU design, exploiting parallelism in program…
Easier Parallel Programming with Provably-Efficient Runtime Schedulers
Over the past decade processor manufacturers have pivoted from increasing uniprocessor performance to multicore architectures. However, utilizing this computational power has proved challenging for software developers. Many concurrency platforms and languages have emerged to address parallel programming challenges, yet writing correct and performant parallel code retains a reputation of being one of the hardest tasks a programmer can undertake.
This dissertation will study how runtime scheduling systems can be used to make parallel programming easier. We address the difficulty in writing parallel data structures, automatically finding shared memory bugs, and reproducing non-deterministic synchronization bugs. Each of the systems presented depends on a novel runtime system that provides strong theoretical performance guarantees and performs well in practice.