
7,298 research outputs found

A Conflict-Resilient Lock-Free Calendar Queue for Scalable Share-Everything PDES Platforms

Authors: Ianni Mauro, Marotta Romolo, Pellegrini Alessandro, Quaglia Francesco
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2017

Emerging share-everything Parallel Discrete Event Simulation (PDES) platforms rely on worker threads fully sharing the workload of events to be processed. These platforms require efficient event pool data structures enabling high concurrency of extraction/insertion operations. Non-blocking event pool algorithms are emerging as promising solutions to this problem. However, the classical non-blocking paradigm forces concurrent conflicting operations, acting on the same portion of the event pool data structure, to abort and then retry. In this article we present a conflict-resilient non-blocking calendar queue that enables conflicting dequeue operations, concurrently attempting to extract the minimum element, to survive, thus improving the scalability of accesses to the hot portion of the data structure, namely the bucket to which the current locality of the events to be processed is bound. We have integrated our solution within an open source share-everything PDES platform and report the results of an experimental analysis of the proposed concurrent data structure compared to some solutions from the literature.
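For orientation, the following is a minimal sequential sketch of a classic calendar queue, the bucketed priority structure the abstract builds on. It is not the authors' lock-free algorithm: there is no concurrency, no conflict resilience, and no bucket resizing. The class name `CalendarQueue` and the parameters `num_buckets` and `bucket_width` are assumptions chosen for illustration.

```python
import bisect
import itertools


class CalendarQueue:
    """Sequential calendar queue sketch (hypothetical names, not the paper's code)."""

    def __init__(self, num_buckets=8, bucket_width=1.0):
        self.num_buckets = num_buckets
        self.bucket_width = bucket_width
        self.buckets = [[] for _ in range(num_buckets)]
        self.current = 0                 # unwrapped index of the bucket being scanned
        self.size = 0
        self._seq = itertools.count()    # tie-breaker so payloads are never compared

    def __len__(self):
        return self.size

    def _virtual(self, timestamp):
        # Unwrapped bucket index: which "day" of the calendar the event falls on.
        return int(timestamp // self.bucket_width)

    def enqueue(self, timestamp, payload=None):
        v = self._virtual(timestamp)
        entry = (timestamp, next(self._seq), payload)
        bisect.insort(self.buckets[v % self.num_buckets], entry)
        # Never let the scan position sit past a pending event.
        self.current = min(self.current, v)
        self.size += 1

    def dequeue(self):
        if self.size == 0:
            raise IndexError("dequeue from empty calendar queue")
        # Scan at most one full lap of the calendar.
        for _ in range(self.num_buckets):
            bucket = self.buckets[self.current % self.num_buckets]
            # The bucket head is the global minimum only if it belongs
            # to the day currently being scanned (not a later "year").
            if bucket and self._virtual(bucket[0][0]) == self.current:
                timestamp, _seq, payload = bucket.pop(0)
                self.size -= 1
                return timestamp, payload
            self.current += 1
        # Nothing found in a full lap: jump straight to the earliest event.
        earliest = min(b[0][0] for b in self.buckets if b)
        self.current = self._virtual(earliest)
        return self.dequeue()
```

In this sketch the bucket indexed by `current` plays the role of the "hot" bucket mentioned in the abstract: every dequeue targets it first, which is why concurrent dequeues would all contend on the same portion of the structure.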

Crossref

ART

Archivio della ricerca- Università di Roma La Sapienza

BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing

Authors: Rasnayake Lahiru, Sjalander Magnus, Umuroglu Yaman
Publication date: 01/01/2018

Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lend themselves well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. We present BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing. BISMO utilizes the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We characterize the resource usage and performance of BISMO across a range of parameters to build a hardware cost model, and demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board. Comment: To appear at FPL'1
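The general idea of bit-serial matrix multiplication can be sketched in software as follows. This is an illustrative sketch for unsigned integers only, not BISMO's hardware design: each operand matrix is split into binary bit planes, every pair of planes is multiplied using AND plus a population count, and the partial products are combined with a shift by the sum of the two bit positions. The function names and the `a_bits`/`b_bits` parameters are assumptions made for this example.

```python
def bit_planes(matrix, bits):
    """Split a matrix of unsigned ints into `bits` 0/1 matrices, LSB first."""
    return [[[(x >> b) & 1 for x in row] for row in matrix] for b in range(bits)]


def binary_matmul(a_plane, b_plane):
    """Multiply two 0/1 matrices; each dot product is an AND followed by a count."""
    b_cols = list(zip(*b_plane))
    return [[sum(x & y for x, y in zip(row, col)) for col in b_cols]
            for row in a_plane]


def bit_serial_matmul(a, b, a_bits, b_bits):
    """Compute a @ b as a weighted sum of a_bits * b_bits binary matmuls.

    Assumes every entry of `a` fits in `a_bits` bits and every entry of `b`
    fits in `b_bits` bits (unsigned).
    """
    rows, cols = len(a), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i, a_plane in enumerate(bit_planes(a, a_bits)):
        for j, b_plane in enumerate(bit_planes(b, b_bits)):
            partial = binary_matmul(a_plane, b_plane)
            for r in range(rows):
                for c in range(cols):
                    # Plane i of a and plane j of b carry weight 2**(i + j).
                    result[r][c] += partial[r][c] << (i + j)
    return result
```

In this sketch the number of binary matrix products is `a_bits * b_bits`, so lowering the operand precision directly reduces the work, which is the sense in which cost scales with the required precision.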

arXiv.org e-Print Archive

Crossref

NORA - Norwegian Open Research Archives

A Scalable, Portable, and Memory-Efficient Lock-Free FIFO Queue

Author: Nikolaev Ruslan
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 33rd International Symposium on Distributed Computing (DISC 2019)
Publication date: 01/01/2019

We present a new lock-free multiple-producer and multiple-consumer (MPMC) FIFO queue design which is scalable and, unlike existing high-performance queues, very memory efficient. Moreover, the design is ABA safe and does not require any external memory allocators or safe memory reclamation techniques, typically needed by other scalable designs. In fact, this queue itself can be leveraged for object allocation and reclamation, as in data pools. We use FAA (fetch-and-add), a specialized instruction that is more scalable than CAS (compare-and-set), on the most contended hot spots of the algorithm. However, unlike prior attempts with FAA, our queue is both lock-free and linearizable. We propose a general approach, SCQ, for bounded queues. This approach can easily be extended to support unbounded FIFO queues which can store an arbitrary number of elements. SCQ is portable across virtually all existing architectures and flexible enough for a wide variety of uses. We measure the performance of our algorithm on the x86-64 and PowerPC architectures. Our evaluation validates that our queue has exceptional memory efficiency compared to other algorithms and that its performance is often comparable to, or exceeds, that of state-of-the-art scalable algorithms.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Are Lock-Free Concurrent Algorithms Practically Wait-Free?

Authors: Alistarh Dan, Censor-Hillel Keren, Shavit Nir
Publication date: 15/11/2013

Lock-free concurrent algorithms guarantee that some concurrent operation will always make progress in a finite number of steps. Yet programmers prefer to treat concurrent code as if it were wait-free, guaranteeing that all operations always make progress. Unfortunately, designing wait-free algorithms is generally a very complex task, and the resulting algorithms are not always efficient. While obtaining efficient wait-free algorithms has been a long-time goal for the theory community, most non-blocking commercial code is only lock-free. This paper suggests a simple solution to this problem. We show that, for a large class of lock-free algorithms, under scheduling conditions which approximate those found in commercial hardware architectures, lock-free algorithms behave as if they are wait-free. In other words, programmers can keep on designing simple lock-free algorithms instead of complex wait-free ones, and in practice, they will get wait-free progress. Our main contribution is a new way of analyzing a general class of lock-free algorithms under a stochastic scheduler. Our analysis relates the individual performance of processes with the global performance of the system using Markov chain lifting between a complex per-process chain and a simpler system progress chain. We show that lock-free algorithms are not only wait-free with probability 1, but that in fact a general subset of lock-free algorithms can be closely bounded in terms of the average number of steps required until an operation completes. To the best of our knowledge, this is the first attempt to analyze progress conditions, typically stated in relation to a worst case adversary, in a stochastic model capturing their expected asymptotic behavior. Comment: 25 page
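The intuition behind the stochastic-scheduler claim can be illustrated with a toy simulation. This is only a sketch under assumed conditions, not the paper's Markov-chain analysis: a uniform random scheduler picks one process per step, and each process repeatedly performs a read followed by a compare-and-set on a shared counter. The function `simulate` and its parameters are hypothetical names introduced here.

```python
import random


def simulate(num_procs=4, total_steps=20_000, seed=0):
    """Run a read-then-CAS increment loop under a uniform random scheduler.

    Returns the final counter value and the number of operations each
    process completed.
    """
    rng = random.Random(seed)
    shared = 0
    # Value each process last read, or None if its next step is a read.
    local_view = [None] * num_procs
    completed = [0] * num_procs

    for _ in range(total_steps):
        p = rng.randrange(num_procs)       # scheduler picks a process uniformly
        if local_view[p] is None:
            local_view[p] = shared         # step 1: read the shared counter
        else:
            if local_view[p] == shared:    # step 2: CAS succeeds only if unchanged
                shared += 1
                completed[p] += 1
            local_view[p] = None           # success or failure, restart with a read
    return shared, completed
```

A failed CAS in this model always means some other process succeeded in between, which is the lock-free guarantee. Under the random scheduler no single process is starved indefinitely in practice, so the per-process counts in `completed` tend to be roughly balanced.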

arXiv.org e-Print Archive

CiteSeerX

DSpace@MIT

Crossref

Resolutions of the Coulomb operator: VI. Computation of auxiliary integrals

Authors: Gill Peter M. W., Hollett Joshua W., Limpanuparb Taweetham
Publication venue: AIP Publishing
Publication date: 08/01/2013

We discuss the efficient computation of the auxiliary integrals that arise when resolutions of two-electron operators (specifically, the Coulomb and long-range Ewald operators) are employed in quantum chemical calculations. We derive a recurrence relation that facilitates the generation of auxiliary integrals for Gaussian basis functions of arbitrary angular momentum and propose a near-optimal algorithm for its use.

arXiv.org e-Print Archive

The Australian National University