7,298 research outputs found
A Conflict-Resilient Lock-Free Calendar Queue for Scalable Share-Everything PDES Platforms
Emerging share-everything Parallel Discrete Event Simulation (PDES) platforms rely on worker threads fully sharing the workload of events to be processed. These platforms require efficient event pool data structures that enable highly concurrent extraction and insertion operations. Non-blocking event pool algorithms are emerging as promising solutions to this problem. However, the classical non-blocking paradigm forces concurrent conflicting operations, acting on the same portion of the event pool data structure, to abort and then retry. In this article we present a conflict-resilient non-blocking calendar queue that enables conflicting dequeue operations, concurrently attempting to extract the minimum element, to survive. This improves the scalability of accesses to the hot portion of the data structure, namely the bucket to which the current locality of the events to be processed is bound. We have integrated our solution within an open source share-everything PDES platform and report the results of an experimental analysis of the proposed concurrent data structure compared with several solutions from the literature.
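For readers unfamiliar with the underlying structure, the following is a minimal sketch of a plain sequential calendar queue, not the conflict-resilient lock-free variant the abstract describes. The class name, the default parameters, and the assumption of non-negative timestamps are illustrative choices only. Events are hashed into buckets by timestamp, and dequeue scans forward from the current bucket, which is the "hot" bucket the abstract refers to.

```python
import bisect


class CalendarQueue:
    """Sequential calendar queue sketch (hypothetical names, no concurrency)."""

    def __init__(self, num_buckets=8, bucket_width=1.0):
        self.num_buckets = num_buckets
        self.width = bucket_width
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0
        self.current = 0  # unwrapped index of the bucket being drained

    def _virtual_bucket(self, ts):
        # Assumes non-negative timestamps.
        return int(ts // self.width)

    def enqueue(self, ts, event):
        vb = self._virtual_bucket(ts)
        # Each bucket is kept sorted by (timestamp, event).
        bisect.insort(self.buckets[vb % self.num_buckets], (ts, event))
        self.size += 1
        if vb < self.current:
            self.current = vb  # an earlier event arrived: move back

    def dequeue(self):
        if self.size == 0:
            raise IndexError("dequeue from an empty calendar queue")
        # Scan at most one full "year" of buckets from the current one.
        for _ in range(self.num_buckets):
            bucket = self.buckets[self.current % self.num_buckets]
            if bucket and self._virtual_bucket(bucket[0][0]) == self.current:
                self.size -= 1
                return bucket.pop(0)
            self.current += 1
        # Nothing due within a year: jump straight to the global minimum.
        ts, _ = min(b[0] for b in self.buckets if b)
        self.current = self._virtual_bucket(ts)
        self.size -= 1
        return self.buckets[self.current % self.num_buckets].pop(0)
```

In this sketch every dequeue touches the same current bucket, which hints at why concurrent extractions of the minimum element tend to conflict on that portion of the structure.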
BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing
Matrix-matrix multiplication is a key computational kernel for numerous
applications in science and engineering, with ample parallelism and data
locality that lends itself well to high-performance implementations. Many
matrix multiplication-dependent applications can use reduced-precision integer
or fixed-point representations to increase their performance and energy
efficiency while still offering adequate quality of results. However, precision
requirements may vary between different application phases or depend on input
data, rendering constant-precision solutions ineffective. We present BISMO, a
vectorized bit-serial matrix multiplication overlay for reconfigurable
computing. BISMO utilizes the excellent binary-operation performance of FPGAs
to offer a matrix multiplication performance that scales with required
precision and parallelism. We characterize the resource usage and performance
of BISMO across a range of parameters to build a hardware cost model, and
demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board.
Comment: To appear at FPL'1
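The general idea of bit-serial matrix multiplication can be sketched in software as follows. This is only an illustration of the arithmetic for unsigned integers, not BISMO's hardware design; the function names and the explicit `a_bits`/`b_bits` parameters are assumptions made for the example. Each operand is split into binary bit planes, each pair of planes is multiplied with AND and popcount, and the partial products are weighted by powers of two.

```python
def bit_planes(matrix, bits):
    """Split a matrix of unsigned ints into `bits` binary matrices, LSB first."""
    return [[[(v >> b) & 1 for v in row] for row in matrix] for b in range(bits)]


def binary_matmul(a, b):
    """Multiply two 0/1 matrices: each dot product is an AND plus a popcount."""
    cols = list(zip(*b))
    return [[sum(x & y for x, y in zip(row, col)) for col in cols] for row in a]


def bit_serial_matmul(a, b, a_bits, b_bits):
    """Compute a @ b for unsigned integer matrices one bit-plane pair at a time."""
    rows, cols = len(a), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    a_planes = bit_planes(a, a_bits)
    b_planes = bit_planes(b, b_bits)
    for i, a_plane in enumerate(a_planes):
        for j, b_plane in enumerate(b_planes):
            partial = binary_matmul(a_plane, b_plane)
            # A product of bit i and bit j carries weight 2**(i + j).
            for r in range(rows):
                for c in range(cols):
                    result[r][c] += partial[r][c] << (i + j)
    return result
```

The number of binary matrix products is `a_bits * b_bits`, so the work grows with the precision requested, which matches the abstract's claim that performance scales with required precision.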
A Scalable, Portable, and Memory-Efficient Lock-Free FIFO Queue
We present a new lock-free multiple-producer and multiple-consumer (MPMC) FIFO queue design which is scalable and, unlike existing high-performance queues, very memory efficient. Moreover, the design is ABA safe and does not require any external memory allocators or safe memory reclamation techniques, which are typically needed by other scalable designs. In fact, this queue itself can be leveraged for object allocation and reclamation, as in data pools. We use FAA (fetch-and-add), a specialized instruction that is more scalable than CAS (compare-and-set), on the most contended hot spots of the algorithm. However, unlike prior attempts with FAA, our queue is both lock-free and linearizable.
We propose a general approach, SCQ, for bounded queues. This approach can easily be extended to support unbounded FIFO queues which can store an arbitrary number of elements. SCQ is portable across virtually all existing architectures and flexible enough for a wide variety of uses. We measure the performance of our algorithm on the x86-64 and PowerPC architectures. Our evaluation validates that our queue has exceptional memory efficiency compared to other algorithms, and that its performance is often comparable to, or exceeds, that of state-of-the-art scalable algorithms.
Are Lock-Free Concurrent Algorithms Practically Wait-Free?
Lock-free concurrent algorithms guarantee that some concurrent operation will
always make progress in a finite number of steps. Yet programmers prefer to
treat concurrent code as if it were wait-free, guaranteeing that all operations
always make progress. Unfortunately, designing wait-free algorithms is
generally a very complex task, and the resulting algorithms are not always
efficient. While obtaining efficient wait-free algorithms has been a long-time
goal for the theory community, most non-blocking commercial code is only
lock-free.
This paper suggests a simple solution to this problem. We show that, for a
large class of lock-free algorithms, under scheduling conditions which
approximate those found in commercial hardware architectures, lock-free
algorithms behave as if they are wait-free. In other words, programmers can
keep on designing simple lock-free algorithms instead of complex wait-free
ones, and in practice, they will get wait-free progress.
Our main contribution is a new way of analyzing a general class of lock-free
algorithms under a stochastic scheduler. Our analysis relates the individual
performance of processes with the global performance of the system using Markov
chain lifting between a complex per-process chain and a simpler system progress
chain. We show that lock-free algorithms are not only wait-free with
probability 1, but that in fact a general subset of lock-free algorithms can be
closely bounded in terms of the average number of steps required until an
operation completes.
To the best of our knowledge, this is the first attempt to analyze progress
conditions, typically stated in relation to a worst case adversary, in a
stochastic model capturing their expected asymptotic behavior.
Comment: 25 page
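The setting can be illustrated with a toy simulation, offered here as a rough sketch rather than the paper's model or analysis. It assumes a uniform random scheduler and a hypothetical read-then-CAS counter increment as the lock-free operation; the function name and parameters are made up for the example.

```python
import random


def simulate(num_procs=4, steps=10_000, seed=0):
    """Run a lock-free counter increment under a uniform random scheduler."""
    rng = random.Random(seed)
    counter = 0
    # Value each process last read, or None if its next step is a read.
    snapshot = [None] * num_procs
    completed = [0] * num_procs
    for _ in range(steps):
        p = rng.randrange(num_procs)  # scheduler picks a process uniformly
        if snapshot[p] is None:
            snapshot[p] = counter  # step 1: read the shared counter
        else:
            if snapshot[p] == counter:  # step 2: CAS succeeds if unchanged
                counter += 1
                completed[p] += 1
            snapshot[p] = None  # either start a new operation or retry
    return counter, completed
```

A failed CAS implies that some other process succeeded in the meantime, which is the lock-free guarantee of system-wide progress. Inspecting `completed` for different seeds gives an informal feel for how evenly individual processes progress under such a scheduler.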
Resolutions of the Coulomb operator: VI. Computation of auxiliary integrals
We discuss the efficient computation of the auxiliary integrals that arise
when resolutions of two-electron operators (specifically, the Coulomb and
long-range Ewald operators) are employed in quantum chemical calculations. We
derive a recurrence relation that facilitates the generation of auxiliary
integrals for Gaussian basis functions of arbitrary angular momentum and
propose a near-optimal algorithm for its use.