
    Brief Announcement: Jiffy: A Fast, Memory Efficient, Wait-Free Multi-Producers Single-Consumer Queue

    In applications such as sharded data processing systems, data-flow programming, and load-sharing applications, multiple concurrent data producers feed requests into the same data consumer. This can be naturally realized through concurrent queues, where each consumer pulls its tasks from its dedicated queue. For scalability, wait-free queues are often preferred over lock-based structures. The vast majority of wait-free queue implementations, and even lock-free ones, support the multi-producer multi-consumer model. Yet this comes at a premium, since implementing wait-free multi-producer multi-consumer queues requires complex helper data structures. The latter increase the memory consumption of such queues and limit their performance and scalability. Additionally, many such designs employ (hardware) cache-unfriendly memory access patterns. In this work, we study the implementation of wait-free multi-producer single-consumer queues. Specifically, we propose Jiffy, a novel, efficient, and memory-frugal wait-free multi-producer single-consumer queue, and formally prove its correctness. We then compare the performance and memory requirements of Jiffy with other state-of-the-art lock-free and wait-free queues. We show that Jiffy maintains good performance with up to 128 threads, delivers better throughput than the other constructions we compared against, and consumes less memory.
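
    To make the multi-producer single-consumer setting concrete, the sketch below shows the slot-reservation idea that wait-free MPSC enqueues can build on: each producer claims a slot with a single fetch-and-add and then publishes into it, while the lone consumer scans slots in order. This is an illustrative bounded toy in Java (the class name and layout are ours), not Jiffy's actual unbounded, memory-frugal algorithm.

        import java.util.concurrent.atomic.AtomicInteger;
        import java.util.concurrent.atomic.AtomicReferenceArray;

        final class BoundedMpscQueue<T> {
            private final AtomicReferenceArray<T> slots;
            private final AtomicInteger tail = new AtomicInteger(0); // next slot to reserve
            private int head = 0;                                    // touched only by the consumer

            BoundedMpscQueue(int capacity) { slots = new AtomicReferenceArray<>(capacity); }

            // Wait-free for producers: one fetch-and-add reserves a slot, one store publishes.
            boolean enqueue(T item) {
                int i = tail.getAndIncrement();
                if (i >= slots.length()) return false; // full (this sketch omits wrap-around)
                slots.set(i, item);                    // volatile store makes the item visible
                return true;
            }

            // Single-consumer dequeue: null means empty or the next slot is not yet published.
            T dequeue() {
                if (head >= slots.length()) return null;
                T item = slots.get(head);
                if (item == null) return null;         // slot reserved but item not published yet
                head++;
                return item;
            }
        }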

    Single-Producer/Single-Consumer Queues on Shared Cache Multi-Core Systems

    Using efficient point-to-point communication channels is critical for implementing fine-grained parallel programs on modern shared-cache multi-core architectures. This report discusses in detail several implementations of wait-free Single-Producer/Single-Consumer (SPSC) queues and presents a novel, efficient algorithm for the implementation of an unbounded wait-free SPSC queue (uSPSC). The correctness proof of the new algorithm and several performance measurements, based on a simple synthetic benchmark and a microbenchmark, are also discussed.
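
    As a reference point for the bounded case, here is the classic Lamport-style wait-free SPSC ring buffer in Java; unbounded designs like the report's uSPSC are, roughly, built by chaining bounded buffers of this kind. The class name and capacity handling below are ours, not the report's code.

        import java.util.concurrent.atomic.AtomicLong;

        final class SpscRingBuffer<T> {
            private final Object[] buffer;
            private final AtomicLong head = new AtomicLong(0); // consumer position
            private final AtomicLong tail = new AtomicLong(0); // producer position

            SpscRingBuffer(int capacity) { buffer = new Object[capacity]; }

            boolean offer(T item) {                            // producer thread only
                long t = tail.get();
                if (t - head.get() == buffer.length) return false; // full
                buffer[(int) (t % buffer.length)] = item;
                tail.lazySet(t + 1);                           // ordered store publishes to consumer
                return true;
            }

            @SuppressWarnings("unchecked")
            T poll() {                                         // consumer thread only
                long h = head.get();
                if (h == tail.get()) return null;              // empty
                T item = (T) buffer[(int) (h % buffer.length)];
                buffer[(int) (h % buffer.length)] = null;      // drop reference for GC
                head.lazySet(h + 1);
                return item;
            }
        }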

    Optimized BZIP2 compression using lock-free queues

    Since the general trend nowadays is to have more and more processors (cores) available in each computer, the scalability of the data structures used in parallel programs must be carefully considered in order to guarantee that they take full advantage of the available processors. Because of increased contention, lock-based data structures usually do not scale proportionally as the number of processors increases. The use of well-designed lock-free data structures, like first-in first-out (FIFO) queues, can boost the performance of a parallel program when many processors are available. In this work, a parallel version of bzip2, a popular compression and decompression program, is designed and implemented using lock-free queues instead of lock-based ones, and applying a two-buffer output strategy. The performance of the lock-free implementation is measured against lock-based implementations. Compression time was measured with different numbers of processors and different block sizes. Consistent with our hypothesis, the results show that our parallel lock-free implementation outperforms the other implementations.
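
    The pipeline shape described above can be sketched with java.util.concurrent's lock-free ConcurrentLinkedQueue handing blocks between stages: a reader submits numbered blocks, worker threads compress them, and a writer can reassemble output by sequence number. Block and compress() are placeholders; this is not the paper's bzip2 code or its two-buffer output strategy.

        import java.util.concurrent.ConcurrentLinkedQueue;
        import java.util.concurrent.atomic.AtomicBoolean;

        final class CompressStage {
            record Block(int seq, byte[] data) {}

            private final ConcurrentLinkedQueue<Block> in  = new ConcurrentLinkedQueue<>();
            private final ConcurrentLinkedQueue<Block> out = new ConcurrentLinkedQueue<>();
            private final AtomicBoolean done = new AtomicBoolean(false);

            void submit(Block b) { in.offer(b); }       // reader thread enqueues raw blocks
            void finish()        { done.set(true); }
            Block takeResult()   { return out.poll(); } // writer thread drains results

            void workerLoop() {                         // run by N compressor threads
                while (true) {
                    Block b = in.poll();
                    if (b == null) {
                        if (done.get() && in.isEmpty()) return; // input drained and closed
                        Thread.onSpinWait();
                        continue;
                    }
                    out.offer(new Block(b.seq(), compress(b.data())));
                }
            }

            private static byte[] compress(byte[] data) { return data; } // stand-in
        }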

    The java.util.concurrent synchronizer framework

    Most synchronizers (locks, barriers, etc.) in the J2SE 5.0 java.util.concurrent package are constructed using a small framework based on the class AbstractQueuedSynchronizer. This framework provides common mechanics for atomically managing synchronization state, blocking and unblocking threads, and queuing. The paper describes the rationale, design, implementation, usage, and performance of this framework.
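
    The canonical way to use the framework, as the paper itself shows, is to subclass AbstractQueuedSynchronizer and implement tryAcquire/tryRelease over the single synchronization-state word; AQS then supplies the queuing and blocking. A minimal non-reentrant mutex in that style:

        import java.util.concurrent.locks.AbstractQueuedSynchronizer;

        final class Mutex {
            private static final class Sync extends AbstractQueuedSynchronizer {
                @Override protected boolean tryAcquire(int ignored) {
                    return compareAndSetState(0, 1);  // atomically claim: 0 = unlocked, 1 = locked
                }
                @Override protected boolean tryRelease(int ignored) {
                    setState(0);                      // release; AQS unparks a queued waiter
                    return true;
                }
            }
            private final Sync sync = new Sync();

            public void lock()   { sync.acquire(1); } // parks in AQS's wait queue if contended
            public void unlock() { sync.release(1); }
        }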

    IST Austria Technical Report

    Linearizability requires that the outcome of calls by competing threads to a concurrent data structure is the same as in some sequential execution where each thread has exclusive access to the data structure. In an ordered data structure, such as a queue or a stack, linearizability is ensured by requiring that threads commit in the order dictated by the sequential semantics of the data structure; e.g., in a concurrent queue implementation, a dequeue can only remove the oldest element. In this paper, we investigate the impact of this strict ordering by comparing what linearizability allows to what existing implementations do. We first give an operational definition of linearizability, which allows us to build the most general linearizable implementation as a transition system for any given sequential specification. We then use this operational definition to categorize linearizable implementations based on whether they are bound or free. In a bound implementation, whenever all threads observe the same logical state, the updates to the logical state and the temporal order of commits coincide. All existing queue implementations we know of are bound. We then proceed to present, to the best of our knowledge, the first ever free queue implementation. Our experiments show that free implementations have the potential for better performance by suffering less from contention.
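
    A small runnable illustration of what linearizability allows: when two enqueues overlap with each other, a subsequent dequeue may legally return either element, because each outcome corresponds to some sequential order consistent with real-time precedence. This harness is ours, not the paper's operational construction.

        import java.util.concurrent.ConcurrentLinkedQueue;

        public class LinearizabilityDemo {
            public static void main(String[] args) throws InterruptedException {
                ConcurrentLinkedQueue<Integer> q = new ConcurrentLinkedQueue<>();
                Thread a = new Thread(() -> q.offer(1)); // enqueue(1) and enqueue(2) overlap
                Thread b = new Thread(() -> q.offer(2));
                a.start(); b.start(); a.join(); b.join();
                // Both orders enqueue(1);enqueue(2) and enqueue(2);enqueue(1) are valid
                // linearizations, so the head may be 1 or 2 on different runs.
                System.out.println("dequeued " + q.poll());
            }
        }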

    Real-Time Wait-Free Queues using Micro-Transactions


    Data Structures with Parallel Access

    Parallel programming brings, apart from the opportunity to spread a program's execution across many simultaneously running processes that share data, some new problems. These processes running in parallel must be synchronized so that communication and data sharing do not cause the troubles that arise when many processes run at once. The synchronization algorithms also must not consume too many of the resources otherwise available to the actual program. This thesis describes ways of synchronizing processes and provides implementations of several algorithms for a parallel queue, which were also tested for their performance.
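
    One classic lock-based algorithm such comparisons typically include is the Michael & Scott two-lock queue, sketched below: a dummy node lets enqueuers and dequeuers synchronize on separate locks, so producers and consumers do not block each other. This is a generic textbook sketch, not necessarily one of the thesis's implementations.

        import java.util.concurrent.locks.ReentrantLock;

        final class TwoLockQueue<T> {
            private static final class Node<T> {
                final T value; volatile Node<T> next;
                Node(T v) { value = v; }
            }

            private Node<T> head = new Node<>(null); // dummy node
            private Node<T> tail = head;
            private final ReentrantLock headLock = new ReentrantLock(); // dequeuers
            private final ReentrantLock tailLock = new ReentrantLock(); // enqueuers

            void enqueue(T v) {
                Node<T> n = new Node<>(v);
                tailLock.lock();
                try { tail.next = n; tail = n; } finally { tailLock.unlock(); }
            }

            T dequeue() {
                headLock.lock();
                try {
                    Node<T> first = head.next;
                    if (first == null) return null; // empty
                    head = first;                   // old dummy becomes garbage
                    return first.value;
                } finally { headLock.unlock(); }
            }
        }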

    Towards Scalable Parallel Fibonacci Heap Implementation

    With the advancement of multi-processor machines, sequential algorithms are being investigated and gradually substituted with their concurrent equivalents to effectively exploit the parallel architecture. Parallel algorithms speed up performance by dividing the task into a number of processes (or threads) that can be scheduled and executed simultaneously on independent processing units. Various well-known basic algorithms and data structures have been explored for efficient parallel counterparts, which have been published as popular libraries. However, advanced data structures and algorithms have not seen similar investigation, mainly because they involve many optimization steps, mostly backed by many states, and finding a safe and efficient parallel implementation is not an easy endeavor. Safety concerns for shared-memory parallel implementations are of utmost importance, as safety provides the basis for the consistency of any data structure and algorithm. Well-known tools such as locks, semaphores, and atomic operations assist in safe parallel implementation, but using them effectively and with well-defined synchronization is a key factor in the overall performance of any data structure or algorithm. This paper explores an advanced data structure, the Fibonacci Heap, and its operations, evaluating implementations that use two different synchronization mechanisms: coarse-grained and fine-grained. The analysis in this paper shows that a fine-grained synchronized Fibonacci Heap implementation with certain relaxed semantics is more scalable with a growing number of concurrent threads than the coarse-grained synchronized Fibonacci Heap implementation.
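
    For context, the coarse-grained variant in such a comparison amounts to one global lock serializing every heap operation, which caps scalability regardless of thread count. A minimal sketch of that baseline, using java.util.PriorityQueue as a stand-in for a hand-written Fibonacci heap:

        import java.util.PriorityQueue;
        import java.util.concurrent.locks.ReentrantLock;

        final class CoarseGrainedHeap<T extends Comparable<T>> {
            private final PriorityQueue<T> heap = new PriorityQueue<>();
            private final ReentrantLock lock = new ReentrantLock(); // one lock for everything

            void insert(T x) {
                lock.lock();
                try { heap.add(x); } finally { lock.unlock(); }
            }

            T extractMin() {
                lock.lock();
                try { return heap.poll(); } finally { lock.unlock(); }
            }
        }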