3,539 research outputs found

    A Wait-free Multi-word Atomic (1,N) Register for Large-scale Data Sharing on Multi-core Machines

    Get PDF
    We present a multi-word atomic (1,N) register for multi-core machines exploiting Read-Modify-Write (RMW) instructions to coordinate the writer and the readers in a wait-free manner. Our proposal, called Anonymous Readers Counting (ARC), enables large-scale data sharing by admitting up to 232āˆ’22^{32}-2 concurrent readers on off-the-shelf 64-bits machines, as opposed to the most advanced RMW-based approach which is limited to 58 readers. Further, ARC avoids multiple copies of the register content when accessing it---this affects classical register's algorithms based on atomic read/write operations on single words. Thus it allows for higher scalability with respect to the register size. Moreover, ARC explicitly reduces improves performance via a proper limitation of RMW instructions in case of read operations, and by supporting constant time for read operations and amortized constant time for write operations. A proof of correctness of our register algorithm is also provided, together with experimental data for a comparison with literature proposals. Beyond assessing ARC on physical platforms, we carry out as well an experimentation on virtualized infrastructures, which shows the resilience of wait-free synchronization as provided by ARC with respect to CPU-steal times, proper of more modern paradigms such as cloud computing.Comment: non

    High performance constraint satisfaction problem solving: State-recomputation versus state-copying.

    Get PDF
    Constraint Satisfaction Problems (CSPs) in Artificial Intelligence have been an important focus of research and have been a useful model for various applications such as scheduling, image processing and machine vision. CSPs are mathematical problems that try to search values for variables according to constraints. There are many approaches for searching solutions of non-binary CSPs. Traditionally, most CSP methods rely on a single processor. With the increasing popularization of multiple processors, parallel search methods are becoming alternatives to speed up the search process. Parallel search is a subfield of artificial intelligence in which the constraint satisfaction problem is centralized whereas the search processes are distributed among the different processors. In this thesis we present a forward checking algorithm solving non-binary CSPs by distributing different branches to different processors via message passing interface and execute it on a high performance distributed system called SHARCNET. However, the problem is how to efficiently communicate the state of the search among processors. Two communication models, namely, state-recomputation and state-copying via message passing, are implemented and evaluated. This thesis investigates the behaviour of communication from one process to another. The experimental results demonstrate that the state-recomputation model with tighter constraints obtains a better performance than the state-copying model, but when constraints become looser, the state-copying model is a better choice.Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .Y364. Source: Masters Abstracts International, Volume: 44-01, page: 0417. Thesis (M.Sc.)--University of Windsor (Canada), 2005

    Anonymous Readers Counting: A Wait-Free Multi-Word Atomic Register Algorithm for Scalable Data Sharing on Multi-Core Machines

    Get PDF
    In this article we present Anonymous Readers Counting (ARC), a multi-word atomic (1,N) register algorithm for multi-core machines. ARC exploits Read-Modify-Write (RMW) instructions to coordinate the writer and reader threads in a wait-free manner and enables large-scale data sharing by admitting up to (232-2) concurrent readers on off-the-shelf 64-bit machines, as opposed to the most advanced RMW-based approach which is limited to 58 readers on the same kind of machines. Further, ARC avoids multiple copies of the register content when accessing it - this is a problem that affects classical register algorithms based on atomic read/write operations on single words. Thus it allows for higher scalability with respect to the register size. Moreover, ARC explicitly reduces the overall power consumption, via a proper limitation of RMW instructions in case of read operations re-accessing a still-valid snapshot of the register content, and by showing constant time for read operations and amortized constant time for write operations. Our proposal has therefore a strong focus on real-world off-the-shelf architectures, allowing us to capture properties which benefit both performance and power consumption. A proof of correctness of our register algorithm is also provided, together with experimental data for a comparison with literature proposals. Beyond assessing ARC on physical platforms, we carry out as well an experimentation on virtualized infrastructures, which shows the resilience of wait-free synchronization as provided by ARC with respect to CPU-steal times, proper of modern paradigms such as cloud computing. Finally, we discuss how to extend ARC for scenarios with multiple writers and multiple readers - the so called (M,N) register. This is achieved not by changing the operations (and their wait-free nature) executed along the critical path of the threads, rather only changing the ratio between the number of buffers keeping the register snapshots and the number of threads to coordinate, as well as the number of bits used for counting readers within a 64-bit mask accessed via RMW instructions - just depending on the target balance between the number of readers and the number of writers to be supported

    Parallelism in Constraint Programming

    Get PDF
    Writing efficient parallel programs is the biggest challenge of the software industry for the foreseeable future. We are currently in a time when parallel computers are the norm, not the exception. Soon, parallel processors will be standard even in cell phones. Without drastic changes in hardware development, all software must be parallelized to its fullest extent. Parallelism can increase performance and reduce power consumption at the same time. Many programs will execute faster on a dual-core processor than a single core processor running at twice the speed. Halving the speed of a processor can reduce the power consumption up to four times. Hence, parallelism gives more performance per unit of power to efficient programs. In order to make use of parallel hardware, we need to overcome the difficulties of parallel programming. To many programmers, it is easier to learn a handful of small domain-specific programming languages than to learn efficient parallel programming. The frameworks for these languages can then automatically parallelize the program. Automatically parallelizing traditional programs is usually much more difficult. In this thesis, we study and present parallelism in constraint programming (CP). We have developed the first constraint framework that automatically parallelizes both the consistency and the search of the solving process. This allows programmers to avoid the difficult issues of parallel programming. We also study distributed CP with independent agents and propose solutions to this problem. Our results show that automatic parallelism in CP can provide very good performance. Our parallel consistency scales very well for problems with many large constraints. We also manage to combine parallel consistency and parallel search with a performance increase. The communication and load-balancing schemes we developed increase the scalability of parallel search. Our model for distributed CP is orders of magnitude faster than traditional approaches. As far as we know, it is the first to solve standard benchmark scheduling problems

    Nemos: a framework for axiomatic and executable specifications of memory consistency models

    Get PDF
    technical reportConforming to the underlying memory consistency rules is a fundamental require- ment for implementing shared memory systems and writing multiprocessor programs. In order to promote understanding and enable automated verification, it is highly desir- able that a memory model specification be both declarative and executable. We have developed a specification framework called Nemos (Non-operational yet Executable Memory Ordering Specifications), which employs a uniform notation based on predi- cate logic to define shared memory semantics in an axiomatic as well as compositional style. In this paper, we present this framework and discuss how constraint logic pro- gramming and SAT solving can be used to make these axiomatic specifications exe- cutable for memory model analysis, thus supporting precise specification and automatic execution in the same framework. To illustrate our approach, this paper formalizes a collection of well known memory models, including sequential consistency, coherence, PRAM, causal consistency, and processor consistency

    A Taxonomy of Self-configuring Service Discovery Systems

    Get PDF
    We analyze the fundamental concepts and issues in service discovery. This analysis places service discovery in the context of distributed systems by describing service discovery as a third generation naming system. We also describe the essential architectures and the functionalities in service discovery. We then proceed to show how service discovery fits into a system, by characterizing operational aspects. Subsequently, we describe how existing state of the art performs service discovery, in relation to the operational aspects and functionalities, and identify areas for improvement

    Proceedings of the real-time database workshop, Eindhoven, 23 February 1995

    Get PDF

    The parallel computation of morse-smale complexes

    Get PDF
    pre-printTopology-based techniques are useful for multi-scale exploration of the feature space of scalar-valued functions, such as those derived from the output of large-scale simulations. The Morse-Smale (MS) complex, in particular, allows robust identification of gradient-based features, and therefore is suitable for analysis tasks in a wide range of application domains. In this paper, we develop a two-stage algorithm to construct the Morse-Smale complex in parallel, the first stage independently computing local features per block and the second stage merging to resolve global features. Our implementation is based on MPI and a distributed-memory architecture. Through a set of scalability studies on the IBM Blue Gene/P supercomputer, we characterize the performance of the algorithm as block sizes, process counts, merging strategy, and levels of topological simplification are varied, for datasets that vary in feature composition and size. We conclude with a strong scaling study using scientific datasets computed by combustion and hydrodynamics simulations
    • ā€¦
    corecore