
    Clarifying and compiling C/C++ concurrency: from C++11 to POWER

    The upcoming C and C++ revised standards add concurrency to the languages, for the first time, in the form of a subtle *relaxed memory model* (the *C++11 model*). This aims to permit compiler optimisation and to accommodate the differing relaxed-memory behaviours of mainstream multiprocessors, combining simple semantics for most code with high-performance *low-level atomics* for concurrency libraries. In this paper, we first establish two simpler but provably equivalent models for C++11, one for the full language and another for the subset without consume operations. Subsetting further to the fragment without low-level atomics, we identify a subtlety arising from atomic initialisation and prove that, under an additional condition, the model is equivalent to sequential consistency for race-free programs.
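
    To make the abstract's distinction concrete, here is a minimal C++11 sketch (an illustration, not code from the paper): the release/acquire pair is the kind of low-level atomic the C++11 model gives semantics to, while the plain write to payload stays race-free because it is ordered through the atomics.

        // Minimal C++11 sketch (illustration, not from the paper): a
        // release/acquire pair, the style of low-level atomic the C++11
        // model defines, publishing a plain (non-atomic) value.
        #include <atomic>
        #include <cassert>
        #include <thread>

        std::atomic<bool> ready{false};
        int payload = 0; // non-atomic: ordered only via the atomics below

        void producer() {
            payload = 42;                                 // plain write
            ready.store(true, std::memory_order_release); // publish it
        }

        void consumer() {
            while (!ready.load(std::memory_order_acquire)) {} // spin
            assert(payload == 42); // the acquire load synchronises with
                                   // the release store, so this holds
        }

        int main() {
            std::thread t1(producer), t2(consumer);
            t1.join();
            t2.join();
        }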

    Formal Modelling, Testing and Verification of HSA Memory Models using Event-B

    The HSA Foundation has produced the HSA Platform System Architecture Specification, which goes a long way towards addressing the need for a clear and consistent method of specifying weakly consistent memory. HSA is specified in natural language, which leaves it open to multiple ambiguous interpretations and could lead to bugs in hardware and software implementations. In this paper we present a formal model of HSA which can be used both in the development and verification of concurrent software applications and in the development and verification of the HSA-compliant platform itself. We use the Event-B language to build a provably correct hierarchy of models, from the most abstract down to a detailed refinement of HSA close to the implementation level. Our memory models are general in that they represent an arbitrary number of masters, programs and instruction interleavings, and we reason about such general models using refinements. Using the Rodin tool we are able to model and verify an entire hierarchy of models, with proofs establishing that each refinement is correct. We define an automated validation method that allows us to test baseline compliance of the model against a suite of published HSA litmus tests. Once model validation is complete, we develop a coverage-driven method to extract a richer set of tests from the Event-B model and a user-specified coverage model; these tests are used for extensive regression testing of hardware and software systems. Our method of refinement-based formal modelling, baseline compliance testing of the model and coverage-driven test extraction, all in the single language of Event-B, is a new way to address a key challenge facing the design and verification of multi-core systems. Comment: 9 pages, 10 figures
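
    The litmus tests mentioned above are small concurrent programs whose set of permitted final states characterises a memory model. As a rough illustration (not from the paper; HSA litmus tests are written against HSAIL rather than C++), here is the classic message-passing (MP) shape in C++:

        // Message-passing (MP) litmus test, sketched in C++ (HSA litmus
        // tests target HSAIL; this only illustrates the test's shape).
        // With relaxed atomics the "weak" outcome r0 == 1 && r1 == 0 is
        // permitted; a memory model's job is to say so unambiguously.
        #include <atomic>
        #include <cstdio>
        #include <thread>

        int main() {
            for (int i = 0; i < 100000; ++i) {
                std::atomic<int> x{0}, y{0};
                int r0 = -1, r1 = -1;
                std::thread t0([&] {
                    x.store(1, std::memory_order_relaxed);
                    y.store(1, std::memory_order_relaxed);
                });
                std::thread t1([&] {
                    r0 = y.load(std::memory_order_relaxed);
                    r1 = x.load(std::memory_order_relaxed);
                });
                t0.join();
                t1.join();
                if (r0 == 1 && r1 == 0)
                    std::printf("weak outcome at iteration %d\n", i);
            }
        }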

    The use of model-checking for the verification of concurrent algorithms

    The design of concurrent algorithms tends to be a long and difficult process. Increasing the number of concurrent entities to realistic numbers makes manual verification of these algorithms almost impossible. Designers normally resort to running these algorithms exhaustively, yet can never guarantee their correctness. In this report, we propose the use of a model-checker (SMV) as a machine-automated tool for the verification of these algorithms. We present methods by which this tool can be used to encode algorithms and to guarantee properties, both for uni-processor machines running a scheduler and for SMP machines. We also present a language generator that allows the designer to use a description language which is automatically converted to the model-checker's native language. We show how this approach was successful in encoding a concurrent algorithm and verifying the desired properties. (Peer-reviewed)
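
    As an example of the kind of algorithm such a model-checker is pointed at (an illustration; the report does not say which algorithm was encoded), consider Peterson's two-process mutual-exclusion lock. The properties one would ask SMV to guarantee are mutual exclusion and absence of starvation:

        // Peterson's two-process lock, the classic style of concurrent
        // algorithm encoded for model-checking (sketch; default seq_cst
        // atomics, under which the textbook correctness argument holds).
        #include <atomic>

        class PetersonLock {
            std::atomic<bool> flag[2] = {{false}, {false}};
            std::atomic<int> turn{0};
        public:
            void lock(int id) {       // id is 0 or 1
                int other = 1 - id;
                flag[id].store(true); // I want to enter
                turn.store(other);    // but you may go first
                while (flag[other].load() && turn.load() == other) {
                    // spin: the other thread wants in and it is its turn
                }
            }
            void unlock(int id) { flag[id].store(false); }
        };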

    Execution replay and debugging

    As most parallel and distributed programs are internally non-deterministic -- consecutive runs with the same input might result in a different program flow -- vanilla cyclic debugging techniques are useless on them. To make cyclic debugging possible, we need a tool that records information about an execution so that it can be replayed for debugging. Because recording information interferes with the execution, we must limit the amount of information recorded and keep its processing fast. This paper contains a survey of existing execution replay techniques and tools. Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003
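
    As a concrete illustration of the idea these tools share (a sketch, not any surveyed tool): record only the order of synchronisation operations, which is cheap, and force the same order on replay.

        // Record/replay sketch: log the order in which threads acquire a
        // lock, then enforce that order when replaying. Logging only
        // synchronisation order keeps the record phase cheap.
        #include <condition_variable>
        #include <cstddef>
        #include <mutex>
        #include <vector>

        class ReplayMutex {
            std::mutex inner;            // the lock guarding the data
            std::mutex meta;             // protects log/cursor below
            std::condition_variable cv;
            std::vector<int> log;        // recorded acquisition order
            std::size_t cursor = 0;      // next entry to honour on replay
            bool replaying = false;
        public:
            void lock(int tid) {
                if (replaying) {         // wait for this thread's turn
                    std::unique_lock<std::mutex> g(meta);
                    cv.wait(g, [&] {
                        return cursor < log.size() && log[cursor] == tid;
                    });
                }
                inner.lock();
                if (!replaying) {        // record who acquired, in order
                    std::lock_guard<std::mutex> g(meta);
                    log.push_back(tid);
                }
            }
            void unlock() {
                inner.unlock();
                if (replaying) {
                    std::lock_guard<std::mutex> g(meta);
                    ++cursor;
                    cv.notify_all();
                }
            }
            void startReplay() { cursor = 0; replaying = true; }
        };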

    Non-intrusive on-the-fly data race detection using execution replay

    This paper presents a practical solution for detecting data races in parallel programs. The solution combines execution replay (RecPlay) with automatic on-the-fly data race detection, which enables us to perform the data race detection on an unaltered execution (almost no probe effect). Furthermore, the use of multilevel bitmaps and snooped matrix clocks limits the amount of memory used. As the record phase of RecPlay is highly efficient, there is no need to switch it off, thereby eliminating the possibility of Heisenbugs, because tracing can be left on all the time. Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003
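
    The core of on-the-fly happens-before race detection can be sketched with plain vector clocks (a simplification; the paper's multilevel bitmaps and snooped matrix clocks are memory-saving refinements of this idea, and read/write distinctions are omitted here):

        // Happens-before race check with vector clocks (sketch). Two
        // accesses race if neither is ordered before the other; here we
        // check only write-after-write, for brevity.
        #include <cstddef>
        #include <vector>

        using VClock = std::vector<unsigned>; // one entry per thread

        // a happened-before b iff a <= b pointwise and a != b
        bool happensBefore(const VClock& a, const VClock& b) {
            bool strictlyLess = false;
            for (std::size_t i = 0; i < a.size(); ++i) {
                if (a[i] > b[i]) return false;
                if (a[i] < b[i]) strictlyLess = true;
            }
            return strictlyLess;
        }

        struct VarState {
            VClock lastWrite; // clock at the last write to this variable
        };

        // A new write races with the previous one unless that previous
        // write happened-before the writing thread's current clock.
        bool writeRaces(const VarState& v, const VClock& now) {
            return !v.lastWrite.empty() && !happensBefore(v.lastWrite, now);
        }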

    Harvesting graphics power for MD simulations

    We discuss an implementation of molecular dynamics (MD) simulations on a graphics processing unit (GPU) in the NVIDIA CUDA language. We tested our code on a modern GPU, the NVIDIA GeForce 8800 GTX. We present results for two MD algorithms, suitable for short-ranged and long-ranged interactions respectively, and for a congruential shift random number generator. The performance of the GPU is compared with that of its main-processor counterpart; we achieve speedups of up to 80, 40 and 150 fold, respectively. With the newest generation of GPUs one can run standard MD simulations at 10^7 flops/$. Comment: 12 pages, 5 figures. Submitted to Mol. Si
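
    The random number generator mentioned is a good example of why such kernels map well to GPUs: each thread can carry its own tiny generator state. A minimal linear-congruential sketch (illustrative constants from Numerical Recipes; the paper's exact congruential shift generator may differ):

        #include <cstdint>

        // Per-thread linear congruential generator (LCG) sketch:
        // x_{n+1} = a * x_n + c (mod 2^32). Small per-thread state is
        // what makes this style of RNG GPU-friendly.
        struct LCG {
            std::uint32_t state;
            explicit LCG(std::uint32_t seed) : state(seed) {}
            std::uint32_t next() {
                state = 1664525u * state + 1013904223u;
                return state;
            }
            float uniform() { // uniform float in [0, 1)
                return next() * (1.0f / 4294967296.0f);
            }
        };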

    All-Pairs Shortest Path Algorithms Using CUDA

    Utilising graph theory is a common activity in computer science. Algorithms that perform computations on large graphs are not always cost-effective, requiring supercomputers to achieve results in a practical amount of time. Graphics Processing Units provide a cost-effective alternative to supercomputers, allowing parallel algorithms to be executed directly on them. Several algorithms exist to solve the All-Pairs Shortest Path problem on the Graphics Processing Unit, but it can be difficult to determine whether the claims made are true and to verify the results listed. This research asks: "Which All-Pairs Shortest Path algorithms solve the All-Pairs Shortest Path problem the fastest, and can the authors' claims be verified?" The results we obtain in answering this question show why it is important to collate existing work and analyse it on a common platform, so that fair results are observed on a single system. In this way, the research shows how effective each algorithm is at performing its task, and suggests when a certain algorithm might be preferred over another.
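
    For reference, the classic APSP formulation that GPU implementations typically parallelise is Floyd-Warshall (sequential C++ sketch below; on a GPU the two inner loops become a kernel launched once per value of k; this is background, not one of the compared implementations):

        #include <vector>

        const int INF = 1 << 29; // "no edge"; INF + INF still fits in int

        // Floyd-Warshall: after iteration k, d[i][j] is the shortest
        // i -> j distance using only intermediate vertices 0..k.
        void floydWarshall(std::vector<std::vector<int>>& d) {
            const int n = static_cast<int>(d.size());
            for (int k = 0; k < n; ++k)
                for (int i = 0; i < n; ++i)
                    for (int j = 0; j < n; ++j)
                        if (d[i][k] + d[k][j] < d[i][j])
                            d[i][j] = d[i][k] + d[k][j];
        }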

    Ensuring performance and correctness for legacy parallel programs

    Modern computers are based on manycore architectures, with multiple processors on a single silicon chip. In this environment programmers are required to make use of parallelism to fully exploit the available cores, either within a single chip, normally using shared-memory programming, or at the larger scale of a cluster of chips, normally using message-passing. Legacy programs written in either paradigm face issues when run on modern manycore architectures. In message-passing the problem is performance-related: clusters based on manycores introduce necessarily tiered topologies that unaware programs may not fully exploit. In shared-memory it is a correctness problem: modern systems employ more relaxed memory consistency models, on which legacy programs were not designed to operate. Solutions to this correctness problem exist, but introduce a performance problem of their own, as they are necessarily conservative. This thesis addresses these problems, largely through compile-time analysis and transformation. The first technique proposed is a method for statically determining the communication graph of an MPI program, which is then used to optimise process placement in a cluster of CMPs. Using the 64-process versions of the NAS parallel benchmarks, we see an average 28% (7%) improvement in communication localisation over by-rank scheduling for 8-core (12-core) CMP-based clusters, representing the maximum possible improvement. Secondly, we move to the shared-memory paradigm, identifying and proving necessary conditions for a read to be an acquire. This can be used to improve solutions in several application areas, two of which we then explore. We apply our acquire signatures to the problem of fence placement for legacy well-synchronised programs, and find that they reduce the number of fences placed by an average of 62%, leading to a speedup of up to 2.64x over an existing practical technique. Finally, we develop a dynamic synchronisation detection tool known as SyncDetect. This proof-of-concept tool leverages our acquire signatures to detect ad hoc synchronisations in running programs more accurately, and provides the programmer with a report of their locations in the source code. The tool aims to assist programmers with the notoriously difficult problem of parallel debugging, and with manually porting legacy programs to more modern (relaxed) memory consistency models.
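
    To illustrate what an "acquire read" and an "ad hoc synchronisation" look like in this setting (a hypothetical example, not taken from the thesis), consider a legacy busy-wait on a flag; the flag read is the acquire, and it is the one read that genuinely needs ordering on a relaxed memory model:

        #include <atomic>

        int data = 0;
        std::atomic<int> flag{0}; // legacy code would use a plain int

        void writer() {
            data = 42;                                // publish payload
            flag.store(1, std::memory_order_release); // matching release
        }

        void reader() {
            // Ad hoc synchronisation: busy-wait until the flag flips.
            // This load is the acquire read; a conservative tool would
            // fence many more reads than this one.
            while (flag.load(std::memory_order_acquire) == 0) {}
            int local = data; // safe: ordered after the writer's stores
            (void)local;
        }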

    A compiler cost model for speculative multithreading chip-multiprocessor architectures


    Some aspects of the efficient use of multiprocessor control systems

    Computer technology, particularly at the circuit level, is fast approaching its physical limitations. As future needs for greater power from computing systems grow, increases in circuit switching speed (and thus instruction speed) will be unable to match these requirements. Greater power can also be obtained by incorporating several processing units into a single system, and this ability to increase the performance of a system by the addition of processing units is one of the major advantages of multiprocessor systems. Four major characteristics of multiprocessor systems have been identified (28) which demonstrate their advantage: throughput, flexibility, availability and reliability. The additional throughput obtained from a multiprocessor has been mentioned above. This increase in the power of the system can be obtained in a modular fashion, with extra processors being added as greater processing needs arise. The addition of extra processors also has (in general) the desirable advantage of giving a smoother cost-performance curve (63). Flexibility is obtained from the increased ability to construct a system matching the user's requirements at a given time without placing restrictions upon future expansion. With multiprocessor systems, the potential also exists to make greater use of the resources within the system. Availability and reliability are inter-related. Increased availability is achieved, in a well-designed system, by ensuring that processing capabilities can be provided to the user even if one (or more) of the processing units has failed; the service provided, however, will probably be degraded due to the reduction in processing capacity. Increased reliability is obtained from the ability of the processing units to compensate for the failure of one of their number. This recovery may involve complex software checks and a consequent decrease in available power even when all the units are functioning.