Clarifying and compiling C/C++ concurrency: from C++11 to POWER
The upcoming C and C++ revised standards add concurrency to the languages, for the first time, in the form of a subtle *relaxed memory model* (the *C++11 model*). This aims to permit compiler optimisation and to accommodate the differing relaxed-memory behaviours of mainstream multiprocessors, combining simple semantics for most code with high-performance *low-level atomics* for concurrency libraries. In this paper, we first establish two simpler but provably equivalent models for C++11, one for the full language and another for the subset without consume operations. Subsetting further to the fragment without low-level atomics, we identify a subtlety arising from atomic initialisation and prove that, under an additional condition, the model is equivalent to sequential consistency for race-free programs
Formal Modelling, Testing and Verification of HSA Memory Models using Event-B
The HSA Foundation has produced the HSA Platform System Architecture
Specification that goes a long way towards addressing the need for a clear and
consistent method for specifying weakly consistent memory. HSA is specified in
natural language, which leaves it open to multiple, ambiguous interpretations
and could lead to bugs in its hardware and software implementations. In
this paper we present a formal model of HSA which can be used in the
development and verification of both concurrent software applications as well
as in the development and verification of the HSA-compliant platform itself. We
use the Event-B language to build a provably correct hierarchy of models from
the most abstract to a detailed refinement of HSA close to implementation
level. Our memory models are general in that they represent an arbitrary number
of masters, programs and instruction interleavings. We reason about such
general models using refinements. Using the Rodin tool, we are able to model and
verify an entire hierarchy of models using proofs to establish that each
refinement is correct. We define an automated validation method that allows us
to test baseline compliance of the model against a suite of published HSA
litmus tests. Once we complete model validation we develop a coverage driven
method to extract a richer set of tests from the Event-B model and a user
specified coverage model. These tests are used for extensive regression testing
of hardware and software systems. Our method of refinement based formal
modelling, baseline compliance testing of the model and coverage driven test
extraction using the single language of Event-B is a new way to address a key
challenge facing the design and verification of multi-core systems.
Comment: 9 pages, 10 figures
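A litmus test, as mentioned in the abstract above, is a very small concurrent program paired with a question about which final outcomes a memory model allows. The sketch below is only an illustration of that idea: it enumerates the outcomes of the classic store-buffering test under plain sequential consistency, which is an assumption made for simplicity and is stronger than HSA's weakly consistent model. The names `interleavings`, `run` and `sc_outcomes` are hypothetical, and nothing here reflects the Event-B models or the Rodin tooling.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


# Store-buffering litmus test: each thread writes one location,
# then reads the other location into a register.
THREAD_0 = [("store", "x", 1), ("load", "y", "r0")]
THREAD_1 = [("store", "y", 1), ("load", "x", "r1")]


def run(schedule):
    """Execute one interleaving against a single shared memory (the SC assumption)."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for op, location, arg in schedule:
        if op == "store":
            memory[location] = arg
        else:  # load: arg is the destination register name
            registers[arg] = memory[location]
    return (registers["r0"], registers["r1"])


def sc_outcomes(t0, t1):
    """Collect the (r0, r1) results over all sequentially consistent interleavings."""
    return {run(schedule) for schedule in interleavings(t0, t1)}


outcomes = sc_outcomes(THREAD_0, THREAD_1)
print(sorted(outcomes))
```

Under this assumption the outcome r0 = 0, r1 = 0 never appears, because some store always precedes both loads. A weaker model may allow it, which is why tests of this shape are useful for checking what a given model permits.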
The use of model-checking for the verification of concurrent algorithms
The design of concurrent algorithms tends to be a long and difficult process. Increasing the number of concurrent entities to realistic numbers makes manual verification of these algorithms almost impossible. Designers normally resort to running these algorithms exhaustively, yet can never be certain of their correctness. In this report, we propose the use of a model-checker (SMV) as a machine-automated tool for the verification of these algorithms. We present methods by which this tool can be used to encode algorithms and guarantee properties for uni-processor machines running a scheduler or for SMP machines. We also present a language generator that lets the designer use a description language that is then automatically converted to the model-checker's native language. We show how this approach succeeded in encoding a concurrent algorithm and verifying the desired properties.
peer-reviewed
Execution replay and debugging
As most parallel and distributed programs are internally non-deterministic --
consecutive runs with the same input might result in a different program flow
-- vanilla cyclic debugging techniques as such are useless. In order to use
cyclic debugging tools, we need a tool that records information about an
execution so that it can be replayed for debugging. Because recording
information interferes with the execution, we must limit the amount of
information and keep the processing of the information fast. This paper
contains a survey of existing execution replay techniques and tools.
Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop
on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003
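The record-then-replay idea in the abstract above can be sketched with a toy model. This is a minimal illustration, not any of the surveyed tools: the simulated "threads" each perform unsynchronised read/increment/write steps on a shared counter, so the final value depends on the schedule. The functions `run`, `record` and `replay` are hypothetical names, and logging every scheduling decision is a simplifying assumption; the abstract notes that the amount of recorded information must be limited.

```python
import random


def run(num_threads, increments, choose):
    """Simulate threads that each do `increments` unsynchronised increments.

    Every increment is split into a read step and a write step, and
    `choose(runnable)` decides which thread takes the next step.
    """
    counter = 0
    local = [0] * num_threads
    steps_left = [2 * increments] * num_threads
    while any(steps_left):
        runnable = [t for t in range(num_threads) if steps_left[t] > 0]
        t = choose(runnable)
        if steps_left[t] % 2 == 0:
            local[t] = counter       # read phase
        else:
            counter = local[t] + 1   # write phase (may overwrite another thread's update)
        steps_left[t] -= 1
    return counter


def record(num_threads, increments, rng):
    """Run with a random scheduler and log every scheduling decision."""
    log = []

    def choose(runnable):
        t = rng.choice(runnable)
        log.append(t)
        return t

    return run(num_threads, increments, choose), log


def replay(num_threads, increments, log):
    """Re-run the program, forcing the scheduler to follow the recorded log."""
    decisions = iter(log)

    def choose(runnable):
        t = next(decisions)
        assert t in runnable, "log does not match this program"
        return t

    return run(num_threads, increments, choose)


recorded_result, schedule_log = record(3, 4, random.Random(0))
replayed_result = replay(3, 4, schedule_log)
print(recorded_result, replayed_result)
```

Because the replay follows the same schedule, it reproduces the recorded result, so a cyclic debugger could be attached to the replayed run instead of the original one.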
Non-intrusive on-the-fly data race detection using execution replay
This paper presents a practical solution for detecting data races in parallel
programs. The solution consists of a combination of execution replay (RecPlay)
with automatic on-the-fly data race detection. This combination enables us to
perform the data race detection on an unaltered execution (almost no probe
effect). Furthermore, the usage of multilevel bitmaps and snooped matrix clocks
limits the amount of memory used. As the record phase of RecPlay is highly
efficient, there is no need to switch it off, thereby eliminating the
possibility of Heisenbugs because tracing can be left on all the time.
Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop
on Automated Debugging (AAdebug 2000), August 2000, Munich. cs.SE/001003
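For readers unfamiliar with clock-based race detection, the following is a minimal sketch of the general happens-before approach. It uses plain vector clocks over a recorded event list, which is a simplifying assumption: it does not implement the snooped matrix clocks or multilevel bitmaps named in the abstract, and it is not RecPlay. The event format and the function name `find_races` are hypothetical.

```python
def find_races(events, num_threads):
    """Report pairs of conflicting accesses not ordered by happens-before.

    `events` is a list of (thread, op, target) tuples, where op is one of
    "read", "write", "acquire" or "release". Returns index pairs (earlier, later).
    """
    # clocks[t] is thread t's vector clock; its own component starts at 1.
    clocks = [[0] * num_threads for _ in range(num_threads)]
    for t in range(num_threads):
        clocks[t][t] = 1
    lock_clock = {}   # lock name -> vector clock stored at the last release
    accesses = {}     # variable -> list of (thread, timestamp, is_write, index)
    races = []

    for index, (t, op, target) in enumerate(events):
        if op == "acquire":
            released = lock_clock.get(target)
            if released is not None:
                # Everything before the release now happens before this thread.
                clocks[t] = [max(a, b) for a, b in zip(clocks[t], released)]
        elif op == "release":
            lock_clock[target] = list(clocks[t])
            clocks[t][t] += 1
        else:
            is_write = op == "write"
            for other, stamp, was_write, earlier in accesses.get(target, []):
                ordered = stamp <= clocks[t][other]
                if other != t and (is_write or was_write) and not ordered:
                    races.append((earlier, index))
            accesses.setdefault(target, []).append((t, clocks[t][t], is_write, index))
    return races


locked = [
    (0, "acquire", "L"), (0, "write", "x"), (0, "release", "L"),
    (1, "acquire", "L"), (1, "write", "x"), (1, "release", "L"),
]
unlocked = [(0, "write", "y"), (1, "read", "y")]

print(find_races(locked, 2))
print(find_races(unlocked, 2))
```

The sketch keeps every past access to each variable, which is exactly the kind of memory cost that the bitmap and clock techniques named in the abstract are meant to limit.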
Harvesting graphics power for MD simulations
We discuss an implementation of molecular dynamics (MD) simulations on a
graphics processing unit (GPU) in the NVIDIA CUDA language. We tested our code
on a modern GPU, the NVIDIA GeForce 8800 GTX. Results for two MD algorithms
suitable for short-ranged and long-ranged interactions, and a congruential
shift random number generator are presented. The performance of the GPUs is
compared to that of their main processor counterparts. We achieve speedups of up
to 80, 40 and 150 fold, respectively. With the newest generation of GPUs one can
run standard MD simulations at 10^7 flops/$.
Comment: 12 pages, 5 figures. Submitted to Mol. Si
All-Pairs Shortest Path Algorithms Using CUDA
Utilising graph theory is a common activity in computer science. Algorithms that perform computations on large graphs are not always cost effective, requiring supercomputers to achieve results in a practical amount of time. Graphics Processing Units provide a cost effective alternative to supercomputers, allowing parallel algorithms to be executed directly on the Graphics Processing Unit. Several algorithms exist to solve the All-Pairs Shortest Path problem on the Graphics Processing Unit, but it can be difficult to determine whether the claims made are true and verify the results listed. This research asks "Which All-Pairs Shortest Path algorithms solve the All-Pairs Shortest Path problem the fastest, and can the authors' claims be verified?" The results we obtain when answering this question show why it is important to be able to collate existing work, and analyse it on a common platform to observe fair results retrieved from a single system. In this way, the research shows us how effective each algorithm is at performing its task, and suggests when a certain algorithm might be used over another
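As background for the All-Pairs Shortest Path problem discussed above, here is a minimal sequential sketch of the Floyd-Warshall algorithm, one well-known way to solve it. The abstract does not say which algorithms were compared, so treating Floyd-Warshall as representative is an assumption, and the function name and the small example graph are made up for illustration.

```python
import math

INF = math.inf


def floyd_warshall(weights):
    """Return the matrix of shortest path lengths between every pair of vertices.

    `weights[i][j]` is the weight of edge i -> j, INF if there is no edge,
    and 0 on the diagonal. The input matrix is not modified.
    """
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):
        # For a fixed k, the (i, j) updates do not depend on one another.
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


graph = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]

shortest = floyd_warshall(graph)
for row in shortest:
    print(row)
```

The two inner loops are independent for each fixed `k`, which is the property that makes this kind of algorithm a natural candidate for one-thread-per-cell execution on a GPU.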
Ensuring performance and correctness for legacy parallel programs
Modern computers are based on manycore architectures, with multiple processors on
a single silicon chip. In this environment programmers are required to make use of
parallelism to fully exploit the available cores. This can either be within a single chip,
normally using shared-memory programming or at a larger scale on a cluster of chips,
normally using message-passing.
Legacy programs written using either paradigm face issues when run on modern
manycore architectures. In message-passing the problem is performance related,
with clusters based on manycores introducing necessarily tiered topologies that unaware
programs may not fully exploit. In shared-memory it is a correctness problem,
with modern systems employing more relaxed memory consistency models, on which
legacy programs were not designed to operate. Solutions to this correctness problem
exist, but introduce a performance problem as they are necessarily conservative. This
thesis focuses on addressing these problems, largely through compile-time analysis
and transformation.
The first technique proposed is a method for statically determining the communication
graph of an MPI program. This is then used to optimise process placement in
a cluster of CMPs. Using the 64-process versions of the NAS parallel benchmarks,
we see an average of 28% (7%) improvement in communication localisation over by-rank
scheduling for 8-core (12-core) CMP-based clusters, representing the maximum
possible improvement.
Secondly, we move into the shared-memory paradigm, identifying and proving
necessary conditions for a read to be an acquire. This can be used to improve solutions
in several application areas, two of which we then explore.
We apply our acquire signatures to the problem of fence placement for legacy well-synchronised
programs. We find that, by applying our signatures, we can reduce the number
of fences placed by an average of 62%, leading to a speedup of up to 2.64x over an
existing practical technique.
Finally, we develop a dynamic synchronisation detection tool known as SyncDetect.
This proof of concept tool leverages our acquire signatures to more accurately
detect ad hoc synchronisations in running programs and provides the programmer with
a report of their locations in the source code. The tool aims to assist programmers with
the notoriously difficult problem of parallel debugging and in manually porting legacy
programs to more modern (relaxed) memory consistency models
Some aspects of the efficient use of multiprocessor control systems
Computer technology, particularly at the circuit level, is fast
approaching its physical limitations. As future needs for greater
power from computing systems grow, increases in circuit switching
speed (and thus instruction speed) will be unable to match these
requirements.
Greater power can also be obtained by incorporating several processing
units into a single system. This ability to increase the performance
of a system by the addition of processing units is one of the major
advantages of multiprocessor systems. Four major characteristics of
multiprocessor systems have been identified (28) which demonstrate
their advantage. These are:-
Throughput
Flexibility
Availability
Reliability
The additional throughput obtained from a multiprocessor has been
mentioned above. This increase in the power of the system can be
obtained in a modular fashion with extra processors being added as
greater processing needs arise. The addition of extra processors
also has (in general) the desirable advantage of giving a smoother
cost-performance curve (63). Flexibility is obtained from the
increased ability to construct a system matching the user's requirements
at a given time without placing restrictions upon future expansion.
With multiprocessor systems, the potential also exists of making
greater use of the resources within the system.
Availability and reliability are inter-related. Increased availability
is achieved, in a well designed system, by ensuring that processing
capabilities can be provided to the user even if one (or more) of the
processing units has failed. The service provided, however, will
probably be degraded due to the reduction in processing capacity.
Increased reliability is obtained by the ability of the processing
units to compensate for the failure of one of their number. This
recovery may involve complex software checks and a consequent decrease
in available power even when all the units are functioning
…