
    High-performance SIMT code generation in an active visual effects library

    IST Austria Thesis

    In this thesis we present a computer-aided programming approach to concurrency. Our approach helps the programmer by automatically fixing concurrency-related bugs, i.e. bugs that occur when the program is executed using an aggressive preemptive scheduler, but not when using a non-preemptive (cooperative) scheduler. Bugs are program behaviours that are incorrect w.r.t. a specification. We consider both user-provided explicit specifications in the form of assertion statements in the code as well as an implicit specification. The implicit specification is inferred from the non-preemptive behaviour. Let us consider sequences of calls that the program makes to an external interface. The implicit specification requires that any such sequence produced under a preemptive scheduler should be included in the set of sequences produced under a non-preemptive scheduler. We consider several semantics-preserving fixes that go beyond atomic sections typically explored in the synchronisation synthesis literature. Our synthesis is able to place locks, barriers and wait-signal statements and, last but not least, to reorder independent statements. The latter may be useful if a thread is released too early, e.g., before some initialisation is completed. We guarantee that our synthesis does not introduce deadlocks and that the synchronisation inserted is optimal w.r.t. a given objective function. We dub our solution trace-based synchronisation synthesis and it is loosely based on counterexample-guided inductive synthesis (CEGIS). The synthesis works by discovering a trace that is incorrect w.r.t. the specification and identifying ordering constraints crucial to trigger the specification violation. Synchronisation may be placed immediately (greedy approach) or delayed until all incorrect traces are found (non-greedy approach). For the non-greedy approach we construct a set of global constraints over synchronisation placements. 
Each model of the global constraints set corresponds to a correctness-ensuring synchronisation placement. The placement that is optimal w.r.t. the given objective function is chosen as the synchronisation solution. We evaluate our approach on a number of realistic (albeit simplified) Linux device-driver benchmarks. The benchmarks are versions of the drivers with known concurrency-related bugs. For the experiments with an explicit specification we added assertions that would detect the bugs in the experiments. Device drivers lend themselves to implicit specification, where the device and the operating system are the external interfaces. Our experiments demonstrate that our synthesis method is precise and efficient. We implemented objective functions for coarse-grained and fine-grained locking and observed that different synchronisation placements are produced for our experiments, favouring e.g. a minimal number of synchronisation operations or maximum concurrency.
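
The implicit specification described in this abstract can be illustrated with a toy model. The sketch below is not the thesis's synthesis algorithm; it only assumes that each thread is a list of blocks of external calls, that a cooperative scheduler switches threads only between blocks, and that a preemptive scheduler may switch between any two calls. The names `interleavings`, `traces`, `driver_init` and `driver_user` are hypothetical.

```python
from itertools import chain

def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a or not b:
        yield tuple(a) + tuple(b)
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest

def traces(t1, t2, preemptive):
    """Set of external-call sequences two threads can produce.

    Each thread is a list of blocks; a block is a tuple of call names.
    Cooperative scheduling switches only between blocks, while
    preemptive scheduling may switch between any two calls.
    """
    if preemptive:
        # Split every block into single-call blocks.
        t1 = [(c,) for block in t1 for c in block]
        t2 = [(c,) for block in t2 for c in block]
    return {tuple(chain.from_iterable(order)) for order in interleavings(t1, t2)}

# Hypothetical driver: initialisation must not be observed half-done.
driver_init = [("init_start", "init_done")]
driver_user = [("use",)]

allowed = traces(driver_init, driver_user, preemptive=False)
observed = traces(driver_init, driver_user, preemptive=True)

# Sequences that violate the implicit specification (not in the allowed set).
violations = observed - allowed
```

In this toy example the only violating sequence is the one where `use` runs between `init_start` and `init_done`, which is the kind of trace a lock or a reordering would have to rule out.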

    Ensuring performance and correctness for legacy parallel programs

    Modern computers are based on manycore architectures, with multiple processors on a single silicon chip. In this environment programmers are required to make use of parallelism to fully exploit the available cores. This can either be within a single chip, normally using shared-memory programming, or at a larger scale on a cluster of chips, normally using message-passing. Legacy programs written using either paradigm face issues when run on modern manycore architectures. In message-passing the problem is performance related, with clusters based on manycores introducing necessarily tiered topologies that unaware programs may not fully exploit. In shared-memory it is a correctness problem, with modern systems employing more relaxed memory consistency models, on which legacy programs were not designed to operate. Solutions to this correctness problem exist, but introduce a performance problem as they are necessarily conservative. This thesis focuses on addressing these problems, largely through compile-time analysis and transformation. The first technique proposed is a method for statically determining the communication graph of an MPI program. This is then used to optimise process placement in a cluster of CMPs. Using the 64-process versions of the NAS parallel benchmarks, we see an average of 28% (7%) improvement in communication localisation over by-rank scheduling for 8-core (12-core) CMP-based clusters, representing the maximum possible improvement. Secondly, we move into the shared-memory paradigm, identifying and proving necessary conditions for a read to be an acquire. This can be used to improve solutions in several application areas, two of which we then explore. We apply our acquire signatures to the problem of fence placement for legacy well-synchronised programs. We find that, by applying our signatures, we can reduce the number of fences placed by an average of 62%, leading to a speedup of up to 2.64x over an existing practical technique. 
Finally, we develop a dynamic synchronisation detection tool known as SyncDetect. This proof-of-concept tool leverages our acquire signatures to more accurately detect ad hoc synchronisations in running programs and provides the programmer with a report of their locations in the source code. The tool aims to assist programmers with the notoriously difficult problem of parallel debugging and in manually porting legacy programs to more modern (relaxed) memory consistency models.
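
To make the idea of communication localisation concrete, here is a minimal sketch. It assumes localisation means the fraction of message volume exchanged between ranks placed on the same node; the abstract does not define the metric, so this definition, the function name `localisation` and the four-rank communication graph are all illustrative assumptions.

```python
def localisation(comm, node_of):
    """Fraction of total traffic exchanged between ranks on the same node.

    comm maps (src_rank, dst_rank) to a message volume;
    node_of maps each rank to the node it is placed on.
    """
    total = 0
    local = 0
    for (src, dst), volume in comm.items():
        total += volume
        if node_of[src] == node_of[dst]:
            local += volume
    return local / total if total else 1.0

# Hypothetical communication graph for 4 ranks: (src, dst) -> bytes sent.
comm = {(0, 2): 100, (2, 0): 100, (1, 3): 100, (3, 1): 100, (0, 1): 10}

# By-rank placement: ranks 0,1 on node 0 and ranks 2,3 on node 1.
by_rank = {0: 0, 1: 0, 2: 1, 3: 1}
# Graph-aware placement: heavy communicators share a node.
graph_aware = {0: 0, 2: 0, 1: 1, 3: 1}

by_rank_score = localisation(comm, by_rank)
graph_aware_score = localisation(comm, graph_aware)
```

With this made-up graph, the graph-aware placement keeps the four heavy edges inside a node, so its score is much higher than the by-rank one.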

    The MOLDY short-range molecular dynamics package

    We describe a parallelised version of the MOLDY molecular dynamics program. This Fortran code is aimed at systems which may be described by short-range potentials and specifically those which may be addressed with the embedded atom method. This includes a wide range of transition metals and alloys. MOLDY provides a range of options in terms of the molecular dynamics ensemble used and the boundary conditions which may be applied. A number of standard potentials are provided, and the modular structure of the code allows new potentials to be added easily. The code is parallelised using OpenMP and can therefore be run on shared memory systems, including modern multicore processors. Particular attention is paid to the updates required in the main force loop, where synchronisation is often required in OpenMP implementations of molecular dynamics. We examine the performance of the parallel code in detail and give some examples of applications to realistic problems, including the dynamic compression of copper and carbon migration in an iron-carbon alloy.

    Metaheuristic approaches for Complete Network Signal Setting Design (CNSSD)

    2014 - 2015. In order to mitigate urban traffic congestion and increase the travelers’ surplus, several policies can be adopted, which may be applied over a short or long time horizon. With regard to short-term policies, one of the most straightforward is control through traffic lights at the single-junction or network level. The main goal of traffic control is to prevent incompatible approaches from having green at the same time. With respect to this aim, existing methodologies for Network Signal Setting Design (NSSD) can be divided into two classes, as described in the following. Approach-based (or Phase-based) methods address the signal setting as a periodic scheduling problem: the cycle length and, for each approach, the start and the end of the green are considered as decision variables; some binary variables (or some non-linear constraints) are included to avoid incompatible approaches having green at the same time (see for instance Improta and Cantarella, 1987). If needed, the stage composition and sequence may easily be obtained from the decision variables. Commercial software codes following this methodology are available for single junction control only, such as Oscady Pro® (TRL, UK; Burrow, 1987). Once the green timing and scheduling have been carried out for each junction, offsets can be optimized (coordination) using the stage matrices obtained from single junction optimization (possibly together with green splits again) through one of the codes mentioned below. Stage-based signal setting methods deal with this by dividing the cycle length into stages, each one being a time interval during which some mutually compatible approaches have green. Stage composition, i.e. which approaches have green, and sequence, i.e. their order, can be represented through the approach-stage incidence matrix, or stage matrix for short. 
Once the stage matrix is given for each junction, the cycle length, the green splits and the offsets can be optimised (synchronisation) through some well-established commercial software codes. Two of the most commonly used codes are TRANSYT14® (TRL, UK) (recently TRANSYT15® has been released) and TRANSYT-7F® (FHWA, USA). Both allow the green splits, the offsets and the cycle length to be computed by combining a traffic flow model and a signal setting optimiser. Both may be used for coordination (optimisation of offsets only, once green splits are known) or synchronisation. TRANSYT14® generates several (but not all) significant stage sequences to be tested, but the optimal solution is not endogenously generated, while TRANSYT-7F® is able to optimise the stage sequence for each single junction starting from the ring and barrier NEMA (i.e. National Electrical Manufacturers Association) phases. Still, these methods do not allow for stage matrix optimisation; moreover, the effects of stage composition and sequence on network performance are not well analysed in the literature... [edited by Author] XIV n.s.
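
The stage matrix mentioned in this abstract can be sketched as a small data structure. The example below is illustrative only: it assumes a row per approach and a column per stage, with 1 meaning green, and checks the basic safety requirement that incompatible approaches never share a stage. The junction layout, the approach names and the function `incompatible_pairs_in_stage` are hypothetical and not taken from the thesis or from any of the commercial codes named above.

```python
def incompatible_pairs_in_stage(stage_matrix, incompatible):
    """Return (stage, a, b) for each incompatible pair green in the same stage.

    stage_matrix maps an approach name to a list of 0/1 flags, one per stage;
    incompatible is a list of (approach, approach) pairs that must not
    have green simultaneously.
    """
    conflicts = []
    n_stages = len(next(iter(stage_matrix.values())))
    for k in range(n_stages):
        for a, b in incompatible:
            if stage_matrix[a][k] and stage_matrix[b][k]:
                conflicts.append((k, a, b))
    return conflicts

# Hypothetical four-arm junction with two stages.
stage_matrix = {
    "north": [1, 0],
    "south": [1, 0],
    "east":  [0, 1],
    "west":  [0, 1],
}
incompatible = [
    ("north", "east"), ("north", "west"),
    ("south", "east"), ("south", "west"),
]

safe = incompatible_pairs_in_stage(stage_matrix, incompatible)

# A faulty matrix in which "east" is also green during stage 0.
faulty = dict(stage_matrix, east=[1, 1])
unsafe = incompatible_pairs_in_stage(faulty, incompatible)
```

A stage matrix optimiser would search over matrices like this one and keep only those for which the returned list is empty.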