
    MDMP: Managed Data Message Passing

    MDMP is a new parallel programming approach that aims to provide users with an easy way to add parallelism to programs, to optimise the message passing costs of traditional scientific simulation algorithms, and to enable existing MPI-based parallel programs to be optimised and extended without requiring the whole code to be re-written from scratch. MDMP uses a directives-based approach that lets users specify what communications should take place in the code, and then implements those communications for the user in an optimal manner, using both the information provided by the user and data collected by instrumenting the code and gathering information on the data to be communicated. This work presents the basic concepts and functionality of MDMP and discusses the performance that can be achieved using our prototype implementation of MDMP on some model scientific simulation applications.
    Comment: Submitted to SC13, 10 pages, 5 figures.
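    As a rough illustration of the directives-based idea, the C sketch below annotates a halo exchange with invented "mdmp" pragmas; the directive names and clauses are placeholders for illustration only, not MDMP's actual syntax.

        /* Hypothetical pragma syntax, invented for illustration; the
         * programmer declares what must be communicated, and the
         * runtime decides when and how to move the data. */
        #define N 1024
        double halo[N];

        void exchange_step(int left, int right)
        {
            /* declare the transfers; nothing need move yet */
            #pragma mdmp send(halo) to(right)
            #pragma mdmp recv(halo) from(left)

            /* ... interior computation that the runtime may
             * overlap with the declared transfers ... */

            /* require the declared data before it is used */
            #pragma mdmp wait(halo)
        }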

    Irregular Computations in Fortran – Expression and Implementation Strategies


    Efficient Machine-Independent Programming of High-Performance Multiprocessors

    Parallel computing is regarded by most computer scientists as the most likely approach for significantly improving computing power for scientists and engineers. Advances in programming languages and parallelizing compilers are making parallel computers easier to use by providing a high-level, portable programming model that protects software investment. However, experience has shown that simply finding parallelism is not always sufficient for obtaining good performance from today's multiprocessors. The goal of this project is to develop the advanced compiler analysis of data and computation decompositions, thread placement, communication, synchronization, and memory-system effects needed to take advantage of performance-critical elements of modern parallel architectures.

    Runtime address space computation for SDSM systems

    This paper explores the benefits and limitations of using an inspector/executor approach for Software Distributed Shared Memory (SDSM) systems. The role of the inspector is to obtain a description of the address space accessed during the execution of parallel loops. The information collected by the inspector enables the runtime to optimize the movement of shared data that will happen during the executor phase. This paper addresses the main issues that have been considered in embedding an inspector/executor model in an SDSM system: the amount of data collected by the inspector, the accuracy of this data when the loop has data and/or control dependences, and the computational overhead introduced. The paper also includes a description of the SDSM system in which the inspector/executor model has been embedded. The proposal is evaluated with four applications from the NAS benchmark suite. The evaluation shows that the accuracy of the inspection and the small overheads introduced by the approach allow its use in an SDSM system.
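    The inspector/executor split described above can be pictured with a short C sketch. The sdsm_record call is a hypothetical stand-in for the runtime interface; the point is that the inspector evaluates only the subscript expressions, so the runtime can move the touched shared pages in bulk before the executor runs.

        #include <stddef.h>

        extern double x[];   /* array living in SDSM shared space */
        /* hypothetical runtime hook; the real interface differs */
        extern void sdsm_record(const void *addr, size_t len);

        void irregular_loop(double *y, const int *idx, int n)
        {
            /* inspector: walk the subscripts only, recording which
             * shared addresses the executor will touch */
            for (int i = 0; i < n; i++)
                sdsm_record(&x[idx[i]], sizeof(double));
            /* the runtime now fetches the described pages in bulk */

            /* executor: the actual computation, on local copies */
            for (int i = 0; i < n; i++)
                y[i] += x[idx[i]];
        }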

    Type Oriented Parallel Programming

    Context: Parallel computing is an important field within the sciences. With the emergence of multi-core, and soon many-core, CPUs, it is moving more and more into the domain of general computing. HPC programmers want performance, but at the moment this comes at a cost: parallel languages are either efficient or conceptually simple, but not both. Aim: To develop and evaluate a novel programming paradigm which addresses the problem of parallel programming and allows for languages which are both conceptually simple and efficient. Method: A type-based approach has been developed which allows the programmer to control all aspects of parallelism through the use and combination of types. As a vehicle to present and analyze this new paradigm, a parallel language, Mesham, and associated compilation tools have also been created. By using types to express parallelism, the programmer can exercise efficient, flexible control in a high-level abstract model, while the source code retains sufficiently rich information for the compiler to perform static analysis and optimization. Results: A number of case studies have been implemented in Mesham. Official benchmarks have been run which demonstrate that the paradigm allows one to write code that is comparable, in terms of performance, with existing high-performance solutions. Sections of the parallel simulation package Gadget-2 have been ported into Mesham, where substantial code simplifications have been made. Conclusions: The results obtained indicate that the type-based approach satisfies the aim of the research described in this thesis. Using this new paradigm, the programmer is able to write parallel code which is both simple and efficient.

    Array optimizations for high productivity programming languages

    While the HPCS languages (Chapel, Fortress and X10) have introduced improvements in programmer productivity, several challenges still remain in delivering high performance. In the absence of optimization, the high-level language constructs that improve productivity can result in order-of-magnitude runtime performance degradations. This dissertation addresses the problem of efficient code generation for high-level array accesses in the X10 language. The X10 language supports rank-independent specification of loop and array computations using regions and points. Three aspects of high-level array accesses in X10 are important for productivity but also pose significant performance challenges: high-level accesses are performed through Point objects rather than integer indices, variables containing references to arrays are rank-independent, and array subscripts are verified as legal array indices during runtime program execution. Our solution to the first challenge is to introduce new analyses and transformations that enable automatic inlining and scalar replacement of Point objects. Our solution to the second challenge is a hybrid approach: we use an interprocedural rank analysis algorithm to automatically infer the ranks of arrays in X10, and we use the rank analysis information to enable storage transformations on arrays. If rank-independent array references still remain after compiler analysis, the programmer can use X10's dependent type system to safely annotate array variable declarations with additional information about the rank and region of the variable, enabling the compiler to generate efficient code where the dependent type information is available. Our solution to the third challenge is a new interprocedural array bounds analysis that uses regions to automatically determine when runtime bounds checks are not needed. Our performance results show that these optimizations deliver performance rivalling that of hand-tuned code with explicit rank-specific loops and lower-level array accesses, and up to two orders of magnitude faster than unoptimized, high-level X10 programs. The optimizations also improve the scalability of X10 programs as the number of CPUs increases. While we perform the optimizations primarily in X10, the techniques are applicable to other high-productivity languages such as Chapel and Fortress.
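    The first of these transformations can be pictured in C, with a struct standing in for an X10 Point object (the actual optimization operates on X10 code, so this is only an analogue): the high-level form materializes an index object per iteration, and inlining plus scalar replacement reduces its fields to the plain loop variables.

        struct point2 { int i, j; };   /* stand-in for an X10 Point */

        double sum_highlevel(double a[10][10])
        {
            double s = 0.0;
            for (int i = 0; i < 10; i++)
                for (int j = 0; j < 10; j++) {
                    struct point2 p = { i, j };  /* per-iteration object */
                    s += a[p.i][p.j];
                }
            return s;
        }

        /* after inlining and scalar replacement of the Point */
        double sum_optimized(double a[10][10])
        {
            double s = 0.0;
            for (int i = 0; i < 10; i++)
                for (int j = 0; j < 10; j++)
                    s += a[i][j];
            return s;
        }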

    Optimization techniques for fine-grained communication in PGAS environments

    Partitioned Global Address Space (PGAS) languages promise to deliver improved programmer productivity and good performance on large-scale parallel machines. However, adequate performance for applications that rely on fine-grained communication, without compromising their programmability, is difficult to achieve. Manual or compiler-assisted code optimization is required to avoid fine-grained accesses. The downside of manually applying code transformations is increased program complexity, which hinders programmer productivity. On the other hand, compiler optimizations of fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs. This thesis presents optimizations addressing the three main challenges of fine-grained communication: (i) low network communication efficiency; (ii) a large number of runtime calls; and (iii) network hotspot creation due to the non-uniform distribution of network communication. To solve these problems, the dissertation presents three approaches. First, it presents an improved inspector-executor transformation that improves network efficiency through runtime aggregation. Second, it presents incremental optimizations to the inspector-executor loop transformation that automatically remove the runtime calls. Finally, it presents a loop scheduling transformation for avoiding network hotspots and the oversubscription of nodes. In contrast to previous work that uses static coalescing, prefetching, limited privatization, and caching, the solutions presented in this thesis cover all aspects of fine-grained communication, including reducing the number of calls generated by the compiler and minimizing the overhead of the inspector-executor optimization. A performance evaluation with various microbenchmarks and benchmarks, aimed at predicting scaling and absolute performance numbers on a Power 775 machine, indicates that applications with regular accesses can achieve up to 180% of the performance of hand-optimized versions, while for applications with irregular accesses the transformations are expected to yield from 1.12X up to 6.3X speedup. The loop scheduling shows performance gains of 3-25% for the NAS FT and bucket-sort benchmarks, and up to 3.4X speedup for the microbenchmarks.
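    The runtime-aggregation idea behind the first approach can be sketched in C with hypothetical runtime entry points (pgas_get and pgas_get_vector are invented names; the thesis targets a real PGAS runtime whose interface differs): instead of one runtime call per remote element, the collected offsets are issued as a single vectored transfer.

        #include <stddef.h>

        /* hypothetical runtime calls, named for illustration */
        extern void pgas_get(void *dst, size_t remote_off, size_t len);
        extern void pgas_get_vector(void *dst, const size_t *offs,
                                    size_t n, size_t elem);

        /* fine-grained: n runtime calls, n small messages */
        void gather_naive(double *local, const size_t *off, size_t n)
        {
            for (size_t i = 0; i < n; i++)
                pgas_get(&local[i], off[i], sizeof(double));
        }

        /* aggregated: one runtime call, one coalesced transfer */
        void gather_aggregated(double *local, const size_t *off, size_t n)
        {
            pgas_get_vector(local, off, n, sizeof(double));
        }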

    COMPILER TECHNIQUES FOR EFFICIENT COMMUNICATIONS IN MULTIPROCESSOR SYSTEMS

    Technical advances have brought circuit switching back to the stage of interconnection network design for high-performance computing. Although circuit switching has long connection-establishment delays, and the dedication of connections prevents other communicating nodes from sharing the network, it has simple control logic and a significant cost advantage over packet or wormhole switching. With proper assistance from compilers, circuit switching has the potential to provide significant performance benefits when connections can be established prior to the actual communication. This dissertation presents a novel compilation framework for achieving efficient communications in circuit-switching interconnection networks. The goal of the framework is to identify communication patterns in Single-Program-Multiple-Data (SPMD) parallel applications and compile these patterns into network configuration directives, which can significantly reduce the communication overhead on circuit-switching interconnection networks. A powerful representation scheme is developed in this research to capture the properties of communication patterns and allow manipulation of these patterns. Based on the temporal and spatial locality of communications and the capability of the compiler to identify the communication patterns, we classify communication patterns into three categories: static, persistent, and dynamic. We target static and persistent communications, which are dominant in most parallel applications. To identify communication patterns, we develop a novel symbolic expression analysis along with supporting compiler techniques. Since the underlying network capacity is limited, we also develop an algorithm to partition the program into phases based on the communication requirements and the network capacity. To demonstrate the effectiveness of our framework, we implement an experimental compiler. The compiler identifies the communication patterns from the source code, partitions the program into phases, and inserts the network configuration directives at phase boundaries to achieve efficient communications. The compiler can also generate communication traces, which provide useful information about the communication patterns correlated with the structure of the source code. We develop a multiprocessor system simulator to evaluate our techniques. Our simulation-based performance analysis demonstrates that our compiler techniques can achieve the same, or even a better, level of communication performance than fast packet-switching networks while using much less expensive circuit switches.
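    The compiler's output can be pictured with a C sketch in which net_configure and send_on_circuit are hypothetical stand-ins for the framework's network configuration directives: circuits for a phase's static pattern are established once at the phase boundary and then reused by every message inside the phase, avoiding per-message connection setup.

        #include <stddef.h>

        struct conn { int src, dst; };   /* one circuit to establish */

        /* hypothetical directives, named for illustration */
        extern void net_configure(const struct conn *c, size_t n);
        extern void send_on_circuit(int dst, const void *buf, size_t len);

        void phase_shift(int me, int np, const double *buf, size_t len)
        {
            /* phase boundary: pin circuits for a static shift pattern */
            struct conn pat = { me, (me + 1) % np };
            net_configure(&pat, 1);

            /* all communication in the phase reuses the pinned circuit */
            for (int step = 0; step < 100; step++)
                send_on_circuit(pat.dst, buf, len);
        }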

    Evaluating the performance of software distributed shared memory as a target for parallelizing compilers

    In this paper we evaluate the use of software distributed shared memory (DSM) on a message passing machine as the target for a parallelizing compiler. We compare this approach to compiler-generated message passing, hand-coded software DSM, and hand-coded message passing. For this comparison, we use six applications: four that are regular and two that are irregular. Our results are gathered on an 8-node IBM SP/2 using the TreadMarks software DSM system. We use the APR shared-memory (SPF) compiler to generate the shared-memory programs and the APR XHPF compiler to generate the message passing programs. The hand-coded message passing programs run with the IBM PVMe optimized message passing library. On the regular programs, both the compiler-generated and the hand-coded message passing outperform the SPF/TreadMarks combination: the compiler-generated message passing by 5.5% to 40%, and the hand-coded message passing by 7.5% to 49%. On the irregular programs, the SPF/TreadMarks combination outperforms the compiler-generated message passing by 38% and 89%, and only slightly underperforms the hand-coded message passing, differing by 4.4% and 16%. We also identify the factors that account for the performance differences, estimate their relative importance, and describe methods to improve the performance.