3,168 research outputs found

    MPF: A portable message passing facility for shared memory multiprocessors

    Get PDF
    The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications, linear systems solution, and iterative solution of partial differential equations

    An Efficient Cell List Implementation for Monte Carlo Simulation on GPUs

    Full text link
    Maximizing the performance potential of the modern day GPU architecture requires judicious utilization of available parallel resources. Although dramatic reductions can often be obtained through straightforward mappings, further performance improvements often require algorithmic redesigns to more closely exploit the target architecture. In this paper, we focus on efficient molecular simulations for the GPU and propose a novel cell list algorithm that better utilizes its parallel resources. Our goal is an efficient GPU implementation of large-scale Monte Carlo simulations for the grand canonical ensemble. This is a particularly challenging application because there is inherently less computation and parallelism than in similar applications with molecular dynamics. Consistent with the results of prior researchers, our simulation results show traditional cell list implementations for Monte Carlo simulations of molecular systems offer effectively no performance improvement for small systems [5, 14], even when porting to the GPU. However for larger systems, the cell list implementation offers significant gains in performance. Furthermore, our novel cell list approach results in better performance for all problem sizes when compared with other GPU implementations with or without cell lists.Comment: 30 page

    Lock-free Concurrent Data Structures

    Full text link
    Concurrent data structures are the data sharing side of parallel programming. Data structures give the means to the program to store data, but also provide operations to the program to access and manipulate these data. These operations are implemented through algorithms that have to be efficient. In the sequential setting, data structures are crucially important for the performance of the respective computation. In the parallel programming setting, their importance becomes more crucial because of the increased use of data and resource sharing for utilizing parallelism. The first and main goal of this chapter is to provide a sufficient background and intuition to help the interested reader to navigate in the complex research area of lock-free data structures. The second goal is to offer the programmer familiarity to the subject that will allow her to use truly concurrent methods.Comment: To appear in "Programming Multi-core and Many-core Computing Systems", eds. S. Pllana and F. Xhafa, Wiley Series on Parallel and Distributed Computin

    Harvesting graphics power for MD simulations

    Get PDF
    We discuss an implementation of molecular dynamics (MD) simulations on a graphic processing unit (GPU) in the NVIDIA CUDA language. We tested our code on a modern GPU, the NVIDIA GeForce 8800 GTX. Results for two MD algorithms suitable for short-ranged and long-ranged interactions, and a congruential shift random number generator are presented. The performance of the GPU's is compared to their main processor counterpart. We achieve speedups of up to 80, 40 and 150 fold, respectively. With newest generation of GPU's one can run standard MD simulations at 10^7 flops/$.Comment: 12 pages, 5 figures. Submitted to Mol. Si

    Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning

    Get PDF
    Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods

    High Performance Direct Gravitational N-body Simulations on Graphics Processing Units -- II: An implementation in CUDA

    Get PDF
    We present the results of gravitational direct NN-body simulations using the Graphics Processing Unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the NN-body problem is implemented in ``Compute Unified Device Architecture'' (CUDA) using the GPU to speed-up the calculations. We tested the implementation on three different NN-body codes: two direct NN-body integration codes, using the 4th order predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd order leapfrog integration scheme. The integration of the equations of motions for all codes is performed on the host CPU. We find that for N>512N > 512 particles the GPU outperforms the GRAPE-6Af, if some softening in the force calculation is accepted. Without softening and for very small integration time steps the GRAPE still outperforms the GPU. We conclude that modern GPUs offer an attractive alternative to GRAPE-6Af special purpose hardware. Using the same time-step criterion, the total energy of the NN-body system was conserved better than to one in 10610^6 on the GPU, only about an order of magnitude worse than obtained with GRAPE-6Af. For N \apgt 10^5 the 8800GTX outperforms the host CPU by a factor of about 100 and runs at about the same speed as the GRAPE-6Af.Comment: Accepted for publication in New Astronom

    Parallel machine architecture and compiler design facilities

    Get PDF
    The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of Delta project (which objective is to provide a facility to allow rapid prototyping of parallelized compilers that can target toward different machine architectures) is summarized. Included are the surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role

    Multithreading Aware Hardware Prefetching for Chip Multiprocessors

    Get PDF
    To take advantage of the processing power in the Chip Multiprocessors design, applications must be divided into semi-independent processes that can run concur- rently on multiple cores within a system. Therefore, programmers must insert thread synchronization semantics (i.e. locks, barriers, and condition variables) to synchro- nize data access between processes. Indeed, threads spend long time waiting to acquire the lock of a critical section. In addition, a processor has to stall execution to wait for load data accesses to complete. Furthermore, there are often independent instructions which include load instructions beyond synchronization semantics that could be executed in parallel while a thread waits on the synchronization semantics. The conveniences of the cache memories come with some extra cost in Chip Multiprocessors. Cache Coherence mechanisms address the Memory Consistency problem. However, Cache Coherence adds considerable overhead to memory accesses. Having aggressive prefetcher on different cores of a Chip Multiprocessor can definitely lead to significant system performance degradation when running multi-threaded applications. This result of prefetch-demand interference when a prefetcher in one core ends up pulling shared data from a producing core before it has been written, the cache block will end up transitioning back and forth between the cores and result in useless prefetch, saturating the memory bandwidth and substantially increase the latency to critical shared data. We present a hardware prefetcher that enables large performance improvements from prefetching in Chip Multiprocessors by significantly reducing prefetch-demand interference. Furthermore, it will utilize the time that a thread spends waiting on syn- chronization semantics to run ahead of the critical section to speculate and prefetch independent load instruction data beyond the synchronization semantics

    Performance Debugging and Tuning using an Instruction-Set Simulator

    Get PDF
    Instruction-set simulators allow programmers a detailed level of insight into, and control over, the execution of a program, including parallel programs and operating systems. In principle, instruction set simulation can model any target computer and gather any statistic. Furthermore, such simulators are usually portable, independent of compiler tools, and deterministic-allowing bugs to be recreated or measurements repeated. Though often viewed as being too slow for use as a general programming tool, in the last several years their performance has improved considerably. We describe SIMICS, an instruction set simulator of SPARC-based multiprocessors developed at SICS, in its rôle as a general programming tool. We discuss some of the benefits of using a tool such as SIMICS to support various tasks in software engineering, including debugging, testing, analysis, and performance tuning. We present in some detail two test cases, where we've used SimICS to support analysis and performance tuning of two applications, Penny and EQNTOTT. This work resulted in improved parallelism in, and understanding of, Penny, as well as a performance improvement for EQNTOTT of over a magnitude. We also present some early work on analyzing SPARC/Linux, demonstrating the ability of tools like SimICS to analyze operating systems
    • …
    corecore