4,406 research outputs found

    Accelerating Monte Carlo simulations with an NVIDIA® graphics processor

    Get PDF
    Modern graphics cards, commonly used in desktop computers, have evolved beyond a simple interface between processor and display to incorporate sophisticated calculation engines that can be applied to general purpose computing. The Monte Carlo algorithm for modelling photon transport in turbid media has been implemented on an NVIDIA® 8800gt graphics card using the CUDA toolkit. The Monte Carlo method relies on following the trajectory of millions of photons through the sample, often taking hours or days to complete. The graphics-processor implementation, processing roughly 110 million scattering events per second, was found to run more than 70 times faster than a similar, single-threaded implementation on a 2.67 GHz desktop computer

    MPF: A portable message passing facility for shared memory multiprocessors

    Get PDF
    The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications, linear systems solution, and iterative solution of partial differential equations

    A message passing kernel for the hypercluster parallel processing test bed

    Get PDF
    A Message-Passing Kernel (MPK) for the Hypercluster parallel-processing test bed is described. The Hypercluster is being developed at the NASA Lewis Research Center to support investigations of parallel algorithms and architectures for computational fluid and structural mechanics applications. The Hypercluster resembles the hypercube architecture except that each node consists of multiple processors communicating through shared memory. The MPK efficiently routes information through the Hypercluster, using a message-passing protocol when necessary and faster shared-memory communication whenever possible. The MPK also interfaces all of the processors with the Hypercluster operating system (HYCLOPS), which runs on a Front-End Processor (FEP). This approach distributes many of the I/O tasks to the Hypercluster processors and eliminates the need for a separate I/O support program on the FEP

    CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

    Get PDF
    Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP’s best features, its flexibility: typical OpenMP applications can run on any number of CPUs. GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup to four GPUs over only using CPUs or one GPU while increasing the overall flexibility of Accelerated OpenMP

    Simulation models of shared-memory multiprocessor systems

    Get PDF

    Fast processing of grid maps using graphical multiprocessors

    Get PDF
    Grid mapping is a very common technique used in mobile robotics to build a continuous 2D representation of the environment useful for navigation purposes. Although its computation is quite simple and fast, this algorithm uses the hypothesis of a known robot pose. In practice, this can require the re-computation of the map when the estimated robot poses change, as when a loop closure is detected. This paper presents a parallelization of a reference implementation of the grid mapping algorithm, which is suitable to be fully run on a graphics card showing huge processing speedups (up to 50×) while fully releasing the main processor, which can be very useful for many Simultaneous Localization and Mapping algorithms

    Real-time optical micro-manipulation using optimized holograms generated on the GPU

    Full text link
    Holographic optical tweezers allow the three dimensional, dynamic, multipoint manipulation of micron sized dielectric objects. Exploiting the massive parallel architecture of modern GPUs we can generate highly optimized holograms at video frame rate allowing the interactive micro-manipulation of 3D structures.Comment: 13 pages, 8 figure

    GPU acceleration of brain image proccessing

    Get PDF
    Durante los últimos años se ha venido demostrando el alto poder computacional que ofrecen las GPUs a la hora de resolver determinados problemas. Al mismo tiempo, existen campos en los que no es posible beneficiarse completamente de las mejoras conseguidas por los investigadores, debido principalmente a que los tiempos de ejecución de las aplicaciones llegan a ser extremadamente largos. Este es por ejemplo el caso del registro de imágenes en medicina. A pesar de que se han conseguido aceleraciones sobre el registro de imágenes, su uso en la práctica clínica es aún limitado. Entre otras cosas, esto se debe al rendimiento conseguido. Por lo tanto se plantea como objetivo de este proyecto, conseguir mejorar los tiempos de ejecución de una aplicación dedicada al resgitro de imágenes en medicina, con el fin de ayudar a aliviar este problema

    The force on the flex: Global parallelism and portability

    Get PDF
    A parallel programming methodology, called the force, supports the construction of programs to be executed in parallel by an unspecified, but potentially large, number of processes. The methodology was originally developed on a pipelined, shared memory multiprocessor, the Denelcor HEP, and embodies the primitive operations of the force in a set of macros which expand into multiprocessor Fortran code. A small set of primitives is sufficient to write large parallel programs, and the system has been used to produce 10,000 line programs in computational fluid dynamics. The level of complexity of the force primitives is intermediate. It is high enough to mask detailed architectural differences between multiprocessors but low enough to give the user control over performance. The system is being ported to a medium scale multiprocessor, the Flex/32, which is a 20 processor system with a mixture of shared and local memory. Memory organization and the type of processor synchronization supported by the hardware on the two machines lead to some differences in efficient implementations of the force primitives, but the user interface remains the same. An initial implementation was done by retargeting the macros to Flexible Computer Corporation's ConCurrent C language. Subsequently, the macros were caused to directly produce the system calls which form the basis for ConCurrent C. The implementation of the Fortran based system is in step with Flexible Computer Corporations's implementation of a Fortran system in the parallel environment
    corecore