3,017 research outputs found

    Architecture-Aware Optimization on a 1600-core Graphics Processor

    Get PDF
    The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the top five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated compute node requires deep technical knowledge of the underlying architecture. Although significant literature exists on how to optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on the AMD GPU. Consequently, we present and evaluate architecture-aware optimizations for the AMD GPU. The most prominent optimizations include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU optimizations by applying each optimization in isolation as well as in concert to a large-scale, molecular modeling application called GEM. Via these AMD-specific GPU optimizations, the AMD Radeon HD 5870 GPU delivers 65% better performance than with the wellknown NVIDIA-specific optimizations

    A survey of the state of the art and focused research in range systems, task 2

    Get PDF
    Contract generated publications are compiled which describe the research activities for the reporting period. Study topics include: equivalent configurations of systolic arrays; least squares estimation algorithms with systolic array architectures; modeling and equilization of nonlinear bandlimited satellite channels; and least squares estimation and Kalman filtering by systolic arrays

    Interactive Medical Image Registration With Multigrid Methods and Bounded Biharmonic Functions

    Get PDF
    Interactive image registration is important in some medical applications since automatic image registration is often slow and sometimes error-prone. We consider interactive registration methods that incorporate user-specified local transforms around control handles. The deformation between handles is interpolated by some smooth functions, minimizing some variational energies. Besides smoothness, we expect the impact of a control handle to be local. Therefore we choose bounded biharmonic weight functions to blend local transforms, a cutting-edge technique in computer graphics. However, medical images are usually huge, and this technique takes a lot of time that makes itself impracticable for interactive image registration. To expedite this process, we use a multigrid active set method to solve bounded biharmonic functions (BBF). The multigrid approach is for two scenarios, refining the active set from coarse to fine resolutions, and solving the linear systems constrained by working active sets. We\u27ve implemented both weighted Jacobi method and successive over-relaxation (SOR) in the multigrid solver. Since the problem has box constraints, we cannot directly use regular updates in Jacobi and SOR methods. Instead, we choose a descent step size and clamp the update to satisfy the box constraints. We explore the ways to choose step sizes and discuss their relation to the spectral radii of the iteration matrices. The relaxation factors, which are closely related to step sizes, are estimated by analyzing the eigenvalues of the bilaplacian matrices. We give a proof about the termination of our algorithm and provide some theoretical error bounds. Another minor problem we address is to register big images on GPU with limited memory. We\u27ve implemented an image registration algorithm with virtual image slices on GPU. An image slice is treated similarly to a page in virtual memory. We execute a wavefront of subtasks together to reduce the number of data transfers. Our main contribution is a fast multigrid method for interactive medical image registration that uses bounded biharmonic functions to blend local transforms. We report a novel multigrid approach to refine active set quickly and use clamped updates based on weighted Jacobi and SOR. This multigrid method can be used to efficiently solve other quadratic programs that have active sets distributed over continuous regions

    Supernode Transformation On Parallel Systems With Distributed Memory – An Analytical Approach

    Get PDF
    Supernode transformation, or tiling, is a technique that partitions algorithms to improve data locality and parallelism by balancing computation and inter-processor communication costs to achieve shortest execution or running time. It groups multiple iterations of nested loops into supernodes to be assigned to processors for processing in parallel. A supernode transformation can be described by supernode size and shape. This research focuses on supernode transformation on multi-processor architectures with distributed memory, including computer cluster systems and General Purpose Graphic Processing Units (GPGPUs). The research involves supernode scheduling, supernode mapping to processors, and the finding of the optimal supernode size, for achieving the shortest total running time. The algorithms considered are two nested loops with regular data dependencies. The Longest Common Subsequence problem is used as an illustration. A novel mathematical model for the total running time is established as a function of the supernode size, algorithm parameters such as the problem size and the data dependence, the computation time of each loop iteration, architecture parameters such as the number of processors, and the communication cost. The optimal supernode size is derived from this closed form model. The model and the optimal supernode size provide better results than previous researches and are verified by simulations on multi-processor systems including computer cluster systems and GPGPUs

    Principles for problem aggregation and assignment in medium scale multiprocessors

    Get PDF
    One of the most important issues in parallel processing is the mapping of workload to processors. This paper considers a large class of problems having a high degree of potential fine grained parallelism, and execution requirements that are either not predictable, or are too costly to predict. The main issues in mapping such a problem onto medium scale multiprocessors are those of aggregation and assignment. We study a method of parameterized aggregation that makes few assumptions about the workload. The mapping of aggregate units of work onto processors is uniform, and exploits locality of workload intensity to balance the unknown workload. In general, a finer aggregate granularity leads to a better balance at the price of increased communication/synchronization costs; the aggregation parameters can be adjusted to find a reasonable granularity. The effectiveness of this scheme is demonstrated on three model problems: an adaptive one-dimensional fluid dynamics problem with message passing, a sparse triangular linear system solver on both a shared memory and a message-passing machine, and a two-dimensional time-driven battlefield simulation employing message passing. Using the model problems, the tradeoffs are studied between balanced workload and the communication/synchronization costs. Finally, an analytical model is used to explain why the method balances workload and minimizes the variance in system behavior

    Time-dependent Hamiltonian estimation for Doppler velocimetry of trapped ions

    Full text link
    The time evolution of a closed quantum system is connected to its Hamiltonian through Schroedinger's equation. The ability to estimate the Hamiltonian is critical to our understanding of quantum systems, and allows optimization of control. Though spectroscopic methods allow time-independent Hamiltonians to be recovered, for time-dependent Hamiltonians this task is more challenging. Here, using a single trapped ion, we experimentally demonstrate a method for estimating a time-dependent Hamiltonian of a single qubit. The method involves measuring the time evolution of the qubit in a fixed basis as a function of a time-independent offset term added to the Hamiltonian. In our system the initially unknown Hamiltonian arises from transporting an ion through a static, near-resonant laser beam. Hamiltonian estimation allows us to estimate the spatial dependence of the laser beam intensity and the ion's velocity as a function of time. This work is of direct value in optimizing transport operations and transport-based gates in scalable trapped ion quantum information processing, while the estimation technique is general enough that it can be applied to other quantum systems, aiding the pursuit of high operational fidelities in quantum control.Comment: 10 pages, 8 figure
    corecore