7 research outputs found

    A GPU-enabled solver for time-constrained linear sum assignment problems

    Get PDF
    This paper deals with solving large instances of the Linear Sum Assignment Problems (LSAPs) under realtime constraints, using Graphical Processing Units (GPUs). The motivating scenario is an industrial application for P2P live streaming that is moderated by a central tracker that is periodically solving LSAP instances to optimize the connectivity of thousands of peers. However, our findings are generic enough to be applied in other contexts. Our main contribution is a parallel version of a heuristic algorithm called Deep Greedy Switching (DGS) on GPUs using the CUDA programming language. DGS sacrifices absolute optimality in favor of a substantial speedup in comparison to classical LSAP solvers like the Hungarian and auctioning methods. We show the modifications needed to parallelize the DGS algorithm and the performance gains of our approach compared to a sequential CPU-based implementation of DGS and a mixed CPU/GPU-based implementation of it

    Design of experiment using simulation of a discrete dynamical system

    Get PDF
    The topic of the presented paper is a promising approach to achieve optimal Design of Experi- ment (DoE), i.e. spreading of points within a design domain, using a simulation of a discrete dynamical system of interacting particles within an n-dimensional design space. The system of mutually repelling particles represents a physical analogy of the Audze-Eglājs (AE) optimization criterion and its period- ical modification (PAE), respectively. The paper compares the performance of two approaches to im- plementation: a single-thread process using the JAVA language environment and a massively parallel solution employing the nVidia CUDA platform

    Ray traced rendering using GPGPU devices

    Get PDF

    GPUfs: Integrating a File System with GPUs

    Get PDF
    As GPU hardware becomes increasingly general-purpose, it is quickly outgrowing the traditional, constrained GPU-as-coprocessor programming model. To make GPUs easier to program and improve their integration with operating systems, we propose making the host’s file system directly accessible to GPU code. GPUfs provides a POSIX-like API for GPU programs, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the host CPU’s buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adapted to use our file system, demonstrate the feasibility and benefits of the GPUfs approach. For example, a self-contained GPU program that searches for a set of strings throughout the Linux kernel source tree runs over seven times faster than on an eight-core CPU

    View-Dependent Visualization for Analysis of Large Datasets

    Get PDF
    Due to the impressive capabilities of human visual processing, interactive visualization methods have become essential tools for scientists to explore and analyze large, complex datasets. However, traditional approaches do not account for the increased size or latency of data retrieval when interacting with these often remote datasets. In this dissertation, I discuss two novel design paradigms, based on accepted models of the information visualization process and graphics hardware pipeline, that are appropriate for interactive visualization of large remote datasets. In particular, I discuss novel solutions aimed at improving the performance of interactive visualization systems when working with large numeric datasets and large terrain (elevation and imagery) datasets by using data reduction and asynchronous retrieval of view-prioritized data, respectively. First I present a modified version of the standard information visualization model that accounts for the challenges presented by interacting with large, remote datasets. I also provide the details of a software framework implemented using this model and discuss several different visualization applications developed within this framework. Next I present a novel technique for leveraging the hardware graphics pipeline to provide asynchronous, view-prioritized data retrieval to support interactive visualization of remote terrain data. I provide the results of statistical analysis of performance metrics to demonstrate the effectiveness of this approach. Finally I present the details of two novel visualization techniques, and the results of evaluating these systems using controlled user studies and expert evaluation. The results of these qualitative and quantitative evaluation mechanisms demonstrate improved visual analysis task performance for large numeric datasets

    Parallel algorithms and efficient implementation techniques for finite element approximations

    Get PDF
    In this thesis we study the efficient implementation of the finite element method for the numerical solution of partial differential equations (PDE) on modern parallel computer archi- tectures, such as Cray and IBM supercomputers. The domain-decomposition (DD) method represents the basis of parallel finite element software and is generally implemented such that the number of subdomains is equal to the number of MPI processes. We are interested in breaking this paradigm by introducing a second level of parallelism. Each subdomain is assigned to more than one processor and either MPI processes or multiple threads are used to implement the parallelism on the second level. The thesis is devoted to the study of this second level of parallelism and includes the stages described below. The algebraic additive Schwarz (AAS) domain-decomposition preconditioner is an integral part of the solution process. We seek to understand its performance on the parallel computers which we target and we introduce an improved construction approach for the parallel precon- ditioner. We examine a novel strategy for solving the AAS subdomain problems, using multiple MPI processes. At the subdomain level, this is represented by the ShyLU preconditioner. We bring improvements to its algorithm in the form of a novel inexact solver based on an incomplete QR (IQR) factorization. The performance of the new preconditioner framework is studied for Laplacian and advection-diffusion-reaction (ADR) problems and for Navier-Stokes problems, as a component within a larger framework of specialized preconditioners. The partitioning of the computational mesh comes with considerable memory limitations, when done at runtime on parallel computers, due to the low amount of available memory per processor. We describe and implement a solution to this problem, based on offloading the partitioning process to a preliminary offline stage of the simulation process. We also present the efficient implementation, based on parallel MPI collective instructions, of the routines which load the mesh parts during the simulation. We discuss an alternative parallel implementation of the finite element system assembly based on multi-threading. This new approach is used to supplement the existing one based on MPI parallelism, in situations where MPI alone can not make use of all the available parallel hardware resources. The work presented in the thesis has been done in the framework of two software projects: the Trilinos project and the LifeV parallel finite element modeling library. All the new develop- ments have been contributed back to the respective projects, to be used freely in subsequent public releases of the software
    corecore