214 research outputs found

    A GPU-enabled solver for time-constrained linear sum assignment problems

    This paper deals with solving large instances of Linear Sum Assignment Problems (LSAPs) under real-time constraints, using Graphics Processing Units (GPUs). The motivating scenario is an industrial application for P2P live streaming that is moderated by a central tracker that periodically solves LSAP instances to optimize the connectivity of thousands of peers. However, our findings are generic enough to be applied in other contexts. Our main contribution is a parallel version of a heuristic algorithm called Deep Greedy Switching (DGS) on GPUs using the CUDA programming language. DGS sacrifices absolute optimality in favor of a substantial speedup in comparison to classical LSAP solvers like the Hungarian and auction methods. We show the modifications needed to parallelize the DGS algorithm and the performance gains of our approach compared to a sequential CPU-based implementation of DGS and a mixed CPU/GPU-based implementation of it.

    Comparison of Different Parallel Implementations of the 2+1-Dimensional KPZ Model and the 3-Dimensional KMC Model

    We show that efficient simulations of the Kardar-Parisi-Zhang interface growth in 2 + 1 dimensions and of the 3-dimensional Kinetic Monte Carlo of thermally activated diffusion can be realized both on GPUs and modern CPUs. In this article we present results of different implementations on GPUs using CUDA and OpenCL and also on CPUs using OpenCL and MPI. We investigate the runtime and scaling behavior on different architectures to find optimal solutions for solving current simulation problems in the field of statistical physics and materials science. Comment: 14 pages, 8 figures, to be published in a forthcoming EPJST special issue on "Computer simulations on GPU

    Simulation of 1+1 dimensional surface growth and lattices gases using GPUs

    Restricted solid-on-solid surface growth models can be mapped onto binary lattice gases. We show that efficient simulation algorithms can be realized on GPUs either by CUDA or by OpenCL programming. We consider a deposition/evaporation model following Kardar-Parisi-Zhang growth in 1+1 dimensions related to the Asymmetric Simple Exclusion Process and show that, for sizes that fit into the shared memory of GPUs, one can achieve the maximum parallelization speedup (~100x for a Quadro FX 5800 graphics card with respect to a single CPU of 2.67 GHz). This permits us to study the effect of quenched columnar disorder, requiring extremely long simulation times. We compare the CUDA realization with an OpenCL implementation designed for processor clusters via MPI. A two-lane traffic model with randomized turning points is also realized and the dynamical behavior has been investigated. Comment: 20 pages, 12 figures, 1 table, to appear in Comp. Phys. Com

    Activity recognition from videos with parallel hypergraph matching on GPUs

    In this paper, we propose a method for activity recognition from videos based on sparse local features and hypergraph matching. We benefit from special properties of the temporal domain in the data to derive a sequential and fast graph matching algorithm for GPUs. Traditionally, graphs and hypergraphs are frequently used to recognize complex and often non-rigid patterns in computer vision, either through graph matching or point-set matching with graphs. Most formulations resort to the minimization of a difficult discrete energy function mixing geometric or structural terms with data-attached terms involving appearance features. Traditional methods solve this minimization problem approximately, for instance with spectral techniques. In this work, instead of solving the problem approximately, the exact solution for the optimal assignment is calculated in parallel on GPUs. The graphical structure is simplified and regularized, which allows us to derive an efficient recursive minimization algorithm. The algorithm distributes subproblems over the calculation units of a GPU, which solves them in parallel, allowing the system to run faster than real-time on medium-end GPUs.

    Analysis of A Splitting Approach for the Parallel Solution of Linear Systems on GPU Cards

    We discuss an approach for solving sparse or dense banded linear systems $\mathbf{A}\mathbf{x} = \mathbf{b}$ on a Graphics Processing Unit (GPU) card. The matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ is possibly nonsymmetric and moderately large; i.e., $10000 \leq N \leq 500000$. The split and parallelize (SaP) approach seeks to partition the matrix $\mathbf{A}$ into diagonal sub-blocks $\mathbf{A}_i$, $i = 1, \ldots, P$, which are independently factored in parallel. The solution may choose to consider or to ignore the matrices that couple the diagonal sub-blocks $\mathbf{A}_i$. This approach, along with the Krylov subspace-based iterative method that it preconditions, are implemented in a solver called SaP::GPU, which is compared in terms of efficiency with three commonly used sparse direct solvers: PARDISO, SuperLU, and MUMPS. SaP::GPU, which runs entirely on the GPU except several stages involved in preliminary row-column permutations, is robust and compares well in terms of efficiency with the aforementioned direct solvers. In a comparison against Intel's MKL, SaP::GPU also fares well when used to solve dense banded systems that are close to being diagonally dominant. SaP::GPU is publicly available and distributed as open source under a permissive BSD3 license. Comment: 38 page

    Real-time multitarget tracking for sensor-based sorting – A new implementation of the auction algorithm for graphics processing units

    Utilizing parallel algorithms is an established way of increasing performance in systems that are bound to real-time restrictions. Sensor-based sorting is a machine vision application for which firm real-time requirements need to be respected in order to reliably remove potentially harmful entities from a material feed. Recently, employing a predictive tracking approach using multitarget tracking in order to decrease the error in the physical separation in optical sorting has been proposed. For implementations that use hard associations between measurements and tracks, a linear assignment problem has to be solved for each frame recorded by a camera. The auction algorithm can be utilized for this purpose, which also has the advantage of being well suited for parallel architectures. In this paper, an improved implementation of this algorithm for a graphics processing unit (GPU) is presented. The resulting algorithm is implemented in both an OpenCL and a CUDA based environment. By using an optimized data structure, the presented algorithm outperforms recently proposed implementations in terms of speed while retaining the quality of output of the algorithm. Furthermore, memory requirements are significantly decreased, which is important for embedded systems. Experimental results are provided for two different GPUs and six datasets. It is shown that the proposed approach is of particular interest for applications dealing with comparatively large problem sizes

    Real-time people tracking in a camera network

    Visual tracking is a fundamental key to the recognition and analysis of human behaviour. In this thesis we present an approach to track several subjects using multiple cameras in real time. The tracking framework employs a numerical Bayesian estimator, also known as a particle filter, which has been developed for parallel implementation on a Graphics Processing Unit (GPU). In order to integrate multiple cameras into a single tracking unit we represent the human body by a parametric ellipsoid in a 3D world. The elliptical boundary can be projected rapidly, several hundred times per subject per frame, onto any image for comparison with the image data within a likelihood model. Adding variables to encode visibility and persistence into the state vector, we tackle the problems of distraction and short-period occlusion. However, subjects may also disappear for longer periods due to blind spots between cameras' fields of view. To recognise a desired subject after such a long period, we add coloured texture to the ellipsoid surface, which is learnt and retained during the tracking process. This texture signature improves the recall rate from 60% to 70-80% when compared to state-only data association. Compared to a standard Central Processing Unit (CPU) implementation, there is a significant speed-up ratio