
    A virtualized software based on the NVIDIA cuFFT library for image denoising: performance analysis

    Generic Virtualization Service (GVirtuS) is a recent solution for enabling GPGPU computing on virtual machines and low-powered devices. This paper focuses on the performance that can be obtained using GPGPU-virtualized software. GVirtuS has recently been extended to support CUDA ancillary libraries with good results. Here, our aim is to analyze the applicability of this powerful tool to a real problem that uses the NVIDIA cuFFT library. As a case study we consider a simple denoising algorithm, implementing virtualized GPU-parallel software based on the convolution theorem in order to perform the noise-removal procedure in the frequency domain. We report some preliminary tests in both physical and virtualized environments to study and analyze the potential scalability of such an algorithm.
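    As an illustration of the frequency-domain approach described above, the following CUDA sketch applies a Gaussian low-pass filter between a forward and an inverse cuFFT transform. It is a minimal example of the convolution-theorem idea, not the paper's virtualized implementation; the kernel names and filter choice are assumptions made for the sketch.

    #include <cuda_runtime.h>
    #include <cufft.h>

    // Attenuate high spatial frequencies with a Gaussian low-pass filter.
    __global__ void gaussianLowPass(cufftComplex* freq, int nx, int ny, float sigma) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // column (0..nx-1)
        int j = blockIdx.y * blockDim.y + threadIdx.y;   // row    (0..ny-1)
        if (i >= nx || j >= ny) return;
        // The DC term sits at (0,0); wrap indices to signed frequencies.
        float fx = (i <= nx / 2) ? i : i - nx;
        float fy = (j <= ny / 2) ? j : j - ny;
        float h = expf(-(fx * fx + fy * fy) / (2.0f * sigma * sigma));
        freq[j * nx + i].x *= h;
        freq[j * nx + i].y *= h;
    }

    void denoise(cufftComplex* d_img, int nx, int ny, float sigma) {
        cufftHandle plan;
        cufftPlan2d(&plan, ny, nx, CUFFT_C2C);            // (rows, cols) for row-major data
        cufftExecC2C(plan, d_img, d_img, CUFFT_FORWARD);  // image -> frequency domain
        dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
        gaussianLowPass<<<grid, block>>>(d_img, nx, ny, sigma);
        cufftExecC2C(plan, d_img, d_img, CUFFT_INVERSE);  // back to the spatial domain
        // cuFFT inverse transforms are unnormalized: scale the result by 1/(nx*ny).
        cufftDestroy(plan);
    }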

    Increasing the performance of the Wetland DEM Ponding Model using multiple GPUs

    Due to the lack of conventional drainage systems on the Canadian Prairies, when excess water runs off the landscape after snowmelt or heavy rainfall, it may be trapped in surface depressions ranging in size from puddles to permanent wetlands and may cause local flooding. Hydrological processes play an important role in the Canadian Prairies, and hydrological simulation models help researchers understand past hydrological events and predict future ones. Obtaining accurate simulations requires higher-resolution grids and larger simulation areas, which leads to larger-scale problems. However, the size of the problem is often limited by the available computational resources, and solving large systems results in unacceptable simulation durations. Improving computational efficiency and taking full advantage of available computational resources is therefore an urgent task for hydrological researchers and software developers. The Wetland DEM Ponding Model (WDPM) was developed to model the distribution of runoff water on the Canadian Prairies. It helps determine the fraction of Prairie basins contributing flow to streams as this fraction changes dynamically with water storage in the depressions. In the WDPM, the water redistribution module is the most computationally intensive part. Previously, the WDPM was parallelized to run on a single CPU or a single GPU, making the water redistribution module more efficient. Multi-device parallel computing is a common way to increase the available computational resources and can effectively speed up an application given an appropriate parallel algorithm. This thesis develops a multiple-GPU parallel algorithm and investigates efficient data-transmission methods compared with the CPU-parallel and single-GPU algorithms. A technique that overlaps communication with computation is applied to optimize the parallel computing process. The thesis then evaluates the new implementation from several aspects. First, the output summary and the output system are compared between the new implementation and the original one; the solution converges as the simulation proceeds, verifying that the new implementation produces correct results. Second, the multiple-GPU code is profiled, verifying that the algorithm can be reorganized to take advantage of multiple GPUs and to carry out efficient data synchronization through the optimized techniques. Finally, numerical experiments show that the new implementation improves performance when using multiple GPUs and demonstrates good scaling; in particular, on a large system the multiple-GPU implementation produces correct output and achieves roughly a 2.35x performance improvement with four GPUs compared with one GPU.
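    The overlap of communication with computation mentioned above follows a standard multi-GPU halo-exchange pattern. The sketch below shows one such step for a slab decomposition of the grid, with hypothetical names (redistributeRows, stepSlab) and a placeholder update rule rather than the actual WDPM code: each GPU exchanges its boundary rows on a communication stream while its interior rows are updated on a compute stream.

    #include <cuda_runtime.h>

    // Placeholder for the water-redistribution update over rows [rowBegin, rowEnd).
    // Row 0 and row slabRows+1 of each slab are halo copies of the neighbours.
    __global__ void redistributeRows(float* water, int width, int rowBegin, int rowEnd) {
        int row = rowBegin + blockIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= rowEnd || col >= width) return;
        int c = row * width + col;
        // Stand-in update: relax each cell toward its vertical neighbours.
        water[c] = 0.5f * water[c] + 0.25f * (water[c - width] + water[c + width]);
    }

    void stepSlab(int dev, int nDevs, float** d_water, int width, int slabRows,
                  cudaStream_t compute, cudaStream_t comm) {
        cudaSetDevice(dev);
        size_t rowBytes = (size_t)width * sizeof(float);
        // 1. Start the halo exchange with the neighbouring devices
        //    (peer access enabled beforehand with cudaDeviceEnablePeerAccess).
        if (dev > 0)
            cudaMemcpyPeerAsync(d_water[dev], dev,
                                d_water[dev - 1] + (size_t)slabRows * width, dev - 1,
                                rowBytes, comm);
        if (dev < nDevs - 1)
            cudaMemcpyPeerAsync(d_water[dev] + (size_t)(slabRows + 1) * width, dev,
                                d_water[dev + 1] + width, dev + 1,
                                rowBytes, comm);
        // 2. Meanwhile, update the interior rows that do not touch the halos.
        dim3 block(256), interior((width + 255) / 256, slabRows - 2);
        redistributeRows<<<interior, block, 0, compute>>>(d_water[dev], width, 2, slabRows);
        // 3. When the halos have arrived, update the two boundary rows.
        cudaStreamSynchronize(comm);
        dim3 edge((width + 255) / 256, 1);
        redistributeRows<<<edge, block, 0, compute>>>(d_water[dev], width, 1, 2);
        redistributeRows<<<edge, block, 0, compute>>>(d_water[dev], width, slabRows, slabRows + 1);
    }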

    GPU Parallel Computation of Morse-Smale Complexes

    The Morse-Smale complex is a well-studied topological structure that represents the gradient-flow behavior of a scalar function. It supports multi-scale topological analysis and visualization of large scientific data. Its computation poses significant algorithmic challenges when considering large-scale data and increased feature complexity. Several parallel algorithms have been proposed for the fast computation of the 3D Morse-Smale complex, but the non-trivial structure of the saddle-saddle connections is not amenable to parallel computation. This paper describes a fine-grained parallel method for computing the Morse-Smale complex that is implemented on a GPU. The saddle-saddle reachability is first determined via a transformation into a sequence of vector operations, followed by the path traversal, which is achieved via a sequence of matrix operations. Computational experiments show that the method achieves up to 7x speedup over current shared-memory implementations.
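    The paper's reformulation is specific to the Morse-Smale complex, but the general idea of expressing reachability through vector and matrix operations can be illustrated with a simple frontier expansion over a CSR adjacency structure: each step acts like a Boolean matrix-vector product that activates the neighbours of the current frontier. The kernel below is a generic sketch under that assumption, not the paper's algorithm.

    #include <cuda_runtime.h>

    // One frontier-expansion step over a graph stored in CSR form: every node that
    // is active in `frontier` marks its unvisited neighbours in `nextFrontier`.
    // Iterating until the frontier is empty yields the set of reachable nodes.
    __global__ void expandFrontier(const int* rowPtr, const int* colIdx,
                                   const int* frontier, int* nextFrontier,
                                   int* visited, int numNodes) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= numNodes || !frontier[v]) return;
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            if (atomicExch(&visited[u], 1) == 0)   // first visit wins
                nextFrontier[u] = 1;
        }
    }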

    Anomalous metastability in a temperature-driven transition

    Langer's theory of metastability describes the lifetime and properties of the metastable phase in the field-driven transition of the Ising model, which models the magnetic-field-driven transition in ferromagnets and the chemical-potential-driven transition in fluids. A natural next step is to apply it to a transition driven by temperature, such as the one undergone by the two-dimensional Potts model. For this model, a study based on the analytical continuation of the free energy (Meunier, Morel 2000) predicts the anomalous vanishing of the metastable temperature range in the limit of large system size, an issue that has been controversial since the eighties. With a parallel GPU algorithm we compare the Monte Carlo dynamics with the theory, obtaining agreement and characterizing the anomalous system-size dependence. We discuss the microscopic origin of these metastable phenomena, which is essentially different from the Ising case. Comment: 5 pages, 3 figures
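    As an indication of what a parallel-GPU algorithm for such dynamics typically looks like, the kernel below performs one checkerboard Metropolis half-sweep of the 2-D q-state Potts model. It is a generic sketch with assumed names and parameters, not the simulation code used in the paper, and it omits all measurement machinery.

    #include <curand_kernel.h>

    // One Metropolis half-sweep: only sites of one checkerboard parity are updated,
    // so all updates in the kernel are independent. rng must have been seeded per
    // site with curand_init beforehand. Energy convention: E = -J * sum delta(s_i, s_j), J = 1.
    __global__ void pottsSweep(int* spin, int L, int q, float beta, int parity,
                               curandState* rng) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= L || y >= L || ((x + y) & 1) != parity) return;
        int i = y * L + x;
        int s = spin[i];
        int sNew = (s + 1 + curand(&rng[i]) % (q - 1)) % q;   // a state different from s
        int nb[4] = { spin[y * L + (x + 1) % L],   spin[y * L + (x + L - 1) % L],
                      spin[((y + 1) % L) * L + x], spin[((y + L - 1) % L) * L + x] };
        int dE = 0;                                           // dE = E_new - E_old
        for (int k = 0; k < 4; ++k) dE += (nb[k] == s) - (nb[k] == sNew);
        if (dE <= 0 || curand_uniform(&rng[i]) < expf(-beta * dE))
            spin[i] = sNew;
    }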

    Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

    The active appearance model (AAM) is one of the most powerful model-based object detection and tracking methods and has been widely used in various situations. However, its high-dimensional texture representation leads to very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs), which feature a many-core, fine-grained parallel architecture, provides new and promising solutions to this computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design is based on fine-grained parallelism: we distribute the texture data of the AAM, pixel by pixel, to thousands of parallel GPU threads, which makes the algorithm fit the GPU architecture well. We implement our algorithm using the Compute Unified Device Architecture (CUDA) on an NVIDIA GTX 650 GPU with the Kepler architecture. To compare the performance of our algorithm for different data sizes, we built sixteen face AAM models with textures of different dimensionality. The experimental results show that our parallel AAM fitting algorithm achieves real-time performance on video even with very high-dimensional textures.
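    The fine-grained, one-thread-per-texture-pixel scheme described above can be sketched as follows: each thread samples the input image at the warped location of its texture pixel and writes the residual against the synthesized model texture. Names such as warpX, warpY, and textureResidual are illustrative, not taken from the paper.

    // One thread per texture pixel: bilinearly sample the image at the warped
    // coordinate of pixel p and store the fitting error against the model texture.
    __global__ void textureResidual(const float* image, int imgW, int imgH,
                                    const float* modelTexture,
                                    const float* warpX, const float* warpY,
                                    float* residual, int numTexPixels) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= numTexPixels) return;
        float x = warpX[p], y = warpY[p];
        int x0 = (int)x, y0 = (int)y;
        if (x0 < 0 || y0 < 0 || x0 + 1 >= imgW || y0 + 1 >= imgH) { residual[p] = 0.0f; return; }
        float fx = x - x0, fy = y - y0;
        float v = (1 - fx) * (1 - fy) * image[y0 * imgW + x0]
                +      fx  * (1 - fy) * image[y0 * imgW + x0 + 1]
                + (1 - fx) *      fy  * image[(y0 + 1) * imgW + x0]
                +      fx  *      fy  * image[(y0 + 1) * imgW + x0 + 1];
        residual[p] = v - modelTexture[p];
    }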

    Parallel Motion Simulation of Large-Scale Real-Time Crowd in a Hierarchical Environmental Model

    This paper presents a parallel real-time crowd simulation method based on a hierarchical environmental model. A dynamical model of the complex environment is constructed to simulate the state transition and propagation of individual motions. By modeling the virtual environment in which the crowds reside, we employ different parallel methods on a topological layer, a path layer, and a perceptual layer. We propose a parallel motion-path matching method based on the path layer and a parallel crowd simulation method based on the perceptual layer. Large-scale real-time crowd simulation becomes possible with these methods. Numerical experiments are carried out to demonstrate the methods and results.
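    A per-agent parallel update of the kind typically used at the perceptual layer can be sketched as below: one thread advances one agent toward its current waypoint while steering away from nearby agents. The Agent structure and the brute-force neighbour loop are assumptions made for the sketch; the paper's hierarchical environment model would replace that loop with a spatial query.

    struct Agent { float x, y, goalX, goalY; };

    // One thread per agent: seek the current waypoint, repel from close neighbours,
    // and write the new position to a separate output array to avoid read/write races.
    __global__ void stepAgents(const Agent* in, Agent* out, int n,
                               float dt, float speed, float radius) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        Agent a = in[i];
        float dx = a.goalX - a.x, dy = a.goalY - a.y;
        float d = sqrtf(dx * dx + dy * dy) + 1e-6f;
        float vx = speed * dx / d, vy = speed * dy / d;      // steer toward the waypoint
        for (int j = 0; j < n; ++j) {                        // brute-force "perception"
            if (j == i) continue;
            float rx = a.x - in[j].x, ry = a.y - in[j].y;
            float r = sqrtf(rx * rx + ry * ry) + 1e-6f;
            if (r < radius) { vx += speed * rx / r; vy += speed * ry / r; }
        }
        out[i] = a;
        out[i].x = a.x + vx * dt;
        out[i].y = a.y + vy * dt;
    }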

    GPU Acceleration of a Non-Standard Finite Element Mesh Truncation Technique for Electromagnetics

    The emergence of General Purpose Graphics Processing Units (GPGPUs) provides new opportunities to accelerate applications involving a large number of regular computations. However, properly leveraging the computational resources of graphics processors is a very challenging task. In this paper, we use this kind of device to parallelize FE-IIEE (Finite Element-Iterative Integral Equation Evaluation), a non-standard finite element mesh truncation technique introduced by two of the authors. This application is computationally very demanding due to the amount, size, and complexity of the data involved in the procedure. Moreover, an efficient implementation becomes even more difficult if the parallelization has to preserve the complex workflow of the original code. The proposed CUDA implementation applies different optimization techniques to improve performance, including leveraging the fastest memories of the GPU and increasing the granularity of the computations to reduce the impact of memory accesses. We have applied our parallel algorithm to two real radiation and scattering problems, demonstrating speedups higher than 140 on a state-of-the-art GPU. This work was supported in part by the Spanish Government under Grant TEC2016-80386-P, Grant TIN2017-82972-R, and Grant ESP2015-68245-C4-1-P, and in part by the Valencian Regional Government under Grant PROMETEO/2019/109.
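    The two optimizations mentioned above (using the GPU's fastest memories and raising the arithmetic intensity of each thread) can be illustrated with a simplified "evaluate a field at many observation points from many sources" kernel written in the classic shared-memory tiling style. This is a generic sketch, not the actual FE-IIEE kernels; the 1/r weighting and all names are assumptions.

    #include <cuda_runtime.h>

    #define TILE 256   // launch with blockDim.x == TILE

    // Each block stages a tile of sources in shared memory (on-chip, much faster
    // than global memory) and each thread accumulates the contribution of the
    // whole tile to its observation point, so every global load feeds many flops.
    __global__ void evaluateField(const float3* obs, const float3* src,
                                  const float* weight, float* field,
                                  int numObs, int numSrc) {
        __shared__ float3 sPos[TILE];
        __shared__ float  sW[TILE];
        int o = blockIdx.x * blockDim.x + threadIdx.x;
        float3 p = (o < numObs) ? obs[o] : make_float3(0.f, 0.f, 0.f);
        float acc = 0.0f;
        for (int base = 0; base < numSrc; base += TILE) {
            int s = base + threadIdx.x;                      // cooperative tile load
            if (s < numSrc) { sPos[threadIdx.x] = src[s]; sW[threadIdx.x] = weight[s]; }
            __syncthreads();
            int len = min(TILE, numSrc - base);
            for (int k = 0; k < len; ++k) {
                float dx = p.x - sPos[k].x, dy = p.y - sPos[k].y, dz = p.z - sPos[k].z;
                acc += sW[k] * rsqrtf(dx * dx + dy * dy + dz * dz + 1e-12f);
            }
            __syncthreads();
        }
        if (o < numObs) field[o] = acc;
    }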