19 research outputs found
A data relocation approach for terrain surface analysis on multi-GPU systems: a case study on the total viewshed problem
Digital Elevation Models (DEMs) are important datasets for modelling
line-of-sight phenomena such as radio signals, sound waves and human vision. These are
commonly analyzed using rotational sweep algorithms. However, such algorithms
require large numbers of memory accesses to 2D arrays which, despite being
regular, result in poor data locality in memory. Here, we propose a new
methodology called skewed Digital Elevation Model (sDEM), which substantially
improves the locality of memory accesses and increases the inherent parallelism
involved in the computation of rotational sweep-based algorithms. In
particular, sDEM applies a data restructuring technique before accessing the
memory and performing the computation. To demonstrate the high efficiency of
sDEM, we use the problem of total viewshed computation as a case study
considering different implementations for single-core, multi-core, single-GPU
and multi-GPU platforms. We conducted two experiments to compare sDEM with (i)
the most commonly used geographic information systems (GIS) software and (ii)
the state-of-the-art algorithm. In the first experiment, sDEM is on average
8.8x faster than current GIS software despite being able to consider only a few
points because of their limitations. In the second experiment, sDEM is 827.3x
faster than the state-of-the-art algorithm in the best case.
TOMOBFLOW: feature-preserving noise filtering for electron tomography
Background: Noise filtering techniques are needed in electron tomography to allow proper interpretation of datasets. Standard linear filtering techniques are characterized by a tradeoff between the amount of noise reduced and the blurring of the features of interest. Sophisticated anisotropic nonlinear filtering techniques, on the other hand, allow noise reduction with good preservation of structures. However, these techniques are computationally intensive and difficult to tune to the problem at hand. Results: TOMOBFLOW is a program for noise filtering that preserves biologically relevant information. It is an efficient implementation of the Beltrami flow, a nonlinear filtering method that locally tunes the strength of the smoothing according to an edge indicator based on geometric properties. Because this method has no free parameters that are hard to tune, TOMOBFLOW is a user-friendly filtering program equipped with the power of diffusion-based filtering methods. Furthermore, TOMOBFLOW can handle different image types and formats, which makes it useful for electron tomography in particular and bioimaging in general. Conclusion: TOMOBFLOW allows efficient noise filtering of bioimaging datasets with preservation of the features of interest, thereby yielding data better suited for post-processing, visualization and interpretation. It is available at the web site http://www.ual.es/%7ejjfdez/SW/tomobflow.html.
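To illustrate how an edge indicator can locally tune the smoothing strength, here is a minimal one-dimensional sketch. It is not TOMOBFLOW's implementation: the function name `beltrami_step_1d`, the parameters `dt` and `beta`, and the simplified 1D update I_t = I_xx / g^2 with g = 1 + (beta * I_x)^2 are assumptions made for this example.

```python
def beltrami_step_1d(signal, dt=0.2, beta=1.0):
    """One explicit step of a 1D Beltrami-like flow (illustrative sketch).

    Assumed update: I_t = I_xx / g**2, with g = 1 + (beta * I_x)**2.
    Where the gradient is large (an edge), g is large and smoothing is weak;
    where the signal is nearly flat, g is close to 1 and smoothing is strong.
    Boundary samples are left unchanged.
    """
    out = list(signal)
    for i in range(1, len(signal) - 1):
        dx = (signal[i + 1] - signal[i - 1]) / 2.0             # central first derivative
        dxx = signal[i + 1] - 2.0 * signal[i] + signal[i - 1]  # second derivative
        g = 1.0 + (beta * dx) ** 2                             # edge indicator
        out[i] = signal[i] + dt * dxx / (g * g)
    return out


if __name__ == "__main__":
    bump = beltrami_step_1d([0.0, 0.0, 0.1, 0.0, 0.0])  # small noise-like bump
    edge = beltrami_step_1d([0.0, 0.0, 10.0, 10.0])     # strong step edge
    print(bump)
    print(edge)
```

In this sketch the small bump is damped noticeably in a single step, while the samples on either side of the large step barely move, which is the qualitative behaviour the abstract attributes to the method.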
A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study
This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3d stencil stages and explores their optimization space on GPUs. For this study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue for these applications is determining the optimal fission/fusion level of the stages involved and whether that combination benefits from data tiling. This implies exploring a large space of all the possible fission/fusion combinations with and without tiling, which makes the process non-trivial. This study provides insights to reduce the optimization tuning space and programming effort of iterative multiple 3d stencils. Our results demonstrate that all combinations that fuse the bottleneck stencil with a high halo update cost (> 25%; this percentage can be measured or estimated experimentally for each individual stencil) and high register and shared memory accesses can be excluded from the exploration process. The optimal fission/fusion combination is up to 1.65× faster than the case in which we fully decompose our stencil without tiling and 5.3× faster than the fully fused version on NVIDIA GPUs.
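The fission/fusion trade-off can be seen in a toy one-dimensional pipeline. The sketch below is an illustration only, not the paper's Anisotropic Nonlinear Diffusion code: the two stages (`stage_sum`, `stage_diff`) and the fused version (`fused`) are hypothetical. It shows that fusing two radius-1 stencils removes the intermediate array but widens the halo to radius 2.

```python
def stage_sum(a):
    """Stage 1 (radius-1 stencil): three-point sum over interior points."""
    return [a[i - 1] + a[i] + a[i + 1] for i in range(1, len(a) - 1)]


def stage_diff(b):
    """Stage 2 (radius-1 stencil): difference of the two neighbours."""
    return [b[j + 1] - b[j - 1] for j in range(1, len(b) - 1)]


def fused(a):
    """Both stages in one pass: no intermediate array, but a radius-2 halo.

    Derived by substituting stage 1 into stage 2:
    c[i] = (a[i] + a[i+1] + a[i+2]) - (a[i-2] + a[i-1] + a[i])
         = a[i+1] + a[i+2] - a[i-2] - a[i-1]
    """
    return [a[i + 1] + a[i + 2] - a[i - 2] - a[i - 1]
            for i in range(2, len(a) - 2)]


if __name__ == "__main__":
    data = [1, 4, 9, 16, 25, 36, 49]
    print(stage_diff(stage_sum(data)))  # fissioned pipeline, two passes
    print(fused(data))                  # fused pipeline, one pass
```

Both versions produce the same interior values. On a GPU the fused form saves a round trip through memory for the intermediate array, at the cost of a larger halo per tile, which is the kind of cost the abstract says can make fusion unprofitable.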
Demystifying the 16 × 16 thread-block for stencils on the GPU
Summary: Stencil computation is of paramount importance in many fields, including image processing, structural biology and biomedicine. There is a constant demand to maximize the performance of stencils on state-of-the-art architectures, such as graphics processing units (GPUs). One of the important issues when optimizing these kernels for the GPU is selecting the thread-block that maximizes the overall performance. Usually, programmers look for the optimal thread-block configuration in a reduced space of square thread-block configurations or simply use the best configuration reported in previous works, which is usually 16 × 16. This paper provides a better understanding of the impact of thread-block configurations on the performance of stencils on the GPU. In particular, we model locality and parallelism and consider that the optimal configurations are within the space that provides: (1) a small number of global memory communications; (2) good shared memory utilization with a small number of conflicts; (3) good utilization of the streaming multiprocessors; and (4) high efficiency of the threads within a thread-block. The model determines the set of optimal thread-block configurations without the need to execute the code. We validate the proposed model using six stencils with different halo widths and show that it reduces the optimization space to around 25% of the total valid space. The configurations in this space achieve a throughput of at least 75% of that of the best configuration, and the space is guaranteed to include the best configurations.
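A rough sketch of how a thread-block search space can be pruned without running any kernel is given below. It is not the paper's model: the two proxies (`halo_overhead` for global memory traffic and `warp_efficiency` for thread efficiency), the warp size of 32, the 1024-thread limit, and the restriction to power-of-two shapes are all assumptions chosen for illustration.

```python
WARP = 32           # assumed warp size
MAX_THREADS = 1024  # assumed per-block thread limit


def block_shapes():
    """Power-of-two (bx, by) shapes with at least one warp and at most MAX_THREADS."""
    sizes = [2 ** k for k in range(11)]
    return [(bx, by) for bx in sizes for by in sizes
            if WARP <= bx * by <= MAX_THREADS]


def halo_overhead(bx, by, halo):
    """Elements loaded per element computed for a 2D tile with the given halo."""
    return (bx + 2 * halo) * (by + 2 * halo) / (bx * by)


def warp_efficiency(bx, by):
    """Fraction of lanes doing useful work in the warps the block occupies."""
    threads = bx * by
    warps = -(-threads // WARP)  # ceiling division
    return threads / (warps * WARP)


def shortlist(halo, keep):
    """Keep the `keep` full-warp shapes with the lowest halo overhead."""
    full = [s for s in block_shapes() if warp_efficiency(*s) == 1.0]
    return sorted(full, key=lambda s: halo_overhead(s[0], s[1], halo))[:keep]


if __name__ == "__main__":
    for shape in shortlist(halo=1, keep=5):
        print(shape, halo_overhead(shape[0], shape[1], 1))
```

Under this halo proxy alone, square shapes come out ahead among blocks of equal size, which matches the common 16 × 16 habit the abstract mentions. The proxy ignores memory coalescing and shared memory bank conflicts, both of which the abstract lists among the criteria, so a realistic ranking would differ.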