974 research outputs found
Computing the Component-Labeling and the Adjacency Tree of a Binary Digital Image in Near Logarithmic-Time
Connected component labeling (CCL) of binary images is
one of the fundamental operations in real time applications. The adjacency
tree (AdjT) of the connected components offers a region-based
representation where each node represents a region which is surrounded
by another region of the opposite color. In this paper, a fully parallel
algorithm for computing the CCL and AdjT of a binary digital image
is described and implemented, without the need of using any geometric
information. The time complexity order for an image of m × n pixels
under the assumption that a processing element exists for each pixel is
near O(log(m + n)). Results for a multicore processor show very good
scalability until the so-called memory bandwidth bottleneck is reached.
The inherent parallelism of our approach suggests that
even better results will be obtained on other, less classical computing
architectures.
Ministerio de Economía y Competitividad MTM2016-81030-P
Ministerio de Economía y Competitividad TEC2012-37868-C04-0
Image Processing for Multiple-Target Tracking on a Graphics Processing Unit
Multiple-target tracking (MTT) systems have been implemented on many different platforms; however, these solutions are often expensive and have long development times. Such MTT implementations require custom hardware, yet offer very little flexibility with ever-changing data sets and target tracking requirements. Typical computers are already equipped with powerful graphics processing units (GPUs) to support various games and multimedia applications, but such GPUs are not currently being used in desktop MTT applications. This research explores if and how an existing GPU can be used to supplement and enhance MTT implementations on a flexible common desktop computer without requiring costly dedicated MTT hardware and software. An MTT system was developed in MATLAB to provide baseline performance metrics for processing 24-bit, 1920x1080 color video footage filmed at 30 frames per second. The baseline MATLAB implementation was further enhanced with various custom C functions to speed it up for fair comparison and analysis. From the MATLAB MTT implementation, this research identifies potential areas of improvement through use of the GPU. The bottleneck image processing functions (frame differencing) were converted to execute on the GPU. On average, the GPU code executed 287% faster than the MATLAB implementation, and some individual functions executed 20 times faster than the baseline. These results indicate that the GPU is a viable, low-cost hardware option for significantly increasing MTT performance.
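The abstract names frame differencing as the bottleneck step but does not describe how it was implemented. As a rough illustration only, the sketch below shows the basic idea on grayscale frames stored as nested lists: a pixel is marked as moving when its intensity changes by more than a threshold between consecutive frames. The function name `frame_difference` and the default threshold of 30 are assumptions made for this example, not details taken from the work.

```python
def frame_difference(prev, curr, threshold=30):
    """Return a binary motion mask: 1 where |curr - prev| > threshold, else 0.

    `prev` and `curr` are same-sized grayscale frames given as lists of rows.
    The threshold value is a hypothetical choice for illustration.
    """
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have the same shape")
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


# Two tiny 2x3 frames: two pixels change brightness between them.
prev = [[10, 10, 10], [10, 10, 10]]
curr = [[10, 200, 10], [10, 10, 90]]
mask = frame_difference(prev, curr)
```

Each output pixel depends only on the two input pixels at the same position, which is why this kind of operation maps naturally onto one GPU thread per pixel.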
GPU-based Swendsen-Wang multi-cluster algorithm for the simulation of two-dimensional classical spin systems
We present a GPU implementation, using the Compute Unified Device Architecture
(CUDA), of the Swendsen-Wang multi-cluster algorithm for two-dimensional
classical spin systems. We adapt two recently proposed CUDA connected component
labeling algorithms to the cluster assignment step of the
Swendsen-Wang algorithm. Starting with the q-state Potts model, we extend our
implementation to a system of vector spins, the q-state clock model, using the
idea of embedded clusters. We test the performance: the calculation time on a
GTX580 is 2.51 ns per spin flip for the q=2 Potts model
(Ising model) and 2.42 ns per spin flip for the q=6 clock model, with
linear size L=4096 at the critical temperature. The
computational speed for the q=2 Potts model on the GTX580 is 12.4 times as fast as
the calculation speed on a current CPU core. That for the q=6 clock model on the
GTX580 is 35.6 times as fast as the calculation speed on a current CPU core.
Comment: accepted for publication in Comp. Phys. Commu
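To make the role of cluster assignment concrete, here is a minimal sequential sketch of one Swendsen-Wang update for the q-state Potts model on an L x L periodic square lattice. It is not the authors' CUDA code: it swaps in a plain union-find for their GPU labeling algorithms, and the function names, the coupling J = 1, and the row-major spin layout are all assumptions of this example.

```python
import math
import random


def find(parent, i):
    """Return the root of site i, shortening the path as we go."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the clusters containing sites a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def swendsen_wang_step(spins, L, q, beta, rng):
    """One Swendsen-Wang update; returns a new list of L*L spins in range(q).

    Assumes coupling J = 1 and periodic boundaries (L >= 3 avoids double bonds).
    """
    p_bond = 1.0 - math.exp(-beta)  # bond probability between equal neighbours
    parent = list(range(L * L))
    for y in range(L):
        for x in range(L):
            i = y * L + x
            right = y * L + (x + 1) % L
            down = ((y + 1) % L) * L + x
            for j in (right, down):
                if spins[i] == spins[j] and rng.random() < p_bond:
                    union(parent, i, j)
    # Cluster assignment: every cluster gets one fresh, uniformly random spin.
    new_spin = {}
    result = []
    for i in range(L * L):
        root = find(parent, i)
        if root not in new_spin:
            new_spin[root] = rng.randrange(q)
        result.append(new_spin[root])
    return result


# Example: a 4x4 Ising (q=2) lattice at the square-lattice critical point,
# beta_c = ln(1 + sqrt(q)) for J = 1.
rng = random.Random(0)
spins = [rng.randrange(2) for _ in range(16)]
spins = swendsen_wang_step(spins, L=4, q=2, beta=math.log(1 + math.sqrt(2)), rng=rng)
```

The union-find loop is the part that corresponds to connected component labeling: bonds define a graph on the lattice sites, and each connected component is flipped as a unit.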
Multi-GPU-based Swendsen-Wang multi-cluster algorithm for the simulation of two-dimensional q-state Potts model
We present a multiple-GPU implementation, using the Compute Unified Device
Architecture (CUDA), of the Swendsen-Wang multi-cluster algorithm for the
two-dimensional (2D) q-state Potts model. Extending our algorithm for single
GPU computing [Comp. Phys. Comm. 183 (2012) 1155], we realize the GPU
computation of the Swendsen-Wang multi-cluster algorithm on multiple GPUs. We
implement our code on the large-scale open science supercomputer TSUBAME 2.0,
and test the performance and scalability of the simulation of the 2D Potts
model. The performance on Tesla M2050 using 256 GPUs is 37.3 spin
flips per nanosecond for the q=2 Potts model (Ising model) at the critical
temperature with linear system size L=65536.
Comment: accepted for publication in Comp. Phys. Commun. arXiv admin note:
substantial text overlap with arXiv:1202.063
How does Connected Components Labeling with Decision Trees perform on GPUs?
In this paper, the problem of Connected Components Labeling (CCL) in binary images using Graphics Processing Units (GPUs) is tackled from a different perspective. In the last decade, many novel algorithms specifically designed for GPUs have been released. Because the CCL literature on sequential algorithms is very rich and includes many efficient solutions, designers of parallel algorithms were often inspired by techniques that had already proved successful in a sequential environment, such as the Union-Find paradigm for solving equivalences between provisional labels. However, the use of decision trees to minimize memory accesses, which is one of the main features of the best-performing sequential algorithms, was never taken into account when designing parallel CCL solutions. In fact, branches in the code tend to cause thread divergence, which usually leads to inefficiency. This consideration, however, does not necessarily apply to every possible scenario. Are we sure that the advantages of decision trees do not compensate for the cost of thread divergence? To answer this question, we chose three well-known sequential CCL algorithms that employ decision trees as the cornerstone of their strategy, and we built a data-parallel version of each of them. Experimental tests on real-case datasets show that, in most cases, these solutions outperform state-of-the-art algorithms, thus demonstrating the effectiveness of decision trees in a parallel environment as well.
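For readers unfamiliar with the Union-Find paradigm mentioned in this abstract, the following sketch shows the classic sequential two-pass labeling with 4-connectivity. It is a textbook baseline written for illustration, not one of the decision-tree algorithms the paper parallelizes, and the function name `label_components` is an assumption of this example.

```python
def label_components(image):
    """Two-pass connected components labeling (4-connectivity).

    `image` is a list of rows of 0/1 values. Returns (labels, count), where
    labels uses 0 for background and 1..count for the components.
    """
    h = len(image)
    w = len(image[0]) if h else 0
    labels = [[0] * w for _ in range(h)]
    parent = [0]  # index 0 is a placeholder for the background

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # First pass: assign provisional labels and record equivalences.
    for y in range(h):
        for x in range(w):
            if not image[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))  # new provisional label
                labels[y][x] = len(parent) - 1
            elif up and left:
                ru, rl = find(up), find(left)
                lo, hi = min(ru, rl), max(ru, rl)
                parent[hi] = lo  # the two labels are equivalent
                labels[y][x] = lo
            else:
                labels[y][x] = up or left

    # Second pass: replace provisional labels by consecutive final labels.
    remap = {}
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                root = find(labels[y][x])
                if root not in remap:
                    remap[root] = len(remap) + 1
                labels[y][x] = remap[root]
    return labels, len(remap)


# A U-shaped object: its two arms get different provisional labels
# that are merged when the scan reaches the bottom row.
u_shape = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]
u_labels, u_count = label_components(u_shape)
```

The two neighbor checks in the first pass are the memory accesses that decision-tree variants try to minimize by testing pixels in a carefully chosen order.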
A Block-Based Union-Find Algorithm to Label Connected Components on GPUs
In this paper, we introduce a novel GPU-based Connected Components Labeling algorithm: the Block-based Union Find. The proposed strategy significantly improves an existing GPU algorithm by taking advantage of a block-based approach. Experimental results on real cases and synthetically generated datasets demonstrate the superiority of the new proposal with respect to the state of the art.