1,012 research outputs found
A parallel generalized relaxation method for high-performance image segmentation on GPUs
Fast and scalable software modules for image segmentation are needed for modern high-throughput screening platforms in Computational Biology. Indeed, accurate segmentation is one of the main steps in a basic software pipeline aimed at extracting accurate measurements from a large number of images. Image segmentation is often formulated through a variational principle, where the solution is the minimum of a suitable functional, as in the case of the Ambrosio–Tortorelli model. The Euler–Lagrange equations associated with this model are a system of two coupled elliptic partial differential equations whose finite-difference discretization can be efficiently solved by a generalized relaxation method, such as Jacobi or Gauss–Seidel, corresponding to a first-order alternating minimization scheme. In this work we present a parallel software module for image segmentation based on the Parallel Sparse Basic Linear Algebra Subprograms (PSBLAS), a general-purpose library for parallel sparse matrix computations, using its Graphics Processing Unit (GPU) extensions, which allow us to exploit in a simple and transparent way the performance capabilities of both multi-core CPUs and many-core GPUs. We discuss performance results in terms of execution times and speed-up of the segmentation module running on GPUs as well as on multi-core CPUs, in the analysis of 2D gray-scale images of mouse embryonic stem cell colonies coming from biological experiment
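As a rough illustration of the relaxation idea mentioned in this abstract, the sketch below applies Jacobi sweeps to a single, much simpler elliptic equation, u - alpha * Laplacian(u) = g, discretized with the 5-point stencil. It is not the coupled Ambrosio–Tortorelli system and does not use PSBLAS; the function name `jacobi_smooth`, the parameter `alpha`, the unit grid spacing, and the replicated-edge (Neumann-like) boundary treatment are all assumptions made for the example.

```python
import numpy as np

def jacobi_smooth(g, alpha=1.0, iters=200):
    """Jacobi-style sweeps for u - alpha * Laplacian(u) = g on a 2D grid.

    Hypothetical helper: 5-point stencil, unit grid spacing, and
    replicated edges standing in for a zero-flux boundary condition.
    """
    g = np.asarray(g, dtype=float)
    u = g.copy()
    for _ in range(iters):
        # Pad with edge values so every pixel has four neighbours.
        p = np.pad(u, 1, mode="edge")
        nbrs = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
        # Every pixel is updated from the previous iterate only, which is
        # what makes the sweep trivially parallel across pixels.
        u = (g + alpha * nbrs) / (1.0 + 4.0 * alpha)
    return u

rng = np.random.default_rng(0)
noisy = rng.random((16, 16))
smooth = jacobi_smooth(noisy)
```

Because each new value depends only on the previous iterate, all pixels can be updated independently, which is the property that makes this family of methods attractive on GPUs.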
Connected component identification and cluster update on GPU
Cluster identification tasks occur in a multitude of contexts in physics and
engineering such as, for instance, cluster algorithms for simulating spin
models, percolation simulations, segmentation problems in image processing, or
network analysis. While it has been shown that graphics processing units (GPUs)
can result in speedups of two to three orders of magnitude as compared to
serial codes on CPUs for the case of local and thus naturally parallelized
problems such as single-spin flip update simulations of spin models, the
situation is considerably more complicated for the non-local problem of cluster
or connected component identification. I discuss the suitability of different
approaches of parallelization of cluster labeling and cluster update algorithms
for calculations on GPU and compare to the performance of serial
implementations.
Comment: 15 pages, 14 figures, one table, submitted to PR
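To make the non-locality of cluster identification concrete, here is a minimal sketch of one simple local approach, iterative minimum-label propagation on a 2D lattice with open boundaries. It is not necessarily one of the algorithms compared in the paper; the function name `label_clusters` and the 4-neighbour connectivity are assumptions for illustration.

```python
import numpy as np

def label_clusters(occupied):
    """Label 4-connected clusters by repeated min-label propagation.

    Hypothetical helper: occupied sites start with unique labels and
    repeatedly adopt the smallest label among themselves and their
    occupied neighbours until nothing changes. Empty sites get -1.
    """
    occupied = np.asarray(occupied, dtype=bool)
    big = occupied.size  # sentinel larger than any valid label
    ids = np.arange(occupied.size).reshape(occupied.shape)
    labels = np.where(occupied, ids, big)
    while True:
        p = np.pad(labels, 1, mode="constant", constant_values=big)
        m = np.minimum.reduce(
            [labels, p[:-2, 1:-1], p[2:, 1:-1], p[1:-1, :-2], p[1:-1, 2:]]
        )
        new = np.where(occupied, m, big)
        if np.array_equal(new, labels):
            return np.where(occupied, labels, -1)
        labels = new

grid = np.array([[1, 1, 0, 1],
                 [0, 1, 0, 1],
                 [0, 0, 0, 0],
                 [1, 0, 1, 1]], dtype=bool)
lab = label_clusters(grid)
```

Each sweep is fully parallel over sites, but the number of sweeps grows with the extent of the largest cluster, which hints at why the non-local problem is harder to accelerate than single-spin updates.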
Enhancing Efficiency of Ejection Fraction Calculation in the Left Ventricle
The calculation of the cardiac ejection fraction is important for determining whether or not a patient suffers from cardiovascular disease (CVD). However, manual calculation of the ejection fraction (EF) is prone to errors and is known to be prohibitively time-consuming. As such, there have been endeavors to automate this process for the sake of saving time as well as improving accuracy of estimation. Recently, GPUs have been proposed to enhance the performance of machine learning algorithms that attempt to estimate the EF. In addition, these algorithms are considered a necessary component in solving computational efficiency issues encountered in dealing with huge Digital Imaging and Communications in Medicine (DICOM) datasets. In this study, we used a DICOM dataset of cardiac magnetic resonance imaging for 1200 human cases of different ages and genders to calculate the ejection fraction in the left ventricle (LV). A Convolutional Neural Network (CNN) was the neural network selected for the training phase of LV segmentation and volume calculation. Our target is to enhance the efficiency of the CNN in order to speed up the training phase, and subsequently the prediction of CVDs, by experimenting with different GPU-based parallelism techniques, namely Data Parallelism (DP) and Model Parallelism (MP), in addition to the generic use of multiple GPUs. Specifically, we performed four variants of experiments: the first used GPUs without applying any control on their behavior, the second and third used either DP alone or MP alone on multiple GPUs, and the fourth combined both DP and MP. This was done on Amazon EC2 instances that support up to 8 GPUs per instance. We used two EC2 instances to run our experiments on 16 GPUs. Our experiments show that our proposed combination of DP and MP has the best computational efficiency. Precisely, a speedup of up to 9.88 (over a single GPU) was achieved when using 16 GPUs in parallel with combined DP and MP
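The data-parallel idea in this abstract can be sketched on a toy problem: each worker computes a gradient on its own equally sized shard of the batch, and the averaged result matches the full-batch gradient. The example uses a linear least-squares model in NumPy rather than a CNN on real GPUs, and the names `mse_gradient`, `data_parallel_gradient`, and `parallel_efficiency` are hypothetical helpers, not code from the study.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of the mean squared error of a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_gradient(w, X, y, n_workers):
    """Average per-shard gradients, as simulated data-parallel workers would.

    np.split requires the batch to divide evenly among the workers, which
    is also the condition for the average to equal the full-batch gradient.
    """
    X_shards = np.split(X, n_workers)
    y_shards = np.split(y, n_workers)
    grads = [mse_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    return np.mean(grads, axis=0)

def parallel_efficiency(speedup, n_devices):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_devices

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = rng.normal(size=32)
w = rng.normal(size=4)
g_full = mse_gradient(w, X, y)
g_dp = data_parallel_gradient(w, X, y, n_workers=4)
eff = parallel_efficiency(9.88, 16)
```

With the figures reported in the abstract, a speedup of 9.88 on 16 GPUs corresponds to a parallel efficiency of roughly 0.62.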
RAMA: A Rapid Multicut Algorithm on GPU
We propose a highly parallel primal-dual algorithm for the multicut (a.k.a. correlation clustering) problem, a classical graph clustering problem widely used in machine learning and computer vision. Our algorithm consists of three steps executed recursively: (1) finding conflicted cycles that correspond to violated inequalities of the underlying multicut relaxation, (2) performing message passing between the edges and cycles to optimize the Lagrange relaxation coming from the found violated cycles, producing reduced costs, and (3) contracting edges with high reduced costs through matrix-matrix multiplications. Our algorithm produces primal solutions and dual lower bounds that estimate the distance to optimum. We implement our algorithm on GPUs and show resulting improvements of one to two orders of magnitude in execution speed, without sacrificing solution quality, compared to traditional serial algorithms that run on CPUs. We can solve very large scale benchmark problems with up to variables in a few seconds with small primal-dual gaps. We make our code available at https://github.com/pawelswoboda/RAMA
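Step (1) of this abstract can be illustrated with a small sequential sketch. It assumes the usual convention that positive edge costs are attractive and negative costs are repulsive, and that a conflicted cycle contains exactly one repulsive edge. Under that assumption, a repulsive edge lies on a conflicted cycle exactly when its endpoints are already connected through attractive edges. This is a plain CPU union-find check, not the parallel GPU procedure of the paper, and the function names are hypothetical.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def conflicted_repulsive_edges(n_nodes, edges):
    """Return repulsive edges whose endpoints are joined by attractive edges.

    edges: iterable of (u, v, cost); cost > 0 is treated as attractive and
    cost < 0 as repulsive (an assumed sign convention).
    """
    parent = list(range(n_nodes))
    for u, v, cost in edges:
        if cost > 0:
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:
                parent[ru] = rv
    return [
        (u, v)
        for u, v, cost in edges
        if cost < 0 and find(parent, u) == find(parent, v)
    ]

edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, -1.5), (2, 3, -0.5)]
conflicts = conflicted_repulsive_edges(4, edges)
```

In this toy graph the repulsive edge (0, 2) closes a cycle through the attractive edges (0, 1) and (1, 2), so it is reported, while (2, 3) is not.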
Doctor of Philosophy
Stochastic methods, dense free-form mapping, atlas construction, and total variation are examples of advanced image processing techniques which are robust but computationally demanding. These algorithms often require a large amount of computational power as well as massive memory bandwidth. These requirements used to be fulfilled only by supercomputers. The development of heterogeneous parallel subsystems and computation-specialized devices such as Graphics Processing Units (GPUs) has brought the requisite power to commodity hardware, opening up opportunities for scientists to experiment and evaluate the influence of these techniques on their research and practical applications. However, harnessing the processing power from modern hardware is challenging. The differences between multicore parallel processing systems and conventional models are significant, often requiring algorithms and data structures to be redesigned significantly for efficiency. It also demands in-depth knowledge about modern hardware architectures to optimize these implementations, sometimes on a per-architecture basis. The goal of this dissertation is to introduce a solution for this problem based on a 3D image processing framework, using high performance APIs at the core level to utilize parallel processing power of the GPUs. The design of the framework facilitates an efficient application development process, which does not require scientists to have extensive knowledge about GPU systems, and encourages them to harness this power to solve their computationally challenging problems.
To present the development of this framework, four main problems are described, and the solutions are discussed and evaluated: (1) essential components of a general 3D image processing library: data structures and algorithms, as well as how to implement these building blocks on the GPU architecture for optimal performance; (2) an implementation of unbiased atlas construction algorithms, as an illustration of how to solve a highly complex and computationally expensive problem using this framework; (3) an extension of the framework to account for geometry descriptors to solve registration challenges with large scale shape changes and high intensity-contrast differences; and (4) an out-of-core streaming model, which enables developers to implement multi-image processing techniques on commodity hardware
Activity recognition from videos with parallel hypergraph matching on GPUs
In this paper, we propose a method for activity recognition from videos based
on sparse local features and hypergraph matching. We benefit from special
properties of the temporal domain in the data to derive a sequential and fast
graph matching algorithm for GPUs.
Traditionally, graphs and hypergraphs are frequently used to recognize
complex and often non-rigid patterns in computer vision, either through graph
matching or point-set matching with graphs. Most formulations resort to the
minimization of a difficult discrete energy function mixing geometric or
structural terms with data attached terms involving appearance features.
Traditional methods solve this minimization problem approximately, for instance
with spectral techniques.
In this work, instead of solving the problem approximately, the exact
solution for the optimal assignment is calculated in parallel on GPUs. The
graphical structure is simplified and regularized, which allows us to derive an
efficient recursive minimization algorithm. The algorithm distributes
subproblems over the calculation units of a GPU, which solves them in parallel,
allowing the system to run faster than real-time on medium-end GPUs
Pushing the Boundaries of Boundary Detection using Deep Learning
In this work we show that adapting Deep Convolutional Neural Network training
to the task of boundary detection can result in substantial improvements over
the current state-of-the-art in boundary detection.
Our contributions consist firstly in combining a careful design of the loss
for boundary detection training, a multi-resolution architecture and training
with external data to improve the detection accuracy of the current state of
the art. When measured on the standard Berkeley Segmentation Dataset, we
improve the optimal dataset scale F-measure from 0.780 to 0.808 - while human
performance is at 0.803. We further improve performance to 0.813 by combining
deep learning with grouping, integrating the Normalized Cuts technique within a
deep network.
We also examine the potential of our boundary detector in conjunction with
the task of semantic segmentation and demonstrate clear improvements over
state-of-the-art systems. Our detector is fully integrated in the popular Caffe
framework and processes a 320x420 image in less than a second.
Comment: The previous version reported large improvements w.r.t. the LPO
region proposal baseline, which turned out to be due to a wrong computation
for the baseline. The improvements are currently less important, and are
omitted. We are sorry if the reported results caused any confusion. We have
also integrated reviewer feedback regarding human performance on the BSD
benchmar
Sparse approximate inverse preconditioners on high performance GPU platforms
Simulation with models based on partial differential equations often requires the solution of (sequences of) large and sparse algebraic linear systems. In multidimensional domains, preconditioned Krylov iterative solvers are often appropriate for these duties. Therefore, the search for efficient preconditioners for Krylov subspace methods is a crucial theme. Recent developments, especially in computing hardware, have renewed the interest in approximate inverse preconditioners in factorized form, because their application during the solution process can be more efficient. We present here some experiences focused on the approximate inverse preconditioners proposed by Benzi and Tůma in 1996 and the sparsification and inversion proposed by van Duin in 1999. Computational costs, reorderings and implementation issues are considered on both conventional and innovative computing architectures like Graphics Processing Units (GPUs)
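The appeal of factorized approximate inverses on GPUs is that applying them needs only matrix-vector products rather than triangular solves. The sketch below shows a dense, simplified A-biconjugation for a symmetric positive definite matrix, in the spirit of the factorized approximate inverse idea. It computes Z and d with Z^T A Z = diag(d) when nothing is dropped, so that the inverse of A is approximated by Z diag(1/d) Z^T. The function name `ainv_spd`, the dense storage, and the simple absolute drop rule are assumptions for illustration. A practical implementation would use sparse storage and guard against breakdown when dropping is active.

```python
import numpy as np

def ainv_spd(A, drop_tol=0.0):
    """Dense sketch of a factorized approximate inverse for SPD A.

    Returns (Z, d) with Z unit upper triangular. With drop_tol == 0 the
    factorization is exact: Z.T @ A @ Z == diag(d).
    """
    n = A.shape[0]
    Z = np.eye(n)
    d = np.zeros(n)
    for i in range(n):
        # p[k] = (row i of A) . z_{i+k}, using the current columns of Z.
        p = A[i, :] @ Z[:, i:]
        d[i] = p[0]
        for k in range(1, n - i):
            # Make column i+k A-orthogonal to column i.
            Z[:, i + k] -= (p[k] / p[0]) * Z[:, i]
        if drop_tol > 0.0:
            # Sparsify: discard small entries in the not-yet-final columns.
            block = Z[:, i + 1:]
            block[np.abs(block) < drop_tol] = 0.0
    return Z, d

def apply_preconditioner(Z, d, r):
    """Apply the approximate inverse to r using only matrix-vector products."""
    return Z @ ((Z.T @ r) / d)

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
Z, d = ainv_spd(A)
x = apply_preconditioner(Z, d, np.array([1.0, 2.0, 3.0]))
```

With no dropping the result reproduces the exact inverse, so `x` solves `A @ x = [1, 2, 3]` up to rounding. A positive `drop_tol` trades accuracy for sparsity in Z.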