Modular design of data-parallel graph algorithms
Amorphous Data Parallelism has proven to be a suitable vehicle for implementing concurrent graph algorithms effectively on multi-core architectures. In view of the growing complexity of graph algorithms for information analysis, there is a need to facilitate modular design techniques in the context of Amorphous Data Parallelism. In this paper, we investigate what it takes to formulate algorithms possessing Amorphous Data Parallelism in a modular fashion that enables a large degree of code re-use. Using the betweenness centrality algorithm, a widely used algorithm in the analysis of social networks, we demonstrate that a single optimisation technique can suffice to enable a modular programming style without losing the efficiency of a tailor-made monolithic implementation.
Processor Allocation for Optimistic Parallelization of Irregular Programs
Optimistic parallelization is a promising approach for the parallelization of
irregular algorithms: potentially interfering tasks are launched dynamically,
and the runtime system detects conflicts between concurrent activities,
aborting and rolling back conflicting tasks. However, parallelism in irregular
algorithms is very complex. In a regular algorithm like dense matrix
multiplication, the amount of parallelism can usually be expressed as a
function of the problem size, so it is reasonably straightforward to determine
how many processors should be allocated to execute a regular algorithm of a
certain size (this is called the processor allocation problem). In contrast,
parallelism in irregular algorithms can be a function of input parameters, and
the amount of parallelism can vary dramatically during the execution of the
irregular algorithm. Therefore, the processor allocation problem for irregular
algorithms is very difficult.
In this paper, we describe the first systematic strategy for addressing this
problem. Our approach is based on a construct called the conflict graph, which
(i) provides insight into the amount of parallelism that can be extracted from
an irregular algorithm, and (ii) can be used to address the processor
allocation problem for irregular algorithms. We show that this problem is
related to a generalization of the unfriendly seating problem and, by extending
Turán's theorem, we obtain a worst-case class of problems for optimistic
parallelization, which we use to derive a lower bound on the exploitable
parallelism. Finally, using some theoretically derived properties and some
experimental facts, we design a quick and stable control strategy for solving
the processor allocation problem heuristically.
Comment: 12 pages, 3 figures, extended version of SPAA 2011 brief announcement
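The conflict graph idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's control strategy; it only assumes that tasks are numbered `0..n-1`, that `conflicts` is a duplicate-free list of pairs of tasks that would abort each other, and that any conflict-free (independent) set of tasks can run in the same round. The helper names `greedy_independent_set` and `turan_lower_bound` are hypothetical.

```python
def greedy_independent_set(n_tasks, conflicts):
    """Return a maximal set of tasks with no conflict edge between any two."""
    adj = {t: set() for t in range(n_tasks)}
    for a, b in conflicts:
        adj[a].add(b)
        adj[b].add(a)
    chosen, blocked = [], set()
    # Visit tasks with few conflicts first (a heuristic ordering, assumed here).
    for t in sorted(adj, key=lambda t: len(adj[t])):
        if t not in blocked:
            chosen.append(t)
            blocked.add(t)
            blocked |= adj[t]  # neighbours of t can no longer be chosen
    return chosen


def turan_lower_bound(n_tasks, conflicts):
    """Turán-type bound: some independent set has size >= n / (d_avg + 1).

    With d_avg = 2m / n this equals n^2 / (2m + n); integer ceiling division
    avoids floating-point rounding.
    """
    m = len(conflicts)
    return -(-(n_tasks * n_tasks) // (2 * m + n_tasks))


# Five tasks whose conflicts form a path 0-1-2-3-4.
path_conflicts = [(0, 1), (1, 2), (2, 3), (3, 4)]
runnable = greedy_independent_set(5, path_conflicts)
bound = turan_lower_bound(5, path_conflicts)
```

In this example the greedy pass finds three mutually non-conflicting tasks, while the bound only guarantees two; the gap between a guaranteed minimum and what a given input allows is one reason the number of processors to allocate is hard to fix in advance.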
Experimental Progress in Computation by Self-Assembly of DNA Tilings
Approaches to DNA-based computing by self-assembly require the
use of DNA nanostructures, called tiles, that have efficient chemistries, expressive
computational power, and convenient input and output (I/O) mechanisms.
We have designed two new classes of DNA tiles: TAO and TAE, both
of which contain three double-helices linked by strand exchange. Structural
analysis of a TAO molecule has shown that the molecule assembles efficiently
from its four component strands. Here we demonstrate a novel method for
I/O whereby multiple tiles assemble around a single-stranded (input) scaffold
strand. Computation by tiling theoretically results in the formation of structures
that contain single-stranded (output) reporter strands, which can then
be isolated for subsequent steps of computation if necessary. We illustrate the
advantages of TAO and TAE designs by detailing two examples of massively
parallel arithmetic: construction of complete XOR and addition tables by linear
assemblies of DNA tiles. The three-helix structures provide flexibility for
topological routing of strands in the computation, allowing the implementation
of string tile models.
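As a rough software analogy for the XOR example mentioned above, the sketch below simulates a linear assembly in which each tile binds to one input bit on the scaffold and to the value exposed by the previous tile. The encoding is an assumption (a cumulative XOR, with one hypothetical tile type per pair of input bit and incoming value); the abstract does not specify the tile set, and nothing here models the chemistry.

```python
from itertools import product

# One hypothetical tile type per (input bit, incoming value) pair.
# The value stored is the bit the tile exposes to its right-hand neighbour.
TILES = {(x, prev): x ^ prev for x, prev in product((0, 1), repeat=2)}


def assemble_xor(input_bits, seed=0):
    """Attach tiles left to right along the input scaffold.

    Each position accepts only the tile matching its input bit and the
    value exposed by the previous tile, so the assembly is deterministic.
    """
    outputs = []
    prev = seed
    for x in input_bits:
        prev = TILES[(x, prev)]
        outputs.append(prev)
    return outputs


def xor_table(n_bits):
    """Enumerate every input string, mimicking a complete table built in parallel."""
    return {bits: tuple(assemble_xor(bits)) for bits in product((0, 1), repeat=n_bits)}


example = assemble_xor([1, 0, 1, 1])
table = xor_table(2)
```

The loop over all inputs in `xor_table` is sequential here; in the tiling setting described above, the corresponding assemblies would form simultaneously, which is where the massive parallelism comes from.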
Distributed-memory large deformation diffeomorphic 3D image registration
We present a parallel distributed-memory algorithm for large deformation
diffeomorphic registration of volumetric images that produces large isochoric
deformations (locally volume preserving). Image registration is a key
technology in medical image analysis. Our algorithm uses a partial differential
equation constrained optimal control formulation. Finding the optimal
deformation map requires the solution of a highly nonlinear problem that
involves pseudo-differential operators, biharmonic operators, and pure
advection operators both forward and backward in time. A key issue is the
time to solution, which poses the demand for efficient optimization methods as
well as an effective utilization of high performance computing resources. To
address this problem we use a preconditioned, inexact, Gauss-Newton-Krylov
solver. Our algorithm integrates several components: a spectral discretization
in space, a semi-Lagrangian formulation in time, analytic adjoints, different
regularization functionals (including volume-preserving ones), a spectral
preconditioner, a highly optimized distributed Fast Fourier Transform, and a
cubic interpolation scheme for the semi-Lagrangian time-stepping. We
demonstrate the scalability of our algorithm on images with resolution of up to
on the "Maverick" and "Stampede" systems at the Texas Advanced
Computing Center (TACC). The critical problem in the medical imaging
application domain is strong scaling, that is, solving registration problems of
a moderate size of ---a typical resolution for medical images. We are
able to solve the registration problem for images of this size in less than
five seconds on 64 x86 nodes of TACC's "Maverick" system.
Comment: accepted for publication at SC16 in Salt Lake City, Utah, USA;
November 2016
Automatically configuring parallelism for hybrid layouts
Distributed processing frameworks process data in parallel by dividing it into multiple partitions, each of which is processed in a separate task. The number of tasks is always derived from the total file size. However, this can launch more tasks than needed in the case of hybrid layouts, because they allow less data to be read for certain operations (i.e., projection, selection). The over-provisioning of tasks may increase the job execution time and waste significant computing resources, because each task introduces extra overhead (e.g., initialization, garbage collection).
To allow a more efficient use of resources and reduce the job execution time, we propose a cost-based approach that decides the number of tasks based on the data being read. The proposed cost model can be utilized in a multi-objective approach to decide both the number of tasks and the number of machines for execution.
Peer Reviewed. Postprint (author's final draft)
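The difference between sizing tasks by file size and sizing them by the data actually read can be sketched as follows. This is an illustrative calculation only, not the cost model proposed in the paper; the function `estimate_tasks`, the parameter `fraction_read`, and the 128 MiB per-task target are all assumptions made for the example.

```python
from math import ceil


def estimate_tasks(total_file_bytes, bytes_per_task, fraction_read=1.0):
    """Number of tasks needed if each task handles about bytes_per_task.

    fraction_read is the (assumed known) share of the file that the
    operation really reads, e.g. after projecting a subset of columns.
    """
    bytes_read = total_file_bytes * fraction_read
    return max(1, ceil(bytes_read / bytes_per_task))


GIB = 1024 ** 3
MIB = 1024 ** 2

# Sizing by total file size: 10 GiB split into 128 MiB partitions.
size_based = estimate_tasks(10 * GIB, 128 * MIB)

# Sizing by data read: a projection that touches a quarter of the bytes.
read_based = estimate_tasks(10 * GIB, 128 * MIB, fraction_read=0.25)
```

Under these assumed numbers the size-based rule launches four times as many tasks as the read-based one, and each extra task pays its own start-up overhead.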