On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution
Technology trends are making the cost of data movement increasingly dominant,
both in terms of energy and time, over the cost of performing arithmetic
operations in computer systems. The fundamental ratio of aggregate data
movement bandwidth to total computational power (also referred to as the
machine balance parameter) in parallel computer systems is decreasing. It is
therefore of considerable importance to characterize the inherent data
movement requirements of parallel algorithms, so that the minimal architectural
balance parameters required to support them on future systems can be well
understood. In this paper, we extend the well-known red-blue pebble game to
derive lower bounds on the data movement complexity for the
parallel execution of computational directed acyclic graphs (CDAGs) on parallel
systems. We model multi-node multi-core parallel systems, with the total
physical memory distributed across the nodes (that are connected through some
interconnection network) and in a multi-level shared cache hierarchy for
processors within a node. We also develop new techniques for lower bound
characterization of non-homogeneous CDAGs. We demonstrate the use of the
methodology by analyzing the CDAGs of several numerical algorithms, to develop
lower bounds on data movement for their parallel execution.
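
The sequential red-blue pebble game that this methodology extends is simple enough to solve exactly on tiny CDAGs. The following sketch brute-forces the minimum I/O under the classical Hong & Kung rules; the paper's parallel, multi-node, multi-level extension is not modeled here, and the function name, DAG encoding, and example are illustrative only.

    import heapq
    from itertools import count

    def min_io(dag, inputs, outputs, S):
        """Minimum I/O in the sequential red-blue pebble game, found by
        Dijkstra over game states. Red pebbles model fast memory (at most S
        of them), blue pebbles model slow memory (unbounded). `dag` maps
        each vertex to the tuple of its predecessors (empty for inputs)."""
        targets = frozenset(outputs)
        start = (frozenset(), frozenset(inputs))
        dist, tie = {start: 0}, count()
        heap = [(0, next(tie), start)]
        while heap:
            cost, _, state = heapq.heappop(heap)
            if cost > dist[state]:
                continue                             # stale heap entry
            red, blue = state
            if targets <= blue:
                return cost                          # all outputs written back
            moves = []
            for v in blue - red:                     # load: one I/O
                if len(red) < S:
                    moves.append(((red | {v}, blue), 1))
            for v in red - blue:                     # store: one I/O
                moves.append(((red, blue | {v}), 1))
            for v, preds in dag.items():             # compute: free
                if v not in red and preds and set(preds) <= red and len(red) < S:
                    moves.append(((red | {v}, blue), 0))
            for v in red:                            # evict a red pebble: free
                moves.append(((red - {v}, blue), 0))
            for nxt, c in moves:
                if cost + c < dist.get(nxt, float("inf")):
                    dist[nxt] = cost + c
                    heapq.heappush(heap, (cost + c, next(tie), nxt))
        return None                                  # infeasible for this S

    # Tiny chain c = f(a, b); d = g(c): the minimum is 3 I/Os
    # (load a, load b, store d).
    dag = {"a": (), "b": (), "c": ("a", "b"), "d": ("c",)}
    print(min_io(dag, inputs=("a", "b"), outputs=("d",), S=3))   # -> 3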
On Characterizing the Data Access Complexity of Programs
Technology trends will cause data movement to account for the majority of
energy expenditure and execution time on emerging computers. Therefore,
computational complexity will no longer be a sufficient metric for comparing
algorithms, and a fundamental characterization of data access complexity will
be increasingly important. The problem of developing lower bounds for data
access complexity has been modeled using the formalism of Hong & Kung's
red/blue pebble game for computational directed acyclic graphs (CDAGs).
However, previously developed approaches to lower bounds analysis for the
red/blue pebble game are very limited in effectiveness when applied to CDAGs of
real programs, whose computations comprise multiple sub-computations with
differing DAG structure. We address this problem by developing an approach for
effectively composing lower bounds based on graph decomposition. We also
develop a static analysis algorithm to derive the asymptotic data-access lower
bounds of programs, as a function of the problem size and cache size.
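
As a rough illustration of the composition idea (not the paper's decomposition technique or static analysis), suppose a program's CDAG splits into vertex-disjoint sub-CDAGs whose individual I/O lower bounds are classical results; a bound for the whole program is then at least their sum. The kernel bounds below are standard, the helper and kernel names are hypothetical, and boundary corrections are ignored.

    import math

    # Classical per-kernel I/O lower bounds as functions of problem size n
    # and fast-memory size S; constant factors omitted.
    KERNEL_BOUNDS = {
        "matmul": lambda n, S: n**3 / math.sqrt(S),              # Hong & Kung
        "fft":    lambda n, S: n * math.log2(n) / math.log2(S),  # Hong & Kung
    }

    def composed_bound(kernels, S):
        """If the CDAG decomposes into vertex-disjoint sub-CDAGs, any
        schedule must pay at least the sum of the sub-CDAG lower bounds
        (reuse across sub-CDAG boundaries, which a careful analysis must
        account for, is ignored in this toy version)."""
        return sum(KERNEL_BOUNDS[name](n, S) for name, n in kernels)

    # An FFT of 2^20 points followed by a 1024 x 1024 matrix multiplication:
    print(composed_bound([("fft", 2**20), ("matmul", 1024)], S=2**15))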
A Lower Bound Technique for Communication in BSP
Communication is a major factor determining the performance of algorithms on
current computing systems; it is therefore valuable to provide tight lower
bounds on the communication complexity of computations. This paper presents a
lower bound technique for the communication complexity in the bulk-synchronous
parallel (BSP) model of a given class of DAG computations. The derived bound is
expressed in terms of the switching potential of a DAG, that is, the number of
permutations that the DAG can realize when viewed as a switching network. The
proposed technique yields tight lower bounds for the fast Fourier transform
(FFT), and for any sorting and permutation network. A stronger bound is also
derived for the periodic balanced sorting network, by applying this technique
to suitable subnetworks. Finally, we demonstrate that the switching potential
captures communication requirements even in computational models different from
BSP, such as the I/O model and the LPRAM.
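
The switching potential is concrete enough to compute by brute force on small instances: enumerate every setting of the 2x2 switches and count the distinct input-output permutations realized. The network encoding and the 4-wire Benes-style example below are illustrative; bounds of the paper's flavor grow with the logarithm of this count, which is why rearrangeable networks such as sorting networks and the FFT are the hardest cases.

    from itertools import product

    def switching_potential(n, layers):
        """Count the distinct permutations a switching network realizes.
        `layers` is a list of layers; each layer is a list of (i, j) pairs,
        each a 2x2 switch that either passes or swaps wires i and j."""
        switches = [sw for layer in layers for sw in layer]
        perms = set()
        for setting in product((False, True), repeat=len(switches)):
            wires = list(range(n))       # wires[k] = input currently on wire k
            for crossed, (i, j) in zip(setting, switches):
                if crossed:
                    wires[i], wires[j] = wires[j], wires[i]
            perms.add(tuple(wires))
        return len(perms)

    # A 4-wire, 3-layer Benes-style network; being rearrangeable, it
    # realizes all 4! = 24 permutations, the maximum switching potential.
    layers = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(0, 1), (2, 3)]]
    print(switching_potential(4, layers))     # -> 24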
Query DAGs: A Practical Paradigm for Implementing Belief-Network Inference
We describe a new paradigm for implementing inference in belief networks,
which consists of two steps: (1) compiling a belief network into an arithmetic
expression called a Query DAG (Q-DAG); and (2) answering queries using a simple
evaluation algorithm. Each node of a Q-DAG represents a numeric operation, a
number, or a symbol for evidence. Each leaf node of a Q-DAG represents the
answer to a network query, that is, the probability of some event of interest.
It appears that Q-DAGs can be generated using any of the standard algorithms
for exact inference in belief networks (we show how they can be generated using
clustering and conditioning algorithms). The time and space complexity of a
Q-DAG generation algorithm is no worse than the time complexity of the
inference algorithm on which it is based. The complexity of a Q-DAG evaluation
algorithm is linear in the size of the Q-DAG, and such inference amounts to a
standard evaluation of the arithmetic expression it represents. The intended
value of Q-DAGs is in reducing the software and hardware resources required to
utilize belief networks in on-line, real-world applications. The proposed
framework also facilitates the development of on-line inference on different
software and hardware platforms due to the simplicity of the Q-DAG evaluation
algorithm. Interestingly enough, Q-DAGs were found to serve other purposes:
simple techniques for reducing Q-DAGs tend to subsume relatively complex
optimization techniques for belief-network inference, such as network-pruning
and computation-caching.
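
A toy version of the evaluation step is sketched below: a Q-DAG over numeric constants, evidence indicators, and +/* operation nodes, evaluated bottom-up. The node classes and the two-variable network (A with P(A=t)=0.3; B with P(B=t|A=t)=0.9 and P(B=t|A=f)=0.2) are illustrative, not the paper's compilation output.

    import math
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Num:            # a numeric parameter, e.g. a CPT entry
        value: float

    @dataclass(frozen=True)
    class Ind:            # evidence indicator for "variable var has state state"
        var: str
        state: str

    @dataclass(frozen=True)
    class Op:             # '+' or '*' over child nodes
        op: str
        children: tuple

    def evaluate(node, evidence):
        """Plain arithmetic-expression evaluation: an indicator is 1 when its
        state is consistent with the observed evidence (or the variable is
        unobserved), and 0 otherwise."""
        if isinstance(node, Num):
            return node.value
        if isinstance(node, Ind):
            return 1.0 if evidence.get(node.var) in (None, node.state) else 0.0
        vals = [evaluate(c, evidence) for c in node.children]
        return sum(vals) if node.op == '+' else math.prod(vals)

    # Q-DAG for the (unnormalized) query P(B = t): sum over A's states of
    # P(A) * indicator(A) * P(B = t | A).
    q = Op('+', (
        Op('*', (Num(0.3), Ind('A', 't'), Num(0.9))),
        Op('*', (Num(0.7), Ind('A', 'f'), Num(0.2))),
    ))
    print(evaluate(q, {}))          # -> 0.41, the prior P(B = t)
    print(evaluate(q, {'A': 't'}))  # -> 0.27, the joint P(B = t, A = t)

A real evaluator would memoize shared subgraphs, so that the cost stays linear in the size of the Q-DAG rather than in the unrolled expression tree.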
Partition MCMC for inference on acyclic digraphs
Acyclic digraphs are the underlying representation of Bayesian networks, a
widely used class of probabilistic graphical models. Learning the underlying
graph from data is a way of gaining insights about the structural properties of
a domain. Structure learning forms one of the inference challenges of
statistical graphical models.
MCMC methods which sample graphs from the posterior distribution given the
data, notably structure MCMC, are probably the only viable option for Bayesian
model averaging. Score modularity and restrictions on the number of parents of
each node allow the graphs to be grouped into larger collections, which can be
scored as a whole to improve the chain's convergence. Current examples of
algorithms taking advantage of grouping are the biased order MCMC, which acts
on the alternative space of permuted triangular matrices, and non-ergodic edge
reversal moves.
Here we propose a novel algorithm, which employs the underlying combinatorial
structure of DAGs to define a new grouping. As a result, convergence is improved
compared to structure MCMC, while still retaining the property of producing an
unbiased sample. Finally, the method can be combined with edge reversal moves to
improve the sampler further.
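
For contrast with the partition approach, here is roughly what a single-edge structure MCMC baseline looks like. The toggle proposal, acyclicity check, and placeholder score below are a minimal sketch, not the algorithm proposed in the paper.

    import math
    import random

    def is_acyclic(adj):
        """Depth-first cycle check on an adjacency dict {node: set(children)}."""
        color = {v: 0 for v in adj}        # 0 unvisited, 1 on stack, 2 done
        def dfs(v):
            color[v] = 1
            for w in adj[v]:
                if color[w] == 1 or (color[w] == 0 and not dfs(w)):
                    return False
            color[v] = 2
            return True
        return all(dfs(v) for v in adj if color[v] == 0)

    def structure_mcmc(nodes, log_score, steps, seed=0):
        """Baseline structure MCMC: toggle one directed edge per step, reject
        moves that create a cycle, accept with the Metropolis ratio (the
        toggle proposal is symmetric, so the ratio is just the score ratio)."""
        rng = random.Random(seed)
        adj = {v: set() for v in nodes}
        current, samples = log_score(adj), []
        for _ in range(steps):
            u, v = rng.sample(nodes, 2)
            adj[u] ^= {v}                              # toggle edge u -> v
            if is_acyclic(adj):
                delta = log_score(adj) - current
                accept = rng.random() < math.exp(min(0.0, delta))
            else:
                accept = False
            if accept:
                current += delta
            else:
                adj[u] ^= {v}                          # reject: undo the toggle
            samples.append({w: frozenset(c) for w, c in adj.items()})
        return samples

    # Toy score favoring sparse graphs; a real score would be modular, e.g. BDe.
    sparse = lambda adj: -0.5 * sum(len(c) for c in adj.values())
    chain = structure_mcmc(list("abc"), sparse, steps=1000)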
Quantum Algorithm for Dynamic Programming Approach for DAGs. Applications for Zhegalkin Polynomial Evaluation and Some Problems on DAGs
In this paper, we present a quantum algorithm for the dynamic programming
approach for problems on directed acyclic graphs (DAGs). The running time of
the algorithm is $O(\sqrt{\hat{n}m}\log \hat{n})$, and the running time of the
best known deterministic algorithm is $O(n+m)$, where $n$ is the number of
vertices, $\hat{n}$ is the number of vertices with at least one outgoing edge,
and $m$ is the number of edges. We show that we can solve problems that use OR,
AND, NAND, MAX and MIN functions as the main transition steps. The approach is
useful for several problems. One of them is evaluating a Boolean formula
represented by a Zhegalkin polynomial, or a Boolean circuit with shared inputs
and non-constant depth. The other two are the single-source longest-paths
search for weighted DAGs and the diameter search problem for unweighted DAGs.
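
For reference, the deterministic baseline is the textbook linear-time dynamic programming over a topological order, shown below for the single-source longest-path case with MAX as the transition function (the function name and input encoding are illustrative).

    from collections import deque

    def longest_paths(n, weighted_edges, source):
        """O(n + m) dynamic programming on a DAG: peel off vertices of
        in-degree 0 (Kahn's topological order) and combine predecessor
        values with MAX."""
        adj = [[] for _ in range(n)]
        indeg = [0] * n
        for u, v, w in weighted_edges:
            adj[u].append((v, w))
            indeg[v] += 1
        dist = [float("-inf")] * n
        dist[source] = 0.0
        queue = deque(u for u in range(n) if indeg[u] == 0)
        while queue:
            u = queue.popleft()
            for v, w in adj[u]:
                if dist[u] > float("-inf"):          # v reachable through u
                    dist[v] = max(dist[v], dist[u] + w)
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
        return dist     # dist[v] = weight of the heaviest source -> v path

    edges = [(0, 1, 2.0), (0, 2, 1.0), (1, 3, 4.0), (2, 3, 10.0)]
    print(longest_paths(4, edges, source=0))  # -> [0.0, 2.0, 1.0, 11.0]

Roughly speaking, the quantum algorithm keeps this topological sweep but accelerates each vertex's combination over its incoming edges with quantum search; that summary is our paraphrase, not the paper's exact construction.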
Parallel algorithms and concentration bounds for the Lovasz Local Lemma via witness DAGs
The Lovász Local Lemma (LLL) is a cornerstone principle in the
probabilistic method of combinatorics, and a seminal algorithm of Moser &
Tardos (2010) provides an efficient randomized algorithm to implement it. This
can be parallelized to give an algorithm that uses polynomially many processors
and runs in $O(\log^3 n)$ time on an EREW PRAM, stemming from $O(\log n)$
adaptive computations of a maximal independent set (MIS). Chung et al. (2014)
developed faster local and parallel algorithms, potentially running in time
$O(\log^2 n)$, but these algorithms require more stringent conditions than the
LLL.
We give a new parallel algorithm that works under essentially the same
conditions as the original algorithm of Moser & Tardos but uses only a single
MIS computation, thus running in $O(\log^2 n)$ time on an EREW PRAM. This can
be derandomized to give an NC algorithm running in $O(\log^2 n)$ time as well,
speeding up a previous NC LLL algorithm of Chandrasekaran et al. (2013).
We also provide improved and tighter bounds on the run-times of the
sequential and parallel resampling-based algorithms originally developed by
Moser & Tardos. These apply to any problem instance in which the tighter
Shearer LLL criterion is satisfied.
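
As a point of reference for the resampling framework, here is a compact sketch of the sequential Moser & Tardos algorithm. The event encoding and the coin-flip demo are illustrative; the paper's parallel variants instead resample a maximal independent set of violated events per round.

    import random

    def moser_tardos(num_vars, bad_events, sample_var, rng=None):
        """Sequential Moser-Tardos resampling. `bad_events` is a list of
        (predicate, dependent-variable-indices) pairs; while some predicate
        holds, redraw exactly the variables that event depends on. Under the
        LLL (or Shearer) conditions this terminates quickly in expectation."""
        rng = rng or random.Random(0)
        x = [sample_var(i, rng) for i in range(num_vars)]
        while True:
            violated = next((vs for pred, vs in bad_events if pred(x)), None)
            if violated is None:
                return x                      # no bad event holds: done
            for i in violated:
                x[i] = sample_var(i, rng)     # resample only that event's vars

    # Toy instance over three fair coin flips with two overlapping bad events.
    events = [
        (lambda x: x[0] == x[1] == 0, (0, 1)),   # bad: x0 = x1 = 0
        (lambda x: x[1] == x[2] == 1, (1, 2)),   # bad: x1 = x2 = 1
    ]
    print(moser_tardos(3, events, lambda i, rng: rng.randint(0, 1)))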
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
The nested parallel (a.k.a. fork-join) model is widely used for writing
parallel programs. However, the two composition constructs, i.e. "$\parallel$"
(parallel) and "$;$" (serial), are insufficient in expressing "partial
dependencies" or "partial parallelism" in a program. We propose a new dataflow
composition construct "$\leadsto$" to express partial dependencies in
algorithms in a processor- and cache-oblivious way, thus extending the Nested
Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign
several divide-and-conquer algorithms ranging from dense linear algebra to
dynamic-programming in the ND model and prove that they all have optimal span
while retaining optimal cache complexity. We propose the design of runtime
schedulers that map ND programs to multicore processors with multiple levels of
possibly shared caches (i.e., Parallel Memory Hierarchies) and provide
theoretical guarantees on their ability to preserve locality and load balance.
For this, we adapt space-bounded (SB) schedulers for the ND model. We show that
our algorithms have increased "parallelizability" in the ND model, and that SB
schedulers can use the extra parallelizability to achieve asymptotically
optimal bounds on cache misses and running time on a greater number of
processors than in the NP model. The running time for the algorithms in this
paper is $O\left(\sum_{i=0}^{h-1} Q^{*}(t; \sigma \cdot M_i) \cdot C_i / p\right)$,
where $Q^{*}$ is the cache complexity of task $t$, $C_i$ is the cost of a cache
miss at the level-$i$ cache which is of size $M_i$, $\sigma \in (0,1)$ is a
constant, and $p$ is the number of processors in an $h$-level cache hierarchy.
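
To make the distinction concrete: with only fork-join composition, a second stage must wait for all of the first, whereas a dataflow construct lets each task start once its own inputs are ready. A loose emulation with futures follows; run_dataflow, the two stages, and the deps encoding are all hypothetical, and a real ND scheduler would track dependence counts rather than blocking worker threads.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def run_dataflow(stage_a, stage_b, deps):
        """stage_a: zero-argument tasks. stage_b[j] consumes the results of
        the A-tasks listed in deps[j]. Each B-task waits only on its own
        inputs (a partial dependency) instead of the full barrier that the
        serial composition of two parallel stages would force."""
        with ThreadPoolExecutor() as pool:
            a = [pool.submit(f) for f in stage_a]
            b = [pool.submit(lambda j=j: stage_b[j](*(a[i].result()
                                                      for i in deps[j])))
                 for j in range(len(stage_b))]
            return [f.result() for f in b]

    out = run_dataflow(
        stage_a=[lambda: 1, lambda: (time.sleep(0.5), 2)[1]],   # A1 is slow
        stage_b=[lambda a0: a0 + 10, lambda a1: a1 + 20],
        deps=[(0,), (1,)],
    )
    print(out)   # [11, 22]; B0 finished long before the slow A1 completed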
Uniform random generation of large acyclic digraphs
Directed acyclic graphs are the basic representation of the structure
underlying Bayesian networks, which represent multivariate probability
distributions. In many practical applications, such as the reverse engineering
of gene regulatory networks, not only the estimation of model parameters but
the reconstruction of the structure itself is of great interest. A uniform
sample from the space of directed acyclic graphs is also required to assess
different structure learning algorithms in simulation studies and to evaluate
the prevalence of certain structural features.
to sample acyclic digraphs uniformly at random through recursive enumeration,
an approach previously thought too computationally involved. Based on
complexity considerations, we discuss in particular how the enumeration
directly provides an exact method, which avoids the convergence issues of the
alternative Markov chain methods and is actually computationally much faster.
The limiting behaviour of the distribution of acyclic digraphs then allows us
to sample arbitrarily large graphs. Building on the ideas of recursive
enumeration based sampling we also introduce a novel hybrid Markov chain with
much faster convergence than current alternatives while still being easy to
adapt to various restrictions. Finally we discuss how to include such
restrictions in the combinatorial enumeration and the new hybrid Markov chain
method for efficient uniform sampling of the corresponding graphs.
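
The recursive enumeration rests on a classical recurrence that counts labeled DAGs by their number of outpoints (nodes with no incoming edges); a sampler then draws each choice with probability proportional to its term in the sum. The counting step is small enough to show; this sketch stops at counting, and the sampling pass and the paper's hybrid chain are not included.

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def dags_with_outpoints(n, k):
        """Number of labeled DAGs on n nodes with exactly k outpoints.
        Removing the k outpoints leaves a DAG with some s outpoints, each of
        which must receive at least one edge from the removed nodes
        (2^k - 1 choices); the other n - k - s nodes receive arbitrary
        subsets of edges from them (2^k choices each)."""
        if k == n:
            return 1                      # the empty graph on n nodes
        if not 1 <= k < n:
            return 0
        m = n - k
        return comb(n, k) * sum(
            (2**k - 1)**s * 2**(k * (m - s)) * dags_with_outpoints(m, s)
            for s in range(1, m + 1)
        )

    def num_dags(n):
        return sum(dags_with_outpoints(n, k) for k in range(1, n + 1))

    print([num_dags(n) for n in range(1, 6)])   # [1, 3, 25, 543, 29281]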