A Parallel Algorithm for Exact Bayesian Structure Discovery in Bayesian Networks
Exact Bayesian structure discovery in Bayesian networks requires exponential
time and space. Using dynamic programming (DP), the fastest known sequential
algorithm computes the exact posterior probabilities of structural features in
time and space, if the number of nodes (variables) in the
Bayesian network is and the in-degree (the number of parents) per node is
bounded by a constant . Here we present a parallel algorithm capable of
computing the exact posterior probabilities for all edges with optimal
parallel space efficiency and nearly optimal parallel time efficiency. That is,
if processors are used, the run-time reduces to
and the space usage becomes per
processor. Our algorithm is based on the observation that the subproblems in the
sequential DP algorithm constitute a - hypercube. We take a delicate way
to coordinate the computation of correlated DP procedures such that a large
amount of data exchange is suppressed. Further, we develop parallel techniques
for two variants of the well-known zeta transform, which have
applications outside the context of Bayesian networks. We demonstrate the
capability of our algorithm on datasets with up to 33 variables and its
scalability on up to 2048 processors. We apply our algorithm to a biological
data set for discovering the yeast pheromone response pathways.
Comment: 32 pages, 12 figures
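The zeta transform mentioned in this abstract can be illustrated with a minimal sketch. The code below implements the standard sequential subset-sum zeta transform over bitmask-indexed sets; it is not the paper's parallel method, nor either of the two variants the abstract refers to. The function names `zeta_transform` and `zeta_naive` are hypothetical, chosen for illustration only.

```python
def zeta_transform(f, n):
    """Return g with g[S] = sum of f[T] over all subsets T of S.

    Sets over n elements are encoded as bitmasks, so f has 2**n entries.
    This is the textbook sequential version, shown only as an illustration.
    """
    if len(f) != 1 << n:
        raise ValueError("f must have 2**n entries")
    g = list(f)
    for i in range(n):                 # one pass per element (one hypercube dimension)
        bit = 1 << i
        for mask in range(1 << n):
            if mask & bit:             # fold in the neighbour that lacks element i
                g[mask] += g[mask ^ bit]
    return g


def zeta_naive(f, n):
    """Direct definition, exponentially slower; used only to cross-check."""
    size = 1 << n
    return [sum(f[t] for t in range(size) if t & s == t) for s in range(size)]
```

Each pass combines pairs of entries whose bitmasks differ in a single bit, that is, neighbours along one dimension of the subset hypercube. The fast version performs n * 2**n additions, whereas the direct definition examines every pair of sets.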
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
The nested parallel (a.k.a. fork-join) model is widely used for writing
parallel programs. However, the two composition constructs, i.e. ""
(parallel) and "" (serial), are insufficient in expressing "partial
dependencies" or "partial parallelism" in a program. We propose a new dataflow
composition construct "" to express partial dependencies in
algorithms in a processor- and cache-oblivious way, thus extending the Nested
Parallel (NP) model to the Nested Dataflow (ND) model. We redesign
several divide-and-conquer algorithms ranging from dense linear algebra to
dynamic programming in the ND model and prove that they all have optimal span
while retaining optimal cache complexity. We propose the design of runtime
schedulers that map ND programs to multicore processors with multiple levels of
possibly shared caches (i.e., Parallel Memory Hierarchies) and provide
theoretical guarantees on their ability to preserve locality and load balance.
For this, we adapt space-bounded (SB) schedulers for the ND model. We show that
our algorithms have increased "parallelizability" in the ND model, and that SB
schedulers can use the extra parallelizability to achieve asymptotically
optimal bounds on cache misses and running time on a greater number of
processors than in the NP model. The running time for the algorithms in this
paper is , where is the cache complexity of task ,
is the cost of cache miss at level- cache which is of size ,
is a constant, and is the number of processors in an
-level cache hierarchy
Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries
This work introduces a runtime model for managing communication with support
for latency-hiding. The model enables non-computer science researchers to
exploit communication latency-hiding techniques seamlessly. For compiled
languages, it is often possible to create efficient schedules for
communication, but this is not the case for interpreted languages. By
maintaining data dependencies between scheduled operations, it is possible to
aggressively initiate communication and lazily evaluate tasks to allow maximal
time for the communication to finish before entering a wait state. We implement
a heuristic of this model in DistNumPy, an auto-parallelizing version of
numerical Python that allows sequential NumPy programs to run on distributed
memory architectures. Furthermore, we present performance comparisons for eight
benchmarks with and without automatic latency-hiding. The results show that
our model reduces the time spent waiting for communication by as much as 27
times, from a maximum of 54% to only 2% of the total execution time, in a
stencil application.
Comment: PREPRINT
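The idea of initiating communication aggressively and waiting as late as possible can be sketched in a few lines. The example below uses Python's standard `concurrent.futures` module and is only an illustration of the general pattern; it is not DistNumPy code and does not use its API. `fetch_remote_block`, `local_compute`, and `run_with_latency_hiding` are hypothetical names, and the "communication" is simulated with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_remote_block(block_id, delay=0.05):
    """Hypothetical stand-in for a communication operation (simulated by sleeping)."""
    time.sleep(delay)
    return [block_id] * 4


def local_compute(values):
    """Purely local work: sum of squares."""
    return sum(v * v for v in values)


def run_with_latency_hiding(block_ids, local_data):
    """Start all transfers first, compute locally, and wait only when data is needed."""
    with ThreadPoolExecutor(max_workers=max(1, len(block_ids))) as pool:
        # Aggressively initiate every communication operation up front.
        pending = [pool.submit(fetch_remote_block, b) for b in block_ids]
        # Work that does not depend on remote data overlaps with the transfers.
        local_result = local_compute(local_data)
        # Enter the wait state as late as possible, once results are required.
        remote_result = sum(local_compute(fut.result()) for fut in pending)
    return local_result + remote_result
```

Because the transfers are already in flight while `local_compute(local_data)` runs, the time spent blocked in `fut.result()` shrinks by however much local work was available to overlap.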
Hybrid static/dynamic scheduling for already optimized dense matrix factorization
We present the use of a hybrid static/dynamic scheduling strategy of the task
dependency graph for direct methods used in dense numerical linear algebra.
This strategy provides a balance of data locality, load balance, and low
dequeue overhead. We show that using this scheduling strategy in
communication-avoiding dense factorization leads to significant performance
gains. On a 48-core AMD Opteron NUMA machine, our experiments show that we can
achieve up to 64% improvement over a version of CALU that uses fully dynamic
scheduling, and up to 30% improvement over the version of CALU that uses fully
static scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic
scheduling approach is up to 8% faster than the versions of CALU that use
fully static or fully dynamic scheduling. Our algorithm leads to
speedups over the corresponding routines for computing LU factorization in
well-known libraries. On the 48-core AMD NUMA machine, our best implementation
is up to 110% faster than MKL, while on the 16-core Intel Xeon machine, it is
up to 82% faster than MKL. Our approach also shows significant speedups
compared with PLASMA on both of these systems.
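A toy simulation can illustrate the load-balance side of a hybrid static/dynamic split. The sketch below is not the CALU scheduler: it assumes independent tasks with known costs, ignores the task dependency graph, data locality, and dequeue overhead, and uses a hypothetical function `hybrid_schedule` with a hypothetical `static_fraction` parameter.

```python
import heapq


def hybrid_schedule(costs, workers, static_fraction):
    """Simulate a hybrid schedule and return (makespan, assignment).

    The first int(len(costs) * static_fraction) tasks are assigned round-robin
    ahead of time (static part). The remaining tasks are taken from a shared
    queue by whichever worker becomes free first (dynamic part).
    """
    if workers < 1 or not 0.0 <= static_fraction <= 1.0:
        raise ValueError("need workers >= 1 and 0 <= static_fraction <= 1")
    n_static = int(len(costs) * static_fraction)
    finish = [0.0] * workers
    assignment = [None] * len(costs)
    for t in range(n_static):                      # static part: fixed owner
        w = t % workers
        finish[w] += costs[t]
        assignment[t] = w
    heap = [(finish[w], w) for w in range(workers)]
    heapq.heapify(heap)
    for t in range(n_static, len(costs)):          # dynamic part: earliest-free worker
        ready, w = heapq.heappop(heap)
        assignment[t] = w
        heapq.heappush(heap, (ready + costs[t], w))
    makespan = max(ready for ready, _ in heap)
    return makespan, assignment
```

With costs `[4, 1, 1, 1, 1]` on two workers, a fully static round-robin assignment leaves one worker with three tasks, while leaving part of the work in the dynamic queue lets the idle worker absorb the small tasks. A real scheduler would also weigh the locality benefit of the static part against the queue overhead of the dynamic part, which this simulation does not model.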
Parallel Algorithms for Bayesian Networks Structure Learning with Applications in Systems Biology
The expression levels of thousands to tens of thousands of genes in a living cell are controlled by internal and external cues which act in a combinatorial manner that can be modeled as a network. High-throughput technologies, such as DNA microarrays and next-generation sequencing, allow for the measurement of gene expression levels on a whole-genome scale. In recent years, a wealth of microarray data probing gene expression under various biological conditions has been accumulated in public repositories, which facilitates uncovering the underlying transcriptional networks (gene networks). Due to the high data dimensionality and inherent complexity of gene interactions, this task inevitably requires automated computational approaches.
Various models have been proposed for learning gene networks, with Bayesian networks (BNs) showing promise for the task. However, BN structure learning is an NP-hard problem, and both exact and heuristic methods are computationally intensive with limited ability to produce large networks. To address these issues, we developed a set of parallel algorithms. First, we present a communication-efficient parallel algorithm for exact BN structure learning, which is work-optimal provided that 2^n > p·log(p), where n is the total number of variables and p is the number of processors. This algorithm has space complexity within 1.41 of the optimal. Our empirical results demonstrate near-perfect scaling on up to 2,048 processors. We further extend this work to the case of bounded node in-degree, where a limit d on the number of parents per variable is imposed. We characterize the algorithm's run-time behavior as a function of d, establishing the range [n/3 - log(mn), ceil(n/2)) of values for d where it affects performance. Consequently, two plateau regions are identified: for d < n/3 - log(mn), where the run-time complexity remains the same as for d=1, and for d >= ceil(n/2), where the run-time complexity remains the same as for d=n-1. Finally, we present a parallel heuristic approach for large-scale BN learning. This approach aims to combine the precision of exact learning with the scalability of heuristic methods. Our empirical results demonstrate good scaling on various high-performance platforms. The quality of the learned networks for both exact and heuristic methods is evaluated using synthetically generated expression data. The biological relevance of the networks learned by the exact algorithm is assessed by applying it to the carotenoid biosynthesis pathway in Arabidopsis thaliana.
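The work-optimality condition stated in this abstract, 2^n > p·log(p), is easy to check numerically. The helper below is a minimal sketch: the function name `is_work_optimal_regime` is hypothetical, and the use of a base-2 logarithm is an assumption, since the abstract does not state the base.

```python
import math


def is_work_optimal_regime(n, p):
    """Return True when 2**n > p * log2(p).

    n is the number of variables and p the number of processors.
    Base-2 logarithm is an assumption; the abstract does not specify the base.
    """
    if n < 1 or p < 1:
        raise ValueError("n and p must be positive")
    return 2 ** n > p * math.log2(p)
```

Because 2^n grows so quickly, the condition holds for any realistic processor count once n reaches a few dozen variables, and it fails only when the processor count is large relative to a very small problem.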