Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
The nested parallel (a.k.a. fork-join) model is widely used for writing
parallel programs. However, its two composition constructs, parallel
composition and serial composition, are insufficient for expressing "partial
dependencies" or "partial parallelism" in a program. We propose a new dataflow
composition construct to express partial dependencies in
algorithms in a processor- and cache-oblivious way, thus extending the Nested
Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign
several divide-and-conquer algorithms ranging from dense linear algebra to
dynamic-programming in the ND model and prove that they all have optimal span
while retaining optimal cache complexity. We propose the design of runtime
schedulers that map ND programs to multicore processors with multiple levels of
possibly shared caches (i.e., Parallel Memory Hierarchies) and provide
theoretical guarantees on their ability to preserve locality and load balance.
For this, we adapt space-bounded (SB) schedulers for the ND model. We show that
our algorithms have increased "parallelizability" in the ND model, and that SB
schedulers can use the extra parallelizability to achieve asymptotically
optimal bounds on cache misses and running time on a greater number of
processors than in the NP model. The running time for the algorithms in this
paper is $O\left(\frac{\sum_{i=0}^{h-1} Q^{*}(t;\sigma\cdot M_i)\cdot C_i}{p}\right)$,
where $Q^{*}(t;\sigma\cdot M_i)$ is the cache complexity of task $t$,
$C_i$ is the cost of a cache miss at the level-$i$ cache, which is of size $M_i$,
$\sigma$ is a constant, and $p$ is the number of processors in an
$h$-level cache hierarchy.
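The gap between fork-join composition and partial dependencies can be illustrated with ordinary futures. The sketch below is illustrative only and uses none of the paper's constructs or schedulers; the task names A-D and the two-stage shape are hypothetical:

```python
# Illustrative sketch: fork-join composition forces a full barrier between
# stages, while a dataflow formulation expresses only the true (partial)
# dependencies. Task names A, B, C, D are hypothetical.
from concurrent.futures import ThreadPoolExecutor, wait

log = []

def task(name):
    log.append(name)
    return name

with ThreadPoolExecutor(max_workers=2) as pool:
    # Fork-join: (A || B) ; (C || D) -- the serial composition is a barrier,
    # so C must wait for B to finish even if C only depends on A.
    wait([pool.submit(task, "A"), pool.submit(task, "B")])
    wait([pool.submit(task, "C"), pool.submit(task, "D")])

with ThreadPoolExecutor(max_workers=2) as pool:
    # Dataflow: only the partial dependencies A->C and B->D are expressed;
    # C may start as soon as A finishes, without waiting for B.
    fa = pool.submit(task, "A")
    fb = pool.submit(task, "B")
    fc = pool.submit(lambda: (fa.result(), task("C"))[1])
    fd = pool.submit(lambda: (fb.result(), task("D"))[1])
    wait([fc, fd])
```

In the dataflow version the extra parallelism comes precisely from removing the barrier: the only ordering constraints that remain are the true data dependencies.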
Parallelization of dynamic programming recurrences in computational biology
The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15x and 130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3-GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors.
Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
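For reference, the Nussinov recurrence that the abstract parallelizes in hardware is compact enough to state directly. The following is a plain sequential Python sketch of the standard recurrence (not the FPGA design); `N[i][j]` is the maximum number of complementary base pairs in the subsequence `s[i..j]`:

```python
# Sequential reference implementation of the Nussinov recurrence:
#   N(i, j) = max( N(i+1, j-1) + pair(i, j),
#                  max over i <= k < j of N(i, k) + N(k+1, j) )
# The bifurcation term (max over k) subsumes the N(i+1, j) and N(i, j-1) cases.
def nussinov(s, min_loop=0):
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(s)
    N = [[0] * n for _ in range(n)]
    for span in range(1, n):              # fill by increasing subsequence length
        for i in range(n - span):
            j = i + span
            paired = 1 if (s[i], s[j]) in pairs and j - i > min_loop else 0
            best = N[i + 1][j - 1] + paired
            best = max(best, max(N[i][k] + N[k + 1][j] for k in range(i, j)))
            N[i][j] = best
    return N[0][n - 1]
```

The anti-diagonal fill order makes the fine-grained parallelism visible: every cell on one diagonal depends only on earlier diagonals, which is exactly the structure a polyhedral analysis exploits when mapping the kernel onto a systolic array.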
Compiling Recurrences over Dense and Sparse Arrays
Recurrence equations lie at the heart of many computational paradigms
including dynamic programming, graph analysis, and linear solvers. These
equations are often expensive to compute and much work has gone into optimizing
them for different situations. The set of recurrence implementations is a large
design space across the set of all recurrences (e.g., the Viterbi and
Floyd-Warshall algorithms), the choice of data structures (e.g., dense and
sparse matrices), and the set of different loop orders. Optimized library
implementations do not exist for most points in this design space, and
developers must therefore often manually implement and optimize recurrences. We
present a general framework for compiling recurrence equations into native code
corresponding to any valid point in this general design space. In this
framework, users specify a system of recurrences, the type of data structures
for storing the input and outputs, and a set of scheduling primitives for
optimization. A greedy algorithm then takes this specification and lowers it
into a native program that respects the dependencies inherent to the recurrence
equation. We describe the compiler transformations necessary to lower this
high-level specification into native parallel code for either sparse or dense
data structures and provide an algorithm for determining whether the recurrence
system is solvable with the provided scheduling primitives. We evaluate the
performance and correctness of the generated code on various computational
tasks from domains including dense and sparse matrix solvers, dynamic
programming, graph problems, and sparse tensor algebra. We demonstrate that
the generated code achieves performance competitive with handwritten library
implementations.
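One fixed point in the design space the abstract describes, written by hand for comparison: the Floyd-Warshall recurrence over a dense matrix, with the canonical `k`-outermost loop order that respects its dependencies (this sketch is not the paper's compiler output):

```python
# Floyd-Warshall as a recurrence over a dense array:
#   dist_k(i, j) = min(dist_{k-1}(i, j), dist_{k-1}(i, k) + dist_{k-1}(k, j))
# The k loop must be outermost; i and j may be reordered or parallelized.
INF = float("inf")

def floyd_warshall(dist):
    """In-place all-pairs shortest paths; dist[i][j] is the edge weight (INF if absent)."""
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```

Swapping the dense matrix for a sparse format, or choosing a different valid loop order, yields another point in the same design space; a recurrence compiler's job is to generate any such point from one specification.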
Polymorphic dynamic programming by algebraic shortcut fusion
Dynamic programming (DP) is a broadly applicable algorithmic design paradigm
for the efficient, exact solution of otherwise intractable, combinatorial
problems. However, the design of such algorithms is often presented informally
in an ad-hoc manner, and as a result is often difficult to apply correctly. In
this paper, we present a rigorous algebraic formalism for systematically
deriving novel DP algorithms, either from existing DP algorithms or from simple
functional recurrences. These derivations lead to algorithms which are provably
correct and polymorphic over any semiring, which means that they can be applied
to the full scope of combinatorial problems expressible in terms of semirings.
This includes, for example: optimization, optimal probability and Viterbi
decoding, probabilistic marginalization, logical inference, fuzzy sets,
differentiable softmax, and relational and provenance queries. The approach,
building on many ideas from the existing literature on constructive
algorithmics, exploits generic properties of (semiring) polymorphic functions,
tupling and formal sums (lifting), and algebraic simplifications arising from
constraint algebras. We demonstrate the effectiveness of this formalism for
some example applications arising in signal processing, bioinformatics and
reliability engineering.
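The payoff of semiring polymorphism can be seen in miniature (this sketch is not the paper's formalism): one DP written once and instantiated with different semirings. Below, the standard forward/Viterbi chain recurrence is parameterized by `plus`, `times`, and `zero`; the sum-product instance computes a marginal probability and the max-product instance computes the Viterbi (best-path) probability:

```python
# One semiring-polymorphic DP over a chain:
#   alpha[t][s] = ((+)_p alpha[t-1][p] (*) trans[p][s]) (*) emit[s][obs[t]]
# Instantiating (+, *, 0) with different semirings changes what is computed.
import operator

def chain_dp(init, trans, emit, obs, plus, times, zero):
    states = range(len(init))
    alpha = [times(init[s], emit[s][obs[0]]) for s in states]
    for o in obs[1:]:
        nxt = []
        for s in states:
            acc = zero
            for p in states:
                acc = plus(acc, times(alpha[p], trans[p][s]))
            nxt.append(times(acc, emit[s][o]))
        alpha = nxt
    total = zero
    for s in states:
        total = plus(total, alpha[s])
    return total

def forward(init, trans, emit, obs):
    # sum-product semiring: total probability of the observation sequence
    return chain_dp(init, trans, emit, obs, operator.add, operator.mul, 0.0)

def viterbi(init, trans, emit, obs):
    # max-product semiring: probability of the single best state path
    return chain_dp(init, trans, emit, obs, max, operator.mul, 0.0)
```

The same body would serve min-plus optimization, Boolean reachability, or counting, provided the chosen `(plus, times, zero)` satisfy the semiring laws; that is the correctness condition the algebraic derivations in the paper exploit.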
D2P: Automatically Creating Distributed Dynamic Programming Codes
Dynamic Programming (DP) algorithms are common targets for parallelization, and, as these algorithms are applied to larger inputs, distributed implementations become necessary. However, creating distributed-memory solutions involves the challenges of task creation, program and data partitioning, communication optimization, and task scheduling. In this paper we present D2P, an end-to-end system for automatically transforming a specification of any recursive DP algorithm into a distributed-memory implementation of the algorithm. Given the pseudo-code of a recursive DP algorithm, D2P automatically generates the corresponding MPI-based implementation. Our evaluation of the generated distributed implementations shows that they are efficient and scalable. Moreover, D2P-generated implementations are faster than implementations generated by recent general distributed DP frameworks, and are competitive with (and often faster than) hand-written implementations.
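A hypothetical sketch of the kind of task decomposition such a system must derive (D2P's actual partitioning and MPI code generation are not reproduced here): tile a 2-D DP table, here longest common subsequence, into blocks and process the blocks wavefront by wavefront, since block (bi, bj) depends only on its left, upper, and upper-left neighbors:

```python
# Blocked wavefront evaluation of the LCS table. All blocks on one
# anti-diagonal are mutually independent, so in a distributed setting each
# wave could be assigned to different ranks, exchanging only block borders.
def lcs_blocked(a, b, tile=4):
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    bi_max = (n + tile - 1) // tile
    bj_max = (m + tile - 1) // tile
    for wave in range(bi_max + bj_max - 1):          # anti-diagonal of blocks
        for bi in range(max(0, wave - bj_max + 1), min(bi_max, wave + 1)):
            bj = wave - bi                           # blocks on a wave are independent
            for i in range(bi * tile + 1, min((bi + 1) * tile, n) + 1):
                for j in range(bj * tile + 1, min((bj + 1) * tile, m) + 1):
                    T[i][j] = (T[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                               else max(T[i - 1][j], T[i][j - 1]))
    return T[n][m]
```

The hard parts a generator like D2P automates are exactly what this toy omits: mapping blocks to ranks, communicating the shared row and column borders, and scheduling waves across machines.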
Compiling a domain specific language for dynamic programming
Steffen P. Compiling a domain specific language for dynamic programming. Bielefeld (Germany): Bielefeld University; 2006