
    EFFICIENT SCHEDULING OF DYNAMIC PROGRAMMING ALGORITHMS ON MULTICORE ARCHITECTURES

    Dynamic programming is one of the Berkeley 13 dwarfs, widely used for solving various combinatorial and optimization problems, including matrix chain multiplication, longest common subsequence (LCS), binary (0/1) knapsack and so on. Due to nonuniformity in the inherent dependence in dynamic programming algorithms, it becomes necessary to schedule the subproblems of dynamic programming effectively to processing cores for optimal utilization of multicore technology. The computational matrix of dynamic programming is divided into three parts: growing region, stable region and shrinking region, depending on whether the number of subproblems increases, remains stable or decreases uniformly phase by phase, respectively. We realize the parallel implementations of matrix chain multiplication, longest common subsequence and 0/1 knapsack on Intel Xeon X5650 and E5-2695 using OpenMP with different scheduling policies and adequate chunk sizes. It is concluded that, for the growing or the shrinking region of the dynamic programming parallelization adopted in this article, the guided schedule is better than the other scheduling schemes. A static or dynamic schedule is better for the stable region of dynamic programming. For dynamic programming approaches where all three regions are present, more speedup is achieved by applying the mixed scheduling approach than by applying a single scheduling technique to the entire computation. In LCS, approximately 20% more speedup is achieved using a mixed scheduling technique over the conventional single scheduling approach on Intel Xeon E5-2695.
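The mixed scheduling idea described in this abstract can be illustrated with a minimal sketch, assuming an anti-diagonal (wavefront) traversal of the LCS table. This is not the authors' code: the function name `lcs_wavefront`, the region boundaries, and the choice of `static` (rather than `dynamic`) for the stable region are assumptions made for illustration, and no chunk sizes are tuned. Each anti-diagonal `d = i + j` is one phase; its cell count grows, stays constant, then shrinks.

```c
#include <stdlib.h>
#include <string.h>

/* Row-major access into the (n+1) x (m+1) LCS table. */
#define CELL(i, j) dp[(size_t)(i) * (size_t)(m + 1) + (size_t)(j)]

/* Compute one LCS cell from its three already-finished neighbours. */
static void fill_cell(int *dp, const char *a, const char *b, int m, int i, int j)
{
    if (a[i - 1] == b[j - 1]) {
        CELL(i, j) = CELL(i - 1, j - 1) + 1;
    } else {
        int up = CELL(i - 1, j), left = CELL(i, j - 1);
        CELL(i, j) = up > left ? up : left;
    }
}

/* Hypothetical helper: LCS length via a wavefront sweep with mixed schedules.
 * Returns -1 if the table cannot be allocated. */
int lcs_wavefront(const char *a, const char *b)
{
    int n = (int)strlen(a), m = (int)strlen(b);
    int lo = n < m ? n : m, hi = n < m ? m : n;
    int *dp = calloc((size_t)(n + 1) * (size_t)(m + 1), sizeof *dp);
    if (!dp) return -1;

    for (int d = 2; d <= n + m; ++d) {           /* one phase per anti-diagonal */
        int i_min = d - m > 1 ? d - m : 1;
        int i_max = d - 1 < n ? d - 1 : n;
        if (d >= lo + 1 && d <= hi + 1) {
            /* Stable region: every phase has the same number of cells. */
            #pragma omp parallel for schedule(static)
            for (int i = i_min; i <= i_max; ++i)
                fill_cell(dp, a, b, m, i, d - i);
        } else {
            /* Growing (d <= lo) or shrinking (d > hi + 1) region. */
            #pragma omp parallel for schedule(guided)
            for (int i = i_min; i <= i_max; ++i)
                fill_cell(dp, a, b, m, i, d - i);
        }
    }
    int result = CELL(n, m);
    free(dp);
    return result;
}
```

Cells on the same anti-diagonal depend only on the two previous diagonals, so each inner loop is safe to parallelize. Compiled without OpenMP support the pragmas are ignored and the function runs serially with the same result.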

    D2P: Automatically Creating Distributed Dynamic Programming Codes

    Dynamic Programming (DP) algorithms are common targets for parallelization, and, as these algorithms are applied to larger inputs, distributed implementations become necessary. However, creating distributed-memory solutions involves the challenges of task creation, program and data partitioning, communication optimization, and task scheduling. In this paper we present D2P, an end-to-end system for automatically transforming a specification of any recursive DP algorithm into a distributed-memory implementation of the algorithm. When given pseudo-code of a recursive DP algorithm, D2P automatically generates the corresponding MPI-based implementation. Our evaluation of the generated distributed implementations shows that they are efficient and scalable. Moreover, D2P-generated implementations are faster than implementations generated by recent general distributed DP frameworks, and are competitive with (and often faster than) hand-written implementations.

    Deriving divide-and-conquer dynamic programming algorithms using solver-aided transformations

    We introduce a framework allowing domain experts to manipulate computational terms in the interest of deriving better, more efficient implementations. It employs deductive reasoning to generate provably correct efficient implementations from a very high-level specification of an algorithm, and inductive constraint-based synthesis to improve automation. Semantic information is encoded into program terms through the use of refinement types. In this paper, we develop the technique in the context of a system called Bellmania that uses solver-aided tactics to derive parallel divide-and-conquer implementations of dynamic programming algorithms that have better locality and are significantly more efficient than traditional loop-based implementations. Bellmania includes a high-level language for specifying dynamic programming algorithms and a calculus that facilitates gradual transformation of these specifications into efficient implementations. These transformations formalize the divide-and-conquer technique; a visualization interface helps users to interactively guide the process, while an SMT-based back-end verifies each step and takes care of low-level reasoning required for parallelism. We have used the system to generate provably correct implementations of several algorithms, including some important algorithms from computational biology, and show that the performance is comparable to that of the best manually optimized code. Funding: National Science Foundation (U.S.) (CCF-1139056); National Science Foundation (U.S.) (CCF-1439084); National Science Foundation (U.S.) (CNS-1553510).

    Optimization Techniques for Stencil Data Parallel Programs: Methodologies and Applications

    The optimization of data parallel programs is a challenging open problem. We analyzed in detail the optimization techniques for stencil computations, which are a subset of data parallel computations. Drawing from previous research, we developed a structured model to describe the program transformations. We used this model to compare the different optimizations presented in the literature and to study the interactions between them.

    Analytical cost metrics: days of future past

    2019 Summer. Includes bibliographical references. Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators, special-purpose hardware that will increase the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general purpose workstations, tablets, phones and other media devices. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. This work builds analytical cost models for cost metrics such as time, energy, memory access, and silicon area. These models are used to predict the performance of applications, for performance tuning, and for chip design. The idea is to work with domain-specific accelerators where analytical cost models can be accurately used for performance optimization. The performance optimization problems are formulated as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in a few ways. For stencil applications and GPU architectures, the analytical cost models are developed for execution time as well as energy. The models are used for performance tuning over existing architectures, and are coupled with silicon area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed form solutions for off-chip data movement are built and used to minimize the total data movement cost of a minimum op count tree
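For readers unfamiliar with the "minimum op count tree" mentioned for matrix chain products, the following is a minimal sketch of the textbook dynamic program that finds the minimum number of scalar multiplications for a chain. It does not model the off-chip data movement cost studied in this work; the function name `matrix_chain_min_ops` and the `dims` layout are assumptions made for illustration.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: minimum scalar-multiplication count for a chain of
 * n matrices, where matrix k has shape dims[k] x dims[k + 1] (so dims holds
 * n + 1 entries). Returns -1 if the cost table cannot be allocated. */
long matrix_chain_min_ops(const int *dims, int n)
{
    if (n < 2) return 0;                      /* a single matrix needs no product */

    /* cost[i * n + j] = cheapest way to multiply matrices i..j (0-based). */
    long *cost = calloc((size_t)n * (size_t)n, sizeof *cost);
    if (!cost) return -1;

    for (int len = 2; len <= n; ++len) {          /* sub-chain length */
        for (int i = 0; i + len <= n; ++i) {      /* sub-chain start  */
            int j = i + len - 1;                  /* sub-chain end    */
            long best = LONG_MAX;
            for (int k = i; k < j; ++k) {         /* split between k and k + 1 */
                long c = cost[(size_t)i * n + k]
                       + cost[(size_t)(k + 1) * n + j]
                       + (long)dims[i] * dims[k + 1] * dims[j + 1];
                if (c < best) best = c;
            }
            cost[(size_t)i * n + j] = best;
        }
    }
    long result = cost[n - 1];                    /* entry for the full chain 0..n-1 */
    free(cost);
    return result;
}
```

For example, with shapes 10x30, 30x5 and 5x60 (`dims = {10, 30, 5, 60}`), multiplying the first two matrices first costs 1500 + 3000 = 4500 scalar multiplications, versus 27000 for the other order.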