195 research outputs found

    Deriving divide-and-conquer dynamic programming algorithms using solver-aided transformations

    We introduce a framework that allows domain experts to manipulate computational terms in the interest of deriving better, more efficient implementations. It employs deductive reasoning to generate provably correct, efficient implementations from a very high-level specification of an algorithm, and inductive constraint-based synthesis to improve automation. Semantic information is encoded into program terms through the use of refinement types. In this paper, we develop the technique in the context of a system called Bellmania that uses solver-aided tactics to derive parallel divide-and-conquer implementations of dynamic programming algorithms that have better locality and are significantly more efficient than traditional loop-based implementations. Bellmania includes a high-level language for specifying dynamic programming algorithms and a calculus that facilitates gradual transformation of these specifications into efficient implementations. These transformations formalize the divide-and-conquer technique; a visualization interface helps users interactively guide the process, while an SMT-based back-end verifies each step and takes care of the low-level reasoning required for parallelism. We have used the system to generate provably correct implementations of several algorithms, including some important algorithms from computational biology, and show that their performance is comparable to that of the best manually optimized code. Funding: National Science Foundation (U.S.) grants CCF-1139056, CCF-1439084, and CNS-1553510.
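    The Python sketch below (ours, not Bellmania's own specification language) illustrates the divide-and-conquer structure such derivations target: in an edit-distance table each cell depends only on its left, upper, and upper-left neighbors, so the table can be filled by recursing on quadrants. The recursion touches the table in cache-friendly blocks, and the two off-diagonal quadrants are mutually independent and could be computed in parallel.

        def edit_distance_dc(a, b, base=64):
            """Fill an edit-distance table in recursive quadrant order."""
            n, m = len(a), len(b)
            D = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(n + 1):
                D[i][0] = i
            for j in range(m + 1):
                D[0][j] = j

            def fill(i0, i1, j0, j1):
                # Fill D[i][j] for i0 <= i < i1, j0 <= j < j1, assuming all
                # cells above and to the left of the block are already done.
                if i1 - i0 <= base or j1 - j0 <= base:
                    for i in range(i0, i1):
                        for j in range(j0, j1):
                            cost = 0 if a[i - 1] == b[j - 1] else 1
                            D[i][j] = min(D[i - 1][j] + 1,
                                          D[i][j - 1] + 1,
                                          D[i - 1][j - 1] + cost)
                    return
                im, jm = (i0 + i1) // 2, (j0 + j1) // 2
                fill(i0, im, j0, jm)   # top-left: depends only on outside cells
                fill(i0, im, jm, j1)   # top-right and bottom-left are
                fill(im, i1, j0, jm)   # independent: parallelizable
                fill(im, i1, jm, j1)   # bottom-right: needs the other three

            fill(1, n + 1, 1, m + 1)
            return D[n][m]

        assert edit_distance_dc("kitten", "sitting") == 3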

    Hybrid analysis of memory references and its application to automatic parallelization

    Executing sequential code in parallel on a multithreaded machine has been an elusive goal of the academic and industrial research communities for many years. It has recently become more important due to the widespread introduction of multicores in PCs. Automatic multithreading has not been achieved because classic static compiler analysis was not powerful enough and program behavior was found to be, in many cases, input dependent. Speculative thread-level parallelization was a welcome avenue for advancing parallelization coverage, but its performance was not always optimal due to the sometimes unnecessary overhead of checking every dynamic memory reference. In this dissertation we introduce a novel analysis technique, Hybrid Analysis, which unifies static and dynamic memory reference techniques into a seamless compiler framework that extracts almost all of the available parallelism from scientific codes while incurring close to the minimum necessary run-time overhead. We present how to extract maximum information from quantities that could not be sufficiently analyzed through static compiler methods, and how to generate sufficient conditions which, when evaluated dynamically, can validate optimizations. Our techniques have been fully implemented in the Polaris compiler and resulted in whole-program speedups on a large number of industry-standard benchmark applications.
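    As a minimal sketch of the hybrid idea (ours, not the Polaris implementation): when static analysis cannot prove that the writes A[idx[i]] of a loop never collide, the compiler can emit a single aggregate runtime test of a sufficient condition, here "idx contains no duplicates", and dispatch to a parallel schedule only when the test passes, instead of checking every dynamic memory reference.

        from concurrent.futures import ThreadPoolExecutor

        def update(A, idx, vals):
            # Sufficient condition: all write targets are distinct, so the
            # iterations are independent. One O(n) test replaces per-access
            # speculation checks.
            if len(set(idx)) == len(idx):
                def body(i):
                    A[idx[i]] = vals[i] * 2        # independent iteration
                # ThreadPoolExecutor stands in for a real parallel schedule.
                with ThreadPoolExecutor() as pool:
                    list(pool.map(body, range(len(idx))))
            else:
                for i in range(len(idx)):          # sequential fallback
                    A[idx[i]] = vals[i] * 2

        A = [0] * 8
        update(A, idx=[3, 1, 6, 0], vals=[10, 20, 30, 40])
        print(A)   # [80, 40, 0, 20, 0, 0, 60, 0]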

    æœšă‚’ç”šă„ăŸæ§‹é€ ćŒ–äžŠćˆ—ăƒ—ăƒ­ă‚°ăƒ©ăƒŸăƒłă‚°

    High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are considered irregular algorithms. The general graph structures that irregular algorithms typically deal with are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and a key to parallel programming, general graphs are genuinely hard to handle. Trees, however, lend themselves to divide-and-conquer computation by definition and are sufficiently general and powerful as a programming tool. We therefore deal with abstractions of tree-based computations. Our study started from Matsuzaki's work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementation. Specifically, we have dealt with two issues. First, we implemented loose coupling between skeletons and data structures and developed a flexible tree skeleton library. Second, we implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and enables programmers to use them without burden. However, these steps alone did not make tree skeletons practical. On the basis of observations from the practice of tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs; program analysis is therefore difficult to divide and conquer. To resolve this problem, we developed divide-and-conquer methods for program analysis in a syntax-directed manner on the basis of Rosen's high-level approach. Specifically, we dealt with data-flow analysis based on Tarjan's formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, the primary issue is locality: a naive parallel neighborhood computation without locality enhancement causes many cache misses. The divide-and-conquer paradigm is known to be useful for locality enhancement as well. We therefore applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations. University of Electro-Communications, 201
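    A minimal sketch of the skeleton idea (illustrative Python; the names are ours, not the library's API): a "reduce" skeleton captures the recursion pattern over a binary tree once, and users supply only the combining function. Because the two subtree calls are independent, a parallel implementation of the skeleton can evaluate them concurrently without any change to user code.

        from dataclasses import dataclass

        @dataclass
        class Node:
            value: int
            left: "Node | None" = None
            right: "Node | None" = None

        def tree_reduce(node, combine, unit):
            """Reduce skeleton: fold a binary tree bottom-up. The two
            recursive calls are independent and may run in parallel."""
            if node is None:
                return unit
            l = tree_reduce(node.left, combine, unit)
            r = tree_reduce(node.right, combine, unit)
            return combine(node.value, l, r)

        t = Node(1, Node(2, Node(4)), Node(3))
        # Sum of all node values, expressed through the skeleton alone.
        print(tree_reduce(t, lambda v, l, r: v + l + r, 0))   # 10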

    A bibliography on parallel and vector numerical algorithms

    This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming languages, and other topics of interest in scientific computing. Certain conference proceedings and anthologies that have been published in book form are also listed.

    Staged parser combinators for efficient data processing

    Parsers are ubiquitous in computing, and many applications depend on their performance for decoding data efficiently. Parser combinators are an intuitive tool for writing parsers: tight integration with the host language enables grammar specifications to be interleaved with processing of parse results. Unfortunately, parser combinators are typically slow due to the high overhead of the host-language abstraction mechanisms that enable composition. We present a technique for eliminating such overhead. We use staging, a form of runtime code generation, to dissociate input parsing from parser composition, and eliminate intermediate data structures and computations associated with parser composition at staging time. A key challenge is to maintain support for input-dependent grammars, which have no clear stage distinction. Our approach applies to top-down recursive-descent parsers as well as bottom-up nondeterministic parsers, with key applications in dynamic programming on sequences, where we auto-generate code for parallel hardware. We achieve performance comparable to specialized, hand-written parsers.
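    The following sketch conveys the staging idea with hypothetical Python combinators (the paper's own implementation is in Scala): each combinator emits specialized source code instead of parsing directly, so the cost of composition is paid once at staging time, and the generated parser runs as a single flat function with no intermediate combinator objects.

        def lit(c):
            # Stage time: return a generator of code, not a parser.
            return lambda: f"(pos + 1 if pos < len(s) and s[pos] == {c!r} else -1)"

        def seq(p, q):
            # Compose code generators; a failure position (-1) short-circuits.
            return lambda: (f"(lambda pos: -1 if pos < 0 else {q()})"
                            f"({p()})")

        def compile_parser(p):
            src = f"lambda s, pos=0: {p()}"
            return eval(src)   # generated once, reused on every input

        ab = compile_parser(seq(lit('a'), lit('b')))
        print(ab("abc"))   # 2  (input position after matching "ab")
        print(ab("axc"))   # -1 (failure)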

    Improved static analysis and verification of energy consumption and other resources via abstract interpretation

    Resource analysis aims at inferring the cost of executing programs for any possible input, in terms of a given resource, such as the traditional execution steps, time, or memory, and, more recently, energy consumption or user-defined resources (e.g., number of bits sent over a socket, number of database accesses, number of calls to particular procedures, etc.). This is performed statically, i.e., without actually running the programs. Resource usage information is useful for a variety of optimization and verification applications, as well as for guiding software design. For example, programmers can use such information to choose different algorithmic solutions to a problem; program transformation systems can use cost information to choose between alternative transformations; parallelizing compilers can use cost estimates for granularity control, which tries to balance the overheads of task creation and manipulation against the benefits of parallelization. In this thesis we have significantly improved an existing prototype implementation for resource usage analysis based on abstract interpretation, addressing a number of relevant challenges and overcoming many limitations it presented. The goal of that prototype was to show the viability of casting resource analysis as an abstract domain, and how it could overcome important limitations of state-of-the-art resource usage analysis tools. For this purpose, it was implemented as an abstract domain in the abstract interpretation framework of the CiaoPP system, PLAI. We have improved both the design and implementation of the prototype, eventually allowing the tool to evolve toward an industrial application level. The abstract operations of such a tool depend heavily on setting up, and finding closed-form solutions of, recurrence relations representing the resource usage behavior of program components and of the program as a whole. While there exist many tools, such as Computer Algebra Systems (CAS) and libraries, able to find closed-form solutions for some types of recurrences, none of them alone is able to handle all the types of recurrences arising during program analysis. In addition, there are some types of recurrences that cannot be solved by any existing tool. This clearly constitutes a bottleneck for this kind of resource usage analysis. Thus, one of the major challenges we have addressed in this thesis is the design and development of a novel modular framework for solving recurrence relations, able to combine and take advantage of the results of existing solvers. Additionally, we have developed and integrated into our novel solver a technique for finding upper-bound closed-form solutions of a special class of recurrence relations that arise during the analysis of programs with accumulating parameters.
    Finally, we have integrated the improved resource analysis into the CiaoPP general framework for resource usage verification, and specialized the framework for verifying energy consumption specifications of embedded imperative programs in a real application, showing the usefulness and practicality of the resulting tool.
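    As a minimal sketch of the recurrence-solving step (using SymPy's rsolve as a stand-in for the thesis's modular solver): a procedure that performs n units of work and then recurses on an input of size n - 1 yields the cost recurrence C(n) = C(n-1) + n, whose closed-form solution bounds the resource usage.

        from sympy import Function, rsolve, symbols

        n = symbols('n', integer=True)
        C = Function('C')

        # Solve C(n) - C(n-1) - n = 0 with the base case C(0) = 0.
        closed = rsolve(C(n) - C(n - 1) - n, C(n), {C(0): 0})
        print(closed)   # n*(n + 1)/2 (or an equivalent form)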

    Parallelization of dynamic programming recurrences in computational biology

    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15x and 130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3-GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
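    To make the fine-grained parallelism concrete, here is a sequential Python sketch of the Nussinov recurrence (ours, for illustration; the accelerators realize this in hardware): all cells on one anti-diagonal of the table, i.e., all subsequences of a fixed length, are mutually independent, so a processor array can compute an entire diagonal at once.

        def pairs(x, y):
            return 1 if (x, y) in {('A','U'), ('U','A'), ('G','C'), ('C','G')} else 0

        def nussinov(seq):
            n = len(seq)
            N = [[0] * n for _ in range(n)]
            for length in range(1, n):           # anti-diagonal index
                for i in range(n - length):      # cells on one diagonal are
                    j = i + length               # independent: parallelizable
                    best = max(N[i + 1][j],                  # i unpaired
                               N[i][j - 1],                  # j unpaired
                               N[i + 1][j - 1] + pairs(seq[i], seq[j]))
                    for k in range(i + 1, j):                # bifurcations
                        best = max(best, N[i][k] + N[k + 1][j])
                    N[i][j] = best
            return N[0][n - 1]

        print(nussinov("GGGAAAUCC"))   # 3 base pairs (two G-C and one A-U)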
    • 
