195 research outputs found
Deriving divide-and-conquer dynamic programming algorithms using solver-aided transformations
We introduce a framework that allows domain experts to manipulate computational terms in the interest of deriving better, more efficient implementations. It employs deductive reasoning to generate provably correct, efficient implementations from a very high-level specification of an algorithm, and inductive constraint-based synthesis to improve automation. Semantic information is encoded into program terms through the use of refinement types.
In this paper, we develop the technique in the context of a system called Bellmania that uses solver-aided tactics to derive parallel divide-and-conquer implementations of dynamic programming algorithms that have better locality and are significantly more efficient than traditional loop-based implementations. Bellmania includes a high-level language for specifying dynamic programming algorithms and a calculus that facilitates gradual transformation of these specifications into efficient implementations. These transformations formalize the divide-and-conquer technique; a visualization interface helps users interactively guide the process, while an SMT-based back end verifies each step and takes care of the low-level reasoning required for parallelism.
We have used the system to generate provably correct implementations of several algorithms, including some important algorithms from computational biology, and show that their performance is comparable to that of the best manually optimized code. National Science Foundation (U.S.) (CCF-1139056); National Science Foundation (U.S.) (CCF-1439084); National Science Foundation (U.S.) (CNS-1553510)
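The divide-and-conquer idea behind such derivations can be sketched in miniature. The following is a generic illustration, not Bellmania's tactic language or its output: a longest-common-subsequence table is filled by recursing on quadrants rather than by row-major loops, so each step works inside a small, cache-resident block, and two of the four subproblems are independent.

```python
# Hypothetical sketch: divide-and-conquer evaluation of the LCS table.
# dp[i][j] depends on dp[i-1][j-1], dp[i-1][j], dp[i][j-1], so within a
# block, the top-left quadrant must come first and the bottom-right last,
# while the other two quadrants are mutually independent.

def lcs_table(a, b):
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]

    def solve(i0, i1, j0, j1):
        # Fill the block dp[i0..i1) x dp[j0..j1); all cells above and to
        # the left of the block are assumed already computed.
        if i1 - i0 <= 1 or j1 - j0 <= 1:      # base case: plain loops
            for i in range(i0, i1):
                for j in range(j0, j1):
                    if a[i - 1] == b[j - 1]:
                        dp[i][j] = dp[i - 1][j - 1] + 1
                    else:
                        dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            return
        im, jm = (i0 + i1) // 2, (j0 + j1) // 2
        solve(i0, im, j0, jm)                 # top-left first
        solve(i0, im, jm, j1)                 # these two quadrants are
        solve(im, i1, j0, jm)                 # independent: parallelizable
        solve(im, i1, jm, j1)                 # bottom-right last

    solve(1, n + 1, 1, m + 1)
    return dp[n][m]

print(lcs_table("ABCBDAB", "BDCABA"))         # 4
```

In Bellmania's setting the recursive structure is derived and verified rather than hand-written; this sketch only illustrates why the blocked form has better locality and where the parallelism appears.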
Hybrid analysis of memory references and its application to automatic parallelization
Executing sequential code in parallel on a multithreaded machine has been an elusive goal of the academic and industrial research communities for many years. It has recently become more important due to the widespread introduction of multicores in PCs. Automatic multithreading has not been achieved because classic, static compiler analysis was not powerful enough and program behavior was found to be, in many cases, input dependent. Speculative thread-level parallelization was a welcome avenue for advancing parallelization coverage, but its performance was not always optimal due to the sometimes unnecessary overhead of checking every dynamic memory reference.
In this dissertation we introduce a novel analysis technique, Hybrid Analysis, which unifies static and dynamic memory reference techniques into a seamless compiler framework that extracts nearly all available parallelism from scientific codes and incurs close to the minimum necessary run-time overhead. We present how to extract maximum information from the quantities that cannot be sufficiently analyzed through static compiler methods, and how to generate sufficient conditions which, when evaluated dynamically, can validate optimizations. Our techniques have been fully implemented in the Polaris compiler and have resulted in whole-program speedups on a large number of industry-standard benchmark applications.
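The core idea can be sketched in a few lines. This is not the Polaris implementation (which operates on Fortran); the loop, the condition, and the function names below are illustrative. The compiler cannot statically prove that the loop `for i: A[i] = A[i+k] + 1` carries no dependence, because `k` is input dependent, so instead of checking every dynamic reference it emits one cheap sufficient condition, here `|k| >= n`, meaning the read and write index ranges cannot overlap, and tests it once at run time:

```python
# Hypothetical sketch of hybrid analysis: a statically derived sufficient
# condition is evaluated at run time to choose between a parallel and a
# sequential version of the loop.
from concurrent.futures import ThreadPoolExecutor

def loop_body(A, i, k):
    A[i] = A[i + k] + 1          # writes A[0..n), reads A[k..n+k)

def run_loop(A, n, k):
    if abs(k) >= n:              # sufficient condition: ranges disjoint
        with ThreadPoolExecutor() as ex:       # safe to run in parallel
            list(ex.map(lambda i: loop_body(A, i, k), range(n)))
    else:                        # condition fails: fall back to sequential
        for i in range(n):
            loop_body(A, i, k)
    return A

print(run_loop(list(range(8)), 4, 4))   # disjoint: parallel path taken
```

The condition is sufficient but not necessary: when it fails, the loop may still be parallelizable, and the framework simply pays the cost of the safe sequential version.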
Structured parallel programming using trees
High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are considered irregular algorithms. General graph structures, which irregular algorithms typically deal with, are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and a key to parallel programming, general graphs are genuinely difficult to handle. Trees, however, lead to divide-and-conquer computations by definition and are sufficiently general and powerful as a programming tool. We therefore deal with abstractions of tree-based computations. Our study started from Matsuzaki's work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementation aspect. Specifically, we have dealt with two issues. First, we implemented loose coupling between skeletons and data structures and developed a flexible tree skeleton library. Second, we implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and lets programmers use them with no burden. The practicality of tree skeletons, however, has still been limited. On the basis of observations from the practice of tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs. Program analysis is therefore difficult to divide and conquer. To resolve this problem, we have developed divide-and-conquer methods for program analysis in a syntax-directed manner on the basis of Rosen's high-level approach. Specifically, we have dealt with data-flow analysis based on Tarjan's formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, a primary issue is locality. A naive parallel neighborhood computation without locality enhancement causes many cache misses. The divide-and-conquer paradigm is known to be useful for locality enhancement as well. We therefore have applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations.
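A tree skeleton in miniature can convey the abstraction. This is a generic sketch, not the thesis's actual library API: the skeleton owns the traversal, the user supplies only the leaf and node operators, and the two subtree reductions are independent, which is exactly where a parallel implementation would fork.

```python
# Hypothetical tree-reduce skeleton: the traversal is abstracted away
# from the user, who provides only the combining operators.

class Leaf:
    def __init__(self, v):
        self.v = v

class Node:
    def __init__(self, v, left, right):
        self.v, self.left, self.right = v, left, right

def reduce_tree(t, leaf_op, node_op):
    if isinstance(t, Leaf):
        return leaf_op(t.v)
    l = reduce_tree(t.left, leaf_op, node_op)   # independent subcalls: a
    r = reduce_tree(t.right, leaf_op, node_op)  # parallel skeleton would
    return node_op(t.v, l, r)                   # run them concurrently

t = Node(1, Node(2, Leaf(3), Leaf(4)), Leaf(5))
print(reduce_tree(t, lambda v: v, lambda v, l, r: v + l + r))  # 15
```

Real tree skeleton libraries additionally handle unbalanced trees by segmenting them into balanced pieces, which is the load-balancing concern the abstract refers to.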
A bibliography on parallel and vector numerical algorithms
This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming languages, and other topics of interest to scientific computing. Certain conference proceedings and anthologies that have been published in book form are also listed.
Staged parser combinators for efficient data processing
Parsers are ubiquitous in computing, and many applications depend on their performance for decoding data efficiently. Parser combinators are an intuitive tool for writing parsers: tight integration with the host language enables grammar specifications to be interleaved with processing of parse results. Unfortunately, parser combinators are typically slow due to the high overhead of the host-language abstraction mechanisms that enable composition. We present a technique for eliminating such overhead. We use staging, a form of runtime code generation, to dissociate input parsing from parser composition, and eliminate intermediate data structures and computations associated with parser composition at staging time. A key challenge is to maintain support for input-dependent grammars, which have no clear stage distinction. Our approach applies to top-down recursive-descent parsers as well as bottom-up nondeterministic parsers, with key applications in dynamic programming on sequences, where we auto-generate code for parallel hardware. We achieve performance comparable to specialized, hand-written parsers.
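The contrast between composed closures and staged code can be shown in a toy form. This is a Python rendering of the idea, not the paper's implementation: the unstaged combinators stack a closure call per grammar operator, while the staged combinators emit code fragments that are assembled and compiled once into a specialized parser with no composition overhead left at parse time.

```python
# Unstaged combinators: each grammar operator is a closure, and parsing
# pays one indirect call per operator on every input position.
def lit(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else -1

def seq(p, q):
    def parse(s, i):
        j = p(s, i)
        return q(s, j) if j >= 0 else -1
    return parse

# Staged flavor (hypothetical sketch): combinators emit source fragments
# instead of closures; `stage` compiles the assembled code once.
def lit_code(c):
    return (f"    if 0 <= i < len(s) and s[i] == {c!r}:\n"
            f"        i += 1\n"
            f"    else:\n"
            f"        i = -1\n")

def seq_code(*parts):
    return "".join(parts)           # sequencing is just concatenation

def stage(body):
    src = "def parser(s):\n    i = 0\n" + body + "    return i == len(s)\n"
    ns = {}
    exec(src, ns)                   # one-time code generation
    return ns["parser"]

ab = stage(seq_code(lit_code("a"), lit_code("b")))
print(ab("ab"), ab("ax"))           # True False
```

The paper's system does this with typed multi-stage programming in the host language rather than string splicing, and additionally handles grammars whose structure depends on the input.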
Improved static analysis and verification of energy consumption and other resources via abstract interpretation
Resource analysis aims at inferring the cost of executing programs for any possible input, in terms of a given resource, such as the traditional execution steps, time, or memory, and, more recently, energy consumption or user-defined resources (e.g., number of bits sent over a socket, number of database accesses, number of calls to particular procedures, etc.). This is performed statically, i.e., without actually running the programs. Resource usage information is useful for a variety of optimization and verification applications, as well as for guiding software design. For example, programmers can use such information to choose different algorithmic solutions to a problem; program transformation systems can use cost information to choose between alternative transformations; and parallelizing compilers can use cost estimates for granularity control, which tries to balance the overheads of task creation and manipulation against the benefits of parallelization.
In this thesis we have significantly improved an existing prototype implementation for resource usage analysis based on abstract interpretation, addressing a number of relevant challenges and overcoming many limitations it presented. The goal of that prototype was to show the viability of casting resource analysis as an abstract domain, and how doing so could overcome important limitations of state-of-the-art resource usage analysis tools. For this purpose, it was implemented as an abstract domain in PLAI, the abstract interpretation framework of the CiaoPP system. We have improved both the design and the implementation of the prototype, eventually allowing the tool to evolve toward the industrial application level.
The abstract operations of such a tool depend heavily on setting up, and finding closed-form solutions of, recurrence relations that represent the resource usage behavior of program components and of the whole program. While there exist many tools, such as Computer Algebra Systems (CAS) and libraries, able to find closed-form solutions for some types of recurrences, none of them alone is able to handle all the types of recurrences arising during program analysis. In addition, there are some types of recurrences that cannot be solved by any existing tool. This clearly constitutes a bottleneck for this kind of resource usage analysis. Thus, one of the major challenges we have addressed in this thesis is the design and development of a novel modular framework for solving recurrence relations, able to combine and take advantage of the results of existing solvers. Additionally, we have developed and integrated into our novel solver a technique for finding upper-bound closed-form solutions of a special class of recurrence relations that arise during the analysis of programs with accumulating parameters.
Finally, we have integrated the improved resource analysis into the CiaoPP general framework for resource usage verification, and specialized the framework for verifying energy consumption specifications of embedded imperative programs in a real application, showing the usefulness and practicality of the resulting tool.
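The recurrence-solving step can be illustrated on a generic example. This is only the flavor of the problem, not the thesis's modular solver or its accumulating-parameter technique: the analyzer extracts a cost recurrence from a divide-and-conquer program and needs a closed form to report a cost function of the input size.

```python
# Hypothetical example: a cost recurrence C(n) = 2*C(n/2) + n, C(1) = 1,
# as extracted from a divide-and-conquer program (two recursive calls on
# half the input plus a linear combine step).  For n = 2**k the closed
# form a solver would report is C(n) = n*log2(n) + n = n*(k + 1).
from functools import lru_cache

@lru_cache(maxsize=None)
def cost(n):
    if n <= 1:
        return 1
    return 2 * cost(n // 2) + n

def closed_form(n, k):          # valid for n = 2**k
    return n * k + n

# Cross-check the closed form against the recurrence on small sizes.
for k in range(10):
    assert cost(2 ** k) == closed_form(2 ** k, k)
print(cost(1024))               # 1024*10 + 1024 = 11264
```

In the thesis the point is precisely that no single back-end solver handles all recurrence classes produced by analysis, so the framework dispatches each recurrence to the solver (or combination of solvers) that can close it, falling back to safe upper bounds when only those are obtainable.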
Parallelization of dynamic programming recurrences in computational biology
The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15x and 130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3-GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors.
Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
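The parallelism that the polyhedral analysis exposes in Nussinov-style recurrences can be seen in software. The sketch below is a plain CPU stand-in for the FPGA arrays in the thesis, using a simplified Nussinov recurrence with no minimum hairpin length: every cell on an anti-diagonal d = j - i depends only on cells of earlier diagonals, so all cells of one diagonal can be computed in parallel (the wavefront schedule).

```python
# Simplified Nussinov RNA folding: N[i][j] is the maximum number of
# base pairs in seq[i..j].  The inner loop over i (one anti-diagonal)
# has no intra-diagonal dependences and is the parallel dimension.

def nussinov(seq):
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    N = [[0] * n for _ in range(n)]
    for d in range(1, n):                 # diagonals, in dependence order
        for i in range(n - d):            # cells on a diagonal: independent
            j = i + d
            best = max(N[i + 1][j], N[i][j - 1])
            if (seq[i], seq[j]) in pairs:
                inner = N[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            for k in range(i + 1, j):     # bifurcations
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best
    return N[0][n - 1]

print(nussinov("GGGAAAUCC"))              # 3 pairs: G-C, G-C, A-U
```

On an FPGA array the thesis goes further than the wavefront: independent problem instances are pipelined through the same array to optimize throughput rather than the latency of a single instance.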
High performance Monte Carlo computation for finance risk data analysis
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Finance risk management has been playing an increasingly important role in the finance sector, both to analyse finance data and to prevent any potential crisis. It has been widely recognised that Value at Risk (VaR) is an effective method for finance risk management and evaluation. This thesis conducts a comprehensive review of a number of VaR methods and discusses in depth their strengths and limitations. Among these VaR methods, Monte Carlo simulation and analysis has proven to be the most accurate VaR method in finance risk evaluation due to its strong modelling capabilities. However, one major challenge in Monte Carlo analysis is its high computational complexity of O(n²). To speed up the computation in Monte Carlo analysis, this thesis parallelises Monte Carlo using the MapReduce model, which has become a major software programming model in support of data-intensive applications. MapReduce consists of two functions: Map and Reduce. The Map function segments a large data set into small data chunks and distributes these chunks among a number of computers for processing in parallel, with a Mapper processing a data chunk on a computing node. The Reduce function collects the results generated by these Map nodes (Mappers) and generates an output. The parallel Monte Carlo is evaluated initially in a small-scale MapReduce experimental environment, and subsequently in a large-scale simulation environment. Both experimental and simulation results show that the MapReduce-based parallel Monte Carlo is significantly faster than the sequential Monte Carlo in computation, and that its accuracy level is maintained as well. In data-intensive applications, moving huge volumes of data among the computing nodes can incur a high communication overhead. To address this issue, this thesis further considers data locality in the MapReduce-based parallel Monte Carlo, and evaluates the impact of data locality on computational performance.
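The Map/Reduce decomposition of Monte Carlo VaR can be sketched with plain functions standing in for cluster nodes. This is a hypothetical simplification, not the thesis's Hadoop-style deployment, and the return model (i.i.d. Gaussian daily returns) is an illustrative assumption: each Mapper simulates one chunk of portfolio-return scenarios, and the Reducer merges the chunks and reads off the loss quantile.

```python
# Hypothetical MapReduce-style Monte Carlo VaR: on a cluster, each
# mapper() call would run on a separate computing node.
import random

def mapper(chunk_seed, n_scenarios, mu=0.0005, sigma=0.02):
    # One Mapper: simulate n_scenarios one-day portfolio returns
    # (illustrative Gaussian model with assumed mu and sigma).
    rng = random.Random(chunk_seed)
    return [rng.gauss(mu, sigma) for _ in range(n_scenarios)]

def reducer(chunks, confidence=0.99):
    # The Reducer: merge all scenario lists, sort the losses, and take
    # the quantile at the requested confidence level.
    losses = sorted(-r for chunk in chunks for r in chunk)
    idx = int(confidence * len(losses))
    return losses[min(idx, len(losses) - 1)]

# "Distribute" four chunks of 25,000 scenarios each.
chunks = [mapper(seed, 25_000) for seed in range(4)]
var_99 = reducer(chunks)
print(f"99% one-day VaR (as a return fraction): {var_99:.4f}")
```

Because each chunk is simulated independently, the Map phase scales with the number of nodes; the data-locality question the thesis studies is where these chunks live relative to the Mappers that process them.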
- …