Type-Inference Based Short Cut Deforestation (nearly) without Inlining
Deforestation optimises a functional program by transforming it into another one that does not create certain intermediate data structures. In [ICFP'99] we presented a type-inference based deforestation algorithm which performs extensive inlining. However, across module boundaries only limited inlining is practically feasible. Furthermore, inlining is a non-trivial transformation which is therefore best implemented as a separate optimisation pass. To perform short cut deforestation (nearly) without inlining, Gill suggested splitting definitions into workers and wrappers and inlining only the small wrappers, which transfer the information needed for deforestation. We show that Gill's use of the function build limits deforestation and note that his reasons for using build do not apply to our approach. Hence we develop a more general worker/wrapper scheme without build. We give a type-inference based algorithm which splits definitions into workers and wrappers. Finally, we show that we can deforest more expressions with the worker/wrapper scheme than with the algorithm that relies on inlining.
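The worker/wrapper idea described in this abstract can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the paper's setting is Haskell, and the names foldr, cons, upto, and upto_worker are hypothetical and not taken from the paper. The worker is abstracted over the list constructors it would otherwise use, and the wrapper only supplies the real constructors. Inlining just the small wrapper then exposes the pattern foldr k z (worker cons nil), which can be replaced by worker k z so that no intermediate list is built.

```python
def foldr(k, z, xs):
    """Right fold: conceptually replaces every cons in xs by k and the empty list by z."""
    acc = z
    for x in reversed(xs):
        acc = k(x, acc)
    return acc


def cons(x, xs):
    """List constructor on Python lists (non-destructive)."""
    return [x] + xs


def upto_worker(n, c, nil):
    """Worker (hypothetical): produces 1..n using whatever constructors it is given."""
    acc = nil
    for i in range(n, 0, -1):
        acc = c(i, acc)
    return acc


def upto(n):
    """Wrapper: small enough to inline; instantiates the worker with the real constructors."""
    return upto_worker(n, cons, [])


def sum_unfused(n):
    """Consumer applied to the producer: the list 1..n is materialised, then folded."""
    return foldr(lambda x, acc: x + acc, 0, upto(n))


def sum_fused(n):
    """After inlining the wrapper and rewriting foldr k z (worker cons nil) to worker k z."""
    return upto_worker(n, lambda x, acc: x + acc, 0)
```

Both versions compute the same result; only the unfused one allocates the list. In a typed language the rewrite is justified for any worker that is polymorphic in the result type, which is the information a type-inference based algorithm would need to establish.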
Catamorphism-based program transformations for non-strict functional languages
In functional languages intermediate data structures are often used as glue to connect
separate parts of a program together. These intermediate data structures are useful because
they allow modularity, but they are also a cause of inefficiency: each element needs to be
allocated, to be examined, and to be deallocated.
Warm fusion is a program transformation technique which aims to eliminate intermediate
data structures. Functions in a program are first transformed into the so-called build-cata
form, then fused via a one-step rewrite rule, the cata-build rule. In the process of the
transformation to build-cata form we attempt to replace explicit recursion with a fixed
pattern of recursion (catamorphism).
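As a rough illustration of the cata-build rule on a datatype other than lists, the sketch below uses a binary tree in Python. All names (Leaf, Node, cata, build, full_tree) are assumptions made for this example and do not come from the paper or from the Glasgow Haskell Compiler. A producer in build form is abstracted over the tree constructors, a consumer is a catamorphism, and the rule cata f g (build p) = p f g removes the intermediate tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    value: int


@dataclass(frozen=True)
class Node:
    left: "object"
    right: "object"


def cata(on_leaf, on_node, tree):
    """Catamorphism: replaces Leaf by on_leaf and Node by on_node throughout the tree."""
    if isinstance(tree, Leaf):
        return on_leaf(tree.value)
    return on_node(cata(on_leaf, on_node, tree.left),
                   cata(on_leaf, on_node, tree.right))


def build(producer):
    """Runs a constructor-abstracted producer with the real tree constructors."""
    return producer(Leaf, Node)


def full_tree(depth):
    """Producer in build form (hypothetical): a complete tree with heap-style leaf labels."""
    def producer(leaf, node):
        def go(d, label):
            if d == 0:
                return leaf(label)
            return node(go(d - 1, 2 * label), go(d - 1, 2 * label + 1))
        return go(depth, 1)
    return producer


def sum_leaves_unfused(depth):
    """cata applied to build: the whole tree is allocated and then traversed."""
    return cata(lambda v: v, lambda l, r: l + r, build(full_tree(depth)))


def sum_leaves_fused(depth):
    """Result of the cata-build rule: the producer is applied directly to the cata arguments."""
    return full_tree(depth)(lambda v: v, lambda l, r: l + r)
```

The fused version never constructs a Leaf or Node value. The hard part that the abstract refers to, deriving the build and cata forms automatically from explicitly recursive definitions, is not shown here.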
We analyse in detail the problem of removing (possibly mutually recursive sets of)
polynomial datatypes.
We have implemented the warm fusion method in the Glasgow Haskell Compiler, which has
allowed practical feedback. One important conclusion is that catamorphisms and fusion
in general deserve a more prominent role in the compilation process. We give a detailed
measurement of our implementation on a suite of real application programs
Modular, higher order cardinality analysis in theory and practice
Since the mid '80s, compiler writers for functional languages (especially lazy ones) have been writing papers about identifying and exploiting thunks and lambdas that are used only once. However, it has proved difficult to achieve both power and simplicity in practice. In this paper, we describe a new, modular analysis for a higher order language, which is both simple and effective. We prove the analysis sound with respect to a standard call-by-need semantics, and present measurements of its use in a full-scale, state-of-the-art optimising compiler. The analysis finds many single-entry thunks and one-shot lambdas and enables a number of program optimisations. This paper extends our preceding conference publication (Sergey et al. 2014, Proceedings of the 41st Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 2014), ACM, pp. 335–348) with proofs, an expanded report on the evaluation, and a detailed examination of the factors causing the loss of precision in the analysis.
Hfusion: a fusion tool based on acid rain plus extensions
When constructing programs, it is common practice to compose algorithms that solve simpler problems in order to solve a more complex one. This principle adapts so well to software development because it provides a structure to understand, design, reuse and test programs. In functional languages, algorithms are usually connected through the use of intermediate data structures, which carry the data from one algorithm to another. The data structures impose a load on the algorithms, which must allocate, traverse and deallocate them. To alleviate this inefficiency, automatic program transformations have been studied, which produce equivalent programs that make less use of intermediate data structures. We present a set of automatic program transformation techniques based on algebraic laws known as Acid Rain. These techniques allow the removal of intermediate data structures in programs containing primitive recursive functions, mutually recursive functions and functions with multiple recursive arguments. We also provide an experimental implementation of our techniques which allows their application to user-supplied programs.
Liveness-Based Garbage Collection for Lazy Languages
We consider the problem of reducing the memory required to run lazy
first-order functional programs. Our approach is to analyze programs for
liveness of heap-allocated data. The result of the analysis is used to preserve
only live data (a subset of reachable data) during garbage collection. The
result is an increase in the garbage reclaimed and a reduction in the peak
memory requirement of programs. While this technique has already been shown to
yield benefits for eager first-order languages, the lack of a statically
determinable execution order and the presence of closures pose new challenges
for lazy languages. These require changes both in the liveness analysis itself
and in the design of the garbage collector.
To show the effectiveness of our method, we implemented a copying collector
that uses the results of the liveness analysis to preserve live objects, both
evaluated (i.e., in WHNF) and closures. Our experiments confirm that for
programs running with a liveness-based garbage collector, there is a
significant decrease in peak memory requirements. In addition, a sizable
reduction in the number of collections ensures that in spite of using a more
complex garbage collector, the execution times of programs running with
liveness and reachability-based collectors remain comparable
Hybrid eager and lazy evaluation for efficient compilation of Haskell
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (p. 208-220). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
The advantage of a non-strict, purely functional language such as Haskell lies in its clean equational semantics. However, lazy implementations of Haskell fall short: they cannot express tail recursion gracefully without annotation. We describe resource-bounded hybrid evaluation, a mixture of strict and lazy evaluation, and its realization in Eager Haskell. From the programmer's perspective, Eager Haskell is simply another implementation of Haskell with the same clean equational semantics. Iteration can be expressed using tail recursion, without the need to resort to program annotations. Under hybrid evaluation, computations are ordinarily executed in program order just as in a strict functional language. When particular stack, heap, or time bounds are exceeded, suspensions are generated for all outstanding computations. These suspensions are re-started in a demand-driven fashion from the root. The Eager Haskell compiler translates λ_C, the compiler's intermediate representation, to efficient C code. We use an equational semantics for λ_C to develop simple correctness proofs for program transformations, and connect actions in the run-time system to steps in the hybrid evaluation strategy. The focus of compilation is efficiency in the common case of straight-line execution; the handling of non-strictness and suspension is left to the run-time system. Several additional contributions have resulted from the implementation of hybrid evaluation. Eager Haskell is the first eager compiler to use a call stack. Our generational garbage collector uses this stack as an additional predictor of object lifetime. Objects above a stack watermark are assumed to be likely to die; we avoid promoting them. Those below are likely to remain untouched and therefore are good candidates for promotion. To avoid eagerly evaluating error checks, they are compiled into special bottom thunks, which are treated specially by the run-time system. The compiler identifies error handling code using a mixture of strictness and type information. This information is also used to avoid inlining error handlers, and to enable aggressive program transformation in the presence of error handling.
by Jan-Willem Maessen. Ph.D.
HERMIT: Mechanized Reasoning during Compilation in the Glasgow Haskell Compiler
It is difficult to write programs which are both correct and fast. A promising approach, functional programming, is based on the idea of using pure, mathematical functions to construct programs. With effort, it is possible to establish a connection between a specification written in a functional language, which has been proven correct, and a fast implementation, via program transformation. When practiced in the functional programming community, this style of reasoning is still typically performed by hand, by either modifying the source code or using pen-and-paper. Unfortunately, performing such semi-formal reasoning by directly modifying the source code often obfuscates the program, and pen-and-paper reasoning becomes outdated as the program changes over time. Even so, this semi-formal reasoning prevails because formal reasoning is time-consuming, and requires considerable expertise. Formal reasoning tools often only work for a subset of the target language, or require programs to be implemented in a custom language for reasoning. This dissertation investigates a solution, called HERMIT, which mechanizes reasoning during compilation. HERMIT can be used to prove properties about programs written in the Haskell functional programming language, or transform them to improve their performance. Reasoning in HERMIT proceeds in a style familiar to practitioners of pen-and-paper reasoning, and mechanization allows these techniques to be applied to real-world programs with greater confidence. HERMIT can also re-check recorded reasoning steps on subsequent compilations, enforcing a connection with the program as the program is developed. HERMIT is the first system capable of directly reasoning about the full Haskell language. The design and implementation of HERMIT, motivated both by typical reasoning tasks and HERMIT's place in the Haskell ecosystem, is presented in detail. Three case studies investigate HERMIT's capability to reason in practice. 
These case studies demonstrate that semi-formal reasoning with HERMIT lowers the barrier to writing programs which are both correct and fast
Compiling Irregular Software to Specialized Hardware
High-level synthesis (HLS) has simplified the design process for energy-efficient hardware accelerators: a designer specifies an accelerator's behavior in a "high-level" language, and a toolchain synthesizes register-transfer level (RTL) code from this specification. Many HLS systems produce efficient hardware designs for regular algorithms (i.e., those with limited conditionals or regular memory access patterns), but most struggle with irregular algorithms that rely on dynamic, data-dependent memory access patterns (e.g., traversing pointer-based structures like lists, trees, or graphs). HLS tools typically provide imperative, side-effectful languages to the designer, which makes it difficult to correctly specify and optimize complex, memory-bound applications.
In this dissertation, I present an alternative HLS methodology that leverages properties of functional languages to synthesize hardware for irregular algorithms. The main contribution is an optimizing compiler that translates pure functional programs into modular, parallel dataflow networks in hardware. I give an overview of this compiler, explain how its source and target together enable parallelism in the face of irregularity, and present two specific optimizations that further exploit this parallelism. Taken together, this dissertation verifies my thesis that pure functional programs exhibiting irregular memory access patterns can be compiled into specialized hardware and optimized for parallelism.
This work extends the scope of modern HLS toolchains. By relying on properties of pure functional languages, our compiler can synthesize hardware from programs containing constructs that commercial HLS tools prohibit, e.g., recursive functions and dynamic memory allocation. Hardware designers may thus use our compiler in conjunction with existing HLS systems to accelerate a wider class of algorithms than before
Supporting high-level, high-performance parallel programming with library-driven optimization
Parallel programming is a demanding task for developers partly because achieving scalable parallel speedup requires drawing upon a repertoire of complex, algorithm-specific, architecture-aware programming techniques. Ideally, developers of programming tools would be able to build algorithm-specific, high-level programming interfaces that hide the complex architecture-aware details. However, it is a monumental undertaking to develop such tools from scratch, and it is challenging to provide reusable functionality for developing such tools without sacrificing the hosted interface's performance or ease of use. In particular, to get high performance on a cluster of multicore computers without requiring developers to manually place data and computation onto processors, it is necessary to combine prior methods for shared memory parallelism with new methods for algorithm-aware distribution of computation and data across the cluster.
This dissertation presents Triolet, a programming language and compiler for high-level programming of parallel loops for high-performance execution on clusters of multicore computers. Triolet adopts a simple, familiar programming interface based on traversing collections of data. By incorporating semantic knowledge of how traversals behave, Triolet achieves efficient parallel execution and communication. Moreover, Triolet's performance on sequential loops is comparable to that of low-level C code, ranging from seven percent slower to 2.8× slower on tested benchmarks. Triolet's design demonstrates that it is possible to decouple the design of a compiler from the implementation of parallelism without sacrificing performance or ease of use: parallel and sequential loops are implemented as library code and compiled to efficient code by an optimizing compiler that is unaware of parallelism beyond the scope of a single thread. All handling of parallel work partitioning, data partitioning, and scheduling is embodied in library code. During compilation, library code is inlined into a program and specialized to yield customized parallel loops. Experimental results from a 128-core cluster (with 8 nodes and 16 cores per node) show that loops in Triolet outperform loops in Eden, a similar high-level language. Triolet achieves significant parallel speedup over sequential C code, with performance ranging from slightly faster to 4.3× slower than manually parallelized C code on compute-intensive loops. Thus, Triolet demonstrates that a library of container traversal functions can deliver cluster-parallel performance comparable to manually parallelized C code without requiring programmers to manage parallelism. This programming approach opens the potential for future research into parallel programming frameworks.
Program Analysis and Compilation Techniques for Speeding up Transactional Database Workloads
There is a trend towards increased specialization of data management software for performance reasons. The improved performance not only leads to a more efficient usage of the underlying hardware and cuts the operation costs of the system, but also is a game-changing competitive advantage for many emerging application domains such as high-frequency algorithmic trading, clickstream analysis, infrastructure monitoring, fraud detection, and online advertising to name a few. In this thesis, we study the automatic specialization and optimization of database application programs -- sequences of queries and updates, augmented with control flow constructs as they appear in database scripts, user-defined functions (UDFs), transactional workloads and triggers in languages such as PL/SQL. We propose to build online transaction processing (OLTP) systems around a modern compiler infrastructure. We show how to build an optimizing compiler for transaction programs using generative programming and state-of-the-art compiler technology, and present techniques for aggressive code inlining, fusion, deforestation, and data structure specialization in the domain of relational transaction programs. We also identify and explore the key optimizations that can be applied in this domain. In addition, we study the advantage of using program dependency analysis and restructuring to enable the concurrency control algorithms to achieve higher performance. Traditionally, optimistic concurrency control algorithms, such as optimistic Multi-Version Concurrency Control (MVCC), avoid blocking concurrent transactions at the cost of having a validation phase. Upon failure in the validation phase, the transaction is usually aborted and restarted from scratch. The "abort and restart" approach becomes a performance bottleneck for use cases with high contention objects or long running transactions. 
In addition, restarting from scratch creates a negative feedback loop in the system, because the system incurs additional overhead that may create even more conflicts. However, using the dependency information inside the transaction programs, we propose a novel transaction repair approach for in-memory databases. This low overhead approach summarizes the transaction programs in the form of a dependency graph. The dependency graph also contains the constructs used in the validation phase of the MVCC algorithm. Then, when encountering conflicts among transactions, our mechanism quickly detects the conflict locations in the program and partially re-executes the conflicting transactions. This approach maximizes the reuse of the computations done in the first execution round and increases the transaction processing throughput. We evaluate the proposed ideas and techniques in the thesis on some popular benchmarks such as TPC-C and modified versions of TPC-H and TPC-E, as well as other micro-benchmarks. We show that applying these techniques leads to 2x-100x performance improvement in many use cases. Besides, by selectively disabling some of the optimizations in the compiler, we derive a clinical and precise way of obtaining insight into their individual performance contributions
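The repair idea summarised above (use the dependency graph to re-execute only the parts of a transaction affected by a conflict) can be sketched as follows. This is a toy model under stated assumptions, not the thesis's algorithm: a transaction is a list of steps in program order, each step defines one local variable from database keys and earlier variables, and a conflict is reported as a set of changed keys. The names Step, run, affected, and STEPS are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Step:
    out: str                     # local variable this step defines
    db_reads: Tuple[str, ...]    # database keys the step reads
    var_reads: Tuple[str, ...]   # earlier local variables the step reads
    fn: Callable[..., int]       # computes out from db values, then variable values


def run(steps, db, env=None, only=None):
    """Execute steps in order. If only is given, re-run just those steps and
    reuse the values already in env for every other step."""
    env = dict(env or {})
    executed = []
    for s in steps:
        if only is not None and s.out not in only:
            continue
        args = [db[k] for k in s.db_reads] + [env[v] for v in s.var_reads]
        env[s.out] = s.fn(*args)
        executed.append(s.out)
    return env, executed


def affected(steps, changed_keys):
    """Steps that read a changed key, plus all steps that transitively depend on them.
    One forward pass suffices because steps are listed in program order."""
    dirty = set()
    for s in steps:
        if set(s.db_reads) & set(changed_keys) or set(s.var_reads) & dirty:
            dirty.add(s.out)
    return dirty


# A small example transaction: total = price * qty + shipping.
STEPS = [
    Step("subtotal", ("price", "qty"), (), lambda p, q: p * q),
    Step("shipping", ("ship",), (), lambda s: s),
    Step("total", (), ("subtotal", "shipping"), lambda a, b: a + b),
]
```

If validation finds that only the key ship changed, affected(STEPS, {"ship"}) selects the shipping and total steps, and run(..., only=...) recomputes those two while reusing the earlier subtotal. A real system would also have to track writes, predicates, and the validation constructs the abstract mentions.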