2 research outputs found

    Profiling a parallel domain specific language using off-the-shelf tools

    Profiling tools are essential for understanding and tuning the performance of both parallel programs and parallel language implementations. Assessing the performance of a program in a language with high-level parallel coordination is often complicated by the layers of abstraction present in the language and its implementation. This thesis investigates whether it is possible to profile parallel Domain Specific Languages (DSLs) using existing host language profiling tools. The key challenge is that the host language tools report the performance of the DSL runtime system (RTS) executing the application rather than the performance of the DSL application. The key questions are: can a correct, effective and efficient profiler be constructed using host language profiling tools; is it possible to effectively profile the DSL implementation; and what capabilities are required of the host language profiling tools? The main contribution of this thesis is the development of an execution profiler for the parallel DSL, Haskell Distributed Parallel Haskell (HdpH), using the host language profiling tools. We show that it is possible to construct a profiler (HdpHProf) to support performance analysis of both the DSL applications and the DSL implementation. The implementation uses several new GHC features, including the GHC-Events Library and ThreadScope, and develops two new performance analysis tools for HdpH internals, namely Spark Pool Contention Analysis and Registry Contention Analysis. We present a critical comparative evaluation of the host language profiling tools that we used (GHC-PPS and ThreadScope) with another recent functional profiler, EdenTV, alongside four important imperative profilers. This is the first report on the performance of functional profilers in comparison with well-established, industry-standard imperative profiling technologies. We systematically compare the profilers for usability and data presentation.
We found that GHC-PPS performs well in terms of overheads and usability, so using it to profile the DSL is feasible and would not have a significant impact on DSL performance. We validate HdpHProf for functional correctness and measure its performance using six benchmarks. HdpHProf works correctly and can scale to profile HdpH programs running on up to 192 cores of a 32-node Beowulf cluster. We characterise the performance of HdpHProf in terms of profiling data size and profiling execution runtime overhead. The results show that HdpHProf does not alter the behaviour of GHC-PPS and retains low tracing overheads, close to those of the studied functional profilers: 18% on average. They also show a low ratio of HdpH trace events in the GHC-PPS eventlog: less than 3% on average. We show that HdpHProf is effective and efficient to use for performance analysis and tuning of the DSL applications. We use HdpHProf to identify performance issues and to tune the thread granularity of six HdpH benchmarks with different parallel paradigms, e.g. divide and conquer, flat data parallel, and nested data parallel. This includes identifying problems such as thread granularity that is too small or too large, a problem size too small for the parallel architecture, and synchronisation bottlenecks. We show that HdpHProf is effective and efficient for tuning the parallel DSL implementation. We use the Spark Pool Contention Analysis tool to examine how the spark pool implementation performs when accessed concurrently. We found that appropriate thread granularity can significantly reduce both conflict ratios and conflict durations by more than 90%. We use the Registry Contention Analysis tool to evaluate three alternative registry implementations. We found that the tools can give a better understanding of how different implementations of the HdpH RTS perform.

    New techniques for adaptive program optimization

    Adaptive optimization technology is a key ingredient in modern runtime systems. This technology aims at improving performance by making optimization decisions on the basis of a program’s observed behavior. Application virtual machines indeed face different and perhaps more compelling issues compared to traditional static optimizers, as dynamic language features can force the deferral of most effective optimizations until run time. In this thesis, we present novel ideas to improve adaptive optimization, focusing on two main problems: collecting fine-grained program profiles with low overhead to guide feedback-directed optimization, and supporting continuous optimization and deoptimization by diverting execution across dynamically generated code versions. We present two profiling techniques: the first works at inter-procedural level to collect calling context information for hot code portions, while the second captures cyclic-path profiles within a function’s boundaries. Both techniques rely on efficient and elegant data structures, advancing the state of the art in both the theory and practice of performance profiling. We then focus our attention on supporting continuous optimization through on-stack replacement (OSR) mechanisms. We devise a new OSR framework encoded entirely at intermediate-representation level, which extends the best OSR practices with the ability to perform OSR at nearly any program location. Our techniques pave the road to aggressive optimizations and debugging techniques that were not supported by previous approaches. The main technical challenge is how to automatically generate compensation code to fix the program’s state across an OSR transition between different code versions. We present a conceptual framework for OSR, distilling its essence to a core calculus with an operational semantics.
Using bisimulation techniques, we describe how OSR can be correctly supported in the presence of common compiler optimizations, providing the first soundness results in this context. We implement our ideas in production systems such as Jikes RVM and the LLVM compiler toolchain, and evaluate their performance against a variety of prominent benchmarks. We investigate the end-to-end utility of our techniques in a series of case studies: we illustrate two possible applications of multi-iteration path profiling, and show how our OSR techniques advance the state of the art for MATLAB code optimization and for source-level debugging of optimized code. Parts of the results of this thesis have been published in PLDI, OOPSLA, CGO, and Software Practice and Experience.