89 research outputs found

    StreamJIT: A Commensal Compiler for High-Performance Stream Programming

    There are many domain libraries, but despite the performance benefits of compilation, domain-specific languages are comparatively rare due to the high cost of implementing an optimizing compiler. We propose commensal compilation, a new strategy for compiling embedded domain-specific languages by reusing the massive investment in modern language virtual machine platforms. Commensal compilers use the host language's front-end, use host platform APIs that enable back-end optimizations by the host platform JIT, and use an autotuner for optimization selection. The cost of implementing a commensal compiler is only the cost of implementing the domain-specific optimizations. We demonstrate the concept by implementing a commensal compiler for the stream programming language StreamJIT atop the Java platform. Our compiler achieves performance 2.8 times better than the StreamIt native code (via GCC) compiler with considerably less implementation effort. Funding: United States. Dept. of Energy. Office of Science (X-Stack Award DE-SC0008923); Intel Corporation (Science and Technology Center for Big Data); SMART3 Graduate Fellowshi

    Deep Static Modeling of invokedynamic

    Java 7 introduced programmable dynamic linking in the form of the invokedynamic framework. Static analysis of code containing programmable dynamic linking has often been cited as a significant source of unsoundness in the analysis of Java programs. For example, Java lambdas, introduced in Java 8, are a very popular feature, which is, however, resistant to static analysis, since it mixes invokedynamic with dynamic code generation. These techniques invalidate static analysis assumptions: programmable linking breaks reasoning about method resolution while dynamically generated code is, by definition, not available statically. In this paper, we show that a static analysis can predictively model uses of invokedynamic while also cooperating with extra rules to handle the runtime code generation of lambdas. Our approach plugs into an existing static analysis and helps eliminate all unsoundness in the handling of lambdas (including associated features such as method references) and generic invokedynamic uses. We evaluate our technique on a benchmark suite of our own and on third-party benchmarks, uncovering all code previously unreachable due to unsoundness, highly efficiently

    Building-Blocks for Performance Oriented DSLs

    Domain-specific languages raise the level of abstraction in software development. While it is evident that programmers can more easily reason about very high-level programs, the same holds for compilers only if the compiler has an accurate model of the application domain and the underlying target platform. Since mapping high-level, general-purpose languages to modern, heterogeneous hardware is becoming increasingly difficult, DSLs are an attractive way to capitalize on improved hardware performance, precisely by making the compiler reason on a higher level. Implementing efficient DSL compilers is a daunting task, however, and support for building performance-oriented DSLs is urgently needed. To this end, we present the Delite Framework, an extensible toolkit that drastically simplifies building embedded DSLs and compiling DSL programs for execution on heterogeneous hardware. We discuss several building blocks in some detail and present experimental results for the OptiML machine-learning DSL implemented on top of Delite. Comment: In Proceedings DSL 2011, arXiv:1109.032

    Lightweight Modular Staging and Embedded Compilers: Abstraction without Regret for High-Level High-Performance Programming

    Programs expressed in a high-level programming language need to be translated to a low-level machine dialect for execution. This translation is usually accomplished by a compiler, which is able to translate any legal program to equivalent low-level code. But for individual source programs, automatic translation does not always deliver good results: software engineering practice demands generalization and abstraction, whereas high performance demands specialization and concretization. These goals are at odds, and compilers can only rarely translate expressive high-level programs to modern hardware platforms in a way that makes best use of the available resources. Explicit program generation is a promising alternative to fully automatic translation. Instead of writing down the program and relying on a compiler for translation, developers write a program generator, which produces a specialized, efficient, low-level program as its output. However, developing high-quality program generators requires a very large effort that is often hard to amortize. In this thesis, we propose a hybrid design: integrate compilers into programs so that programs can take control of the translation process, but rely on libraries of common compiler functionality for help. We present Lightweight Modular Staging (LMS), a generative programming approach that lowers the development effort significantly. LMS combines program generator logic with the generated code in a single program, using only types to distinguish the two stages of execution. Through extensive use of component technology, LMS makes a reusable and extensible compiler framework available at the library level, allowing programmers to tightly integrate domain-specific abstractions and optimizations into the generation process, with common generic optimizations provided by the framework. Compared to previous work on program generation, a key aspect of our design is the use of staging not only as a front-end, but also as a way to implement internal compiler passes and optimizations, many of which can be combined into powerful joint simplification passes. LMS is well suited to developing embedded domain-specific languages (DSLs) and has been used to develop powerful performance-oriented DSLs for demanding domains such as machine learning, with code generation for heterogeneous platforms including GPUs. LMS has also been used to generate SQL for embedded database queries and JavaScript for web applications
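
The staging idea in the abstract above (generator logic and generated code written in one program, with the type of a value deciding the stage) can be illustrated with a small sketch. This is not the LMS API, which is a Scala library; the `Rep` class, `lift`, `power`, and `generate_power` are hypothetical names for this illustration, and Python's runtime class check stands in for the static types LMS relies on.

```python
class Rep:
    """A staged value: it stands for an expression in the generated program."""

    def __init__(self, code):
        self.code = code

    def __mul__(self, other):
        # Multiplying staged values emits code instead of computing a number.
        return Rep(f"({self.code} * {lift(other).code})")

    def __rmul__(self, other):
        return Rep(f"({lift(other).code} * {self.code})")


def lift(value):
    """Turn a generation-time constant into a staged expression."""
    return value if isinstance(value, Rep) else Rep(repr(value))


def power(base, exponent):
    # `exponent` is a plain int, known at generation time, so this loop runs
    # now and disappears from the generated code. Assumes exponent >= 1.
    result = base
    for _ in range(exponent - 1):
        result = result * base
    return result


def generate_power(exponent):
    """Emit Python source for a function specialized to one exponent."""
    body = power(Rep("x"), exponent)
    return f"def power{exponent}(x):\n    return {body.code}\n"


source = generate_power(3)
namespace = {}
exec(source, namespace)  # compile the specialized, loop-free function
power3 = namespace["power3"]
```

The same `power` function runs unstaged on plain integers and staged on `Rep` values; only the argument's type changes which stage the multiplication belongs to.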

    Miniphases: Compilation using Modular and Efficient Tree Transformations

    Production compilers commonly perform dozens of transformations on an intermediate representation. Running those transformations in separate passes harms performance. One approach to recover performance is to combine transformations by hand in order to reduce the number of passes. Such an approach harms modularity, and thus makes it hard to maintain and evolve a compiler over the long term, and makes reasoning about performance harder. This paper describes a methodology that allows a compiler writer to define multiple transformations separately, but fuse them into a single traversal of the intermediate representation when the compiler runs. This approach has been implemented in a compiler for the Scala language. Our performance evaluation indicates that this approach reduces the running time of tree transformations by 35% and shows that this is due to improved cache friendliness. At the same time, the approach improves total memory consumption by reducing the object tenuring rate by 50%. This approach enables compiler writers to write transformations that are both modular and fast at the same time
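
The fusion idea described in the abstract above can be sketched on a toy expression tree. This is an illustration only, not the Scala compiler's implementation: the node classes, the two example phases, and the `traverse`, `run_separately`, and `run_fused` helpers are hypothetical, and the sketch assumes each phase rewrites a single node whose children have already been transformed. In general, fused and separate execution agree only when the phases satisfy suitable conditions; the two phases here happen to.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class Add:
    left: "Tree"
    right: "Tree"


@dataclass(frozen=True)
class Mul:
    left: "Tree"
    right: "Tree"


Tree = Union[Num, Add, Mul]
Phase = Callable[[Tree], Tree]  # rewrites one node, children already done


def drop_mul_by_one(node: Tree) -> Tree:
    """Rewrite x * 1 and 1 * x to x."""
    if isinstance(node, Mul):
        if node.right == Num(1):
            return node.left
        if node.left == Num(1):
            return node.right
    return node


def fold_constants(node: Tree) -> Tree:
    """Evaluate an Add or Mul whose operands are both literals."""
    if isinstance(node, (Add, Mul)):
        if isinstance(node.left, Num) and isinstance(node.right, Num):
            a, b = node.left.value, node.right.value
            return Num(a + b if isinstance(node, Add) else a * b)
    return node


def traverse(tree: Tree, phases: List[Phase], visits: List[int]) -> Tree:
    """Bottom-up traversal applying every given phase at each node."""
    visits[0] += 1  # count node visits to compare the two strategies
    if isinstance(tree, (Add, Mul)):
        tree = type(tree)(
            traverse(tree.left, phases, visits),
            traverse(tree.right, phases, visits),
        )
    for phase in phases:
        tree = phase(tree)
    return tree


def run_separately(tree: Tree, phases: List[Phase], visits: List[int]) -> Tree:
    """One full traversal of the tree per phase."""
    for phase in phases:
        tree = traverse(tree, [phase], visits)
    return tree


def run_fused(tree: Tree, phases: List[Phase], visits: List[int]) -> Tree:
    """A single traversal that runs all phases at each node."""
    return traverse(tree, phases, visits)


example = Mul(Add(Num(1), Num(2)), Num(1))
phases = [drop_mul_by_one, fold_constants]
separate_visits, fused_visits = [0], [0]
separate_result = run_separately(example, phases, separate_visits)
fused_result = run_fused(example, phases, fused_visits)
```

Each phase stays a separate function, yet the fused run touches every node once instead of once per phase, which is the modularity-plus-speed trade the abstract describes.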

    Machine Learning at Microsoft with ML .NET

    Machine Learning is transitioning from an art and science into a technology available to every developer. In the near future, every application on every platform will incorporate trained models to encode data-based decisions that would be impossible for developers to author. This presents a significant engineering challenge, since currently data science and modeling are largely decoupled from standard software development processes. This separation makes incorporating machine learning capabilities inside applications unnecessarily costly and difficult, and furthermore discourages developers from embracing ML in the first place. In this paper we present ML .NET, a framework developed at Microsoft over the last decade in response to the challenge of making it easy to ship machine learning models in large software applications. We present its architecture, and illuminate the application demands that shaped it. Specifically, we introduce DataView, the core data abstraction of ML .NET which allows it to capture full predictive pipelines efficiently and consistently across training and inference lifecycles. We close the paper with a surprisingly favorable performance study of ML .NET compared to more recent entrants, and a discussion of some lessons learned

    RHEA: a reactive, heterogeneous, extensible and abstract framework for dataflow programming

    The dataflow computational model enables writing highly parallel programs, which will be deployed on a heterogeneous network, in a concise and readable way. The main advantage is the fact that the system can be conceptually separated into several independent components that can be run in parallel and deployed on different machines. Therefore, concurrency and distribution are implicit and little or no responsibility is given to the programmer. The framework proposed in this thesis constitutes the underlying system that makes this style of programming possible in JVM-based languages (e.g. Java, Scala, Clojure), while at the same time making it easy to integrate other technologies that rely on the PubSub model, in order to move away from imperative languages and enter a higher level of abstraction. Particular emphasis was put on three domains, namely Big Data, Robotics and IoT

    Spatial Statistical Data Fusion on Java-enabled Machines in Ubiquitous Sensor Networks

    Wireless Sensor Networks (WSN) consist of small, cheap devices that have a combination of sensing, computing and communication capabilities. They must be able to communicate and process data efficiently using a minimum amount of energy and cover an area of interest with the minimum number of sensors. This thesis proposes the use of techniques that were designed for Geostatistics and applies them to the WSN field. Kriging and Cokriging interpolation, which can be considered Information Fusion algorithms, were tested to prove the feasibility of the methods to increase coverage. To reduce energy consumption, a compression method that models correlations based on variograms was developed. A second challenge is to establish communication with external networks and to react to unexpected events. A demonstrator that uses commercial Java-enabled devices was implemented. It is able to perform remote monitoring, send SMS alarms and deploy remote updates

    Heterogeneous parallel virtual machine: A portable program representation and compiler for performance and energy optimizations on heterogeneous parallel systems

    Programming heterogeneous parallel systems, such as the SoCs (System-on-Chip) on mobile and edge devices, is extremely difficult; the diverse parallel hardware they contain exposes vastly different hardware instruction sets, parallelism models and memory systems. Moreover, a wide range of diverse hardware and software approximation techniques are available for applications targeting heterogeneous SoCs, further exacerbating the programmability challenges. In this thesis, we alleviate the programmability challenges of such systems using flexible compiler intermediate representation solutions, in order to benefit from the performance and superior energy efficiency of heterogeneous systems. First, we develop Heterogeneous Parallel Virtual Machine (HPVM), a parallel program representation for heterogeneous systems, designed to enable functional and performance portability across popular parallel hardware. HPVM is based on a hierarchical dataflow graph with side effects. HPVM successfully supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling. We use the HPVM representation to implement an HPVM prototype, defining the HPVM IR as an extension of the Low Level Virtual Machine (LLVM) IR. Our results show comparable performance with optimized OpenCL kernels for the target hardware from a single HPVM representation using translators from HPVM virtual ISA to native code, IR optimizations operating directly on the HPVM representation, and the capability for supporting flexible runtime scheduling schemes from a single HPVM representation. We extend HPVM to ApproxHPVM, introducing hardware-independent approximation metrics in the IR to enable maintaining accuracy information at the IR level and mapping of application-level end-to-end quality metrics to system-level "knobs". The approximation metrics quantify the acceptable accuracy loss for individual computations. Application programmers only need to specify high-level, end-to-end quality metrics, instead of detailed parameters for individual approximation methods. The ApproxHPVM system then automatically tunes the accuracy requirements of individual computations and maps them to approximate hardware when possible. ApproxHPVM results show significant performance and energy improvements for popular deep learning benchmarks. Finally, we extend ApproxHPVM to ApproxTuner, a compiler and runtime system for approximation. ApproxTuner extends ApproxHPVM with a wide range of hardware and software approximation techniques. It uses a three-step approximation tuning strategy, a combination of development-time, install-time, and dynamic tuning. Our strategy ensures software portability, even though approximations have highly hardware-dependent performance, and enables efficient dynamic approximation tuning despite the expensive offline steps. ApproxTuner results show significant performance and energy improvements across 7 Deep Neural Networks and 3 image processing benchmarks, and ApproxTuner ensures that high-level end-to-end quality specifications are satisfied during adaptive approximation tuning