600 research outputs found

    Building Efficient Query Engines in a High-Level Language

    Get PDF
    Abstraction without regret refers to the vision of using high-level programming languages for systems development without experiencing a negative impact on performance. A database system designed according to this vision offers both increased productivity and high performance, instead of sacrificing the former for the latter as is the case with existing, monolithic implementations that are hard to maintain and extend. In this article, we realize this vision in the domain of analytical query processing. We present LegoBase, a query engine written in the high-level language Scala. The key technique to regain efficiency is to apply generative programming: LegoBase performs source-to-source compilation and optimizes the entire query engine by converting the high-level Scala code to specialized, low-level C code. We show how generative programming allows to easily implement a wide spectrum of optimizations, such as introducing data partitioning or switching from a row to a column data layout, which are difficult to achieve with existing low-level query compilers that handle only queries. We demonstrate that sufficiently powerful abstractions are essential for dealing with the complexity of the optimization effort, shielding developers from compiler internals and decoupling individual optimizations from each other. We evaluate our approach with the TPC-H benchmark and show that: (a) With all optimizations enabled, LegoBase significantly outperforms a commercial database and an existing query compiler. (b) Programmers need to provide just a few hundred lines of high-level code for implementing the optimizations, instead of complicated low-level code that is required by existing query compilation approaches. (c) The compilation overhead is low compared to the overall execution time, thus making our approach usable in practice for compiling query engines

    Dynamic task scheduling and binding for many-core systems through stream rewriting

    Get PDF
    This thesis proposes a novel model of computation, called stream rewriting, for the specification and implementation of highly concurrent applications. Basically, the active tasks of an application and their dependencies are encoded as a token stream, which is iteratively modified by a set of rewriting rules at runtime. In order to estimate the performance and scalability of stream rewriting, a large number of experiments have been evaluated on many-core systems and the task management has been implemented in software and hardware.In dieser Dissertation wurde Stream Rewriting als eine neue Methode entwickelt, um Anwendungen mit einer großen Anzahl von dynamischen Tasks zu beschreiben und effizient zur Laufzeit verwalten zu können. Dabei werden die aktiven Tasks in einem Datenstrom verpackt, der zur Laufzeit durch wiederholtes Suchen und Ersetzen umgeschrieben wird. Um die Performance und Skalierbarkeit zu bestimmen, wurde eine Vielzahl von Experimenten mit Many-Core-Systemen durchgeführt und die Verwaltung von Tasks über Stream Rewriting in Software und Hardware implementiert

    A multi-level functional IR with rewrites for higher-level synthesis of accelerators

    Get PDF
    Specialised accelerators deliver orders of magnitude higher energy-efficiency than general-purpose processors. Field Programmable Gate Arrays (FPGAs) have become the substrate of choice, because the ever-changing nature of modern workloads, such as machine learning, demands reconfigurability. However, they are notoriously hard to program directly using Hardware Description Languages (HDLs). Traditional High-Level Synthesis (HLS) tools improve productivity, but come with their own problems. They often produce sub-optimal designs and programmers are still required to write hardware-specific code, thus development cycles remain long. This thesis proposes Shir, a higher-level synthesis approach for high-performance accelerator design with a hardware-agnostic programming entry point, a multi-level Intermediate Representation (IR), a compiler and rewrite rules for optimisation. First, a novel, multi-level functional IR structure for accelerator design is described. The IRs operate on different levels of abstraction, cleanly separating different hardware concerns. They enable the expression of different forms of parallelism and standard memory features, such as asynchronous off-chip memories or synchronous on-chip buffers, as well as arbitration of such shared resources. Exposing these features at the IR level is essential for achieving high performance. Next, mechanical lowering procedures are introduced to automatically compile a program specification through Shir’s functional IRs until low-level HDL code for FPGA synthesis is emitted. Each lowering step gradually adds implementation details. Finally, this thesis presents rewrite rules for automatic optimisations around parallelisation, buffering and data reshaping. Reshaping operations pose a challenge to functional approaches in particular. They introduce overheads that compromise performance or even prevent the generation of synthesisable hardware designs altogether. This fundamental issue is solved by the application of rewrite rules. The viability of this approach is demonstrated by running matrix multiplication and 2D convolution on an Intel Arria 10 FPGA. A limited design space exploration is conducted, confirming the ability of the IR to exploit various hardware features. Using rewrite rules for optimisation, it is possible to generate high-performance designs that are competitive with highly tuned OpenCL implementations and that outperform hardware-agnostic OpenCL code. The performance impact of the optimisations is further evaluated showing that they are essential to achieving high performance, and in many cases also necessary to produce hardware that fits the resource constraints

    A computing origami: Optimized code generation for emerging parallel platforms

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Mitigating security and privacy threats from untrusted application components on Android

    Get PDF
    Aufgrund von Androids datenzentrierter und Open-Source Natur sowie von fehlerhaften/bösartigen Apps durch das lockere Marktzulassungsverfahren, ist die Privatsphäre von Benutzern besonders gefährdet. Diese Dissertation präsentiert eine Reihe von Forschungsarbeiten, die die Bedrohung der Sicherheit/Privatsphäre durch nicht vertrauenswürdige Appkomponenten mindern. Die erste Arbeit stellt eine Compiler-basierte Kompartmentalisierungslösung vor, die Privilegientrennung nutzt, um eine starke Barriere zwischen der Host-App und Bibliothekskomponenten zu etablieren, und somit sensible Daten vor der Kompromittierung durch neugierige/bösartige Werbe-Bibliotheken schützt. Für fehleranfällige Bibliotheken von Drittanbietern implementieren wir in der zweiten Arbeit ein auf API-Kompatibilität basierendes Bibliothek-Update-Framework, das veraltete Bibliotheken durch Drop-Ins aktualisiert, um das durch Bibliotheken verursachte Zeitfenster der Verwundbarkeit zu minimieren. Die neueste Arbeit untersucht die missbräuchliche Nutzung von privilegierten Accessibility(a11y)-Funktionen in bösartigen Apps. Wir zeigen ein datenschutzfreundliches a11y-Framework, das die a11y-Logik wie eine Pipeline behandelt, die aus mehreren Modulen besteht, die in verschiedenen Sandboxen laufen. Weiterhin erzwingen wir eine Flusskontrolle über die Kommunikation zwischen den Modulen, wodurch die Angriffsfläche für den Missbrauch von a11y-APIs verringert wird, während die Vorteile von a11y erhalten bleiben.While Android’s data-intensive and open-source nature, combined with its less-than-strict market approval process, has allowed the installation of flawed and even malicious apps, its coarse-grained security model and update bottleneck in the app ecosystem make the platform’s privacy and security situation more worrying. This dissertation introduces a line of works that mitigate privacy and security threats from untrusted app components. The first work presents a compiler-based library compartmentalization solution that utilizes privilege separation to establish a strong trustworthy boundary between the host app and untrusted lib components, thus protecting sensitive user data from being compromised by curious or malicious ad libraries. While for vulnerable third-party libraries, we then build the second work that implements an API-compatibility-based library update framework using drop-in replacements of outdated libraries to minimize the open vulnerability window caused by libraries and we perform multiple dynamic tests and case studies to investigate its feasibility. Our latest work focuses on the misusing of powerful accessibility (a11y) features in untrusted apps. We present a privacy-enhanced a11y framework that treats the a11y logic as a pipeline composed of multiple modules running in different sandboxes. We further enforce flow control over the communication between modules, thus reducing the attack surface from abusing a11y APIs while preserving the a11y benefits

    Doctor of Philosophy

    Get PDF
    dissertationTrusted computing base (TCB) of a computer system comprises components that must be trusted in order to support its security policy. Research communities have identified the well-known minimal TCB principle, namely, the TCB of a system should be as small as possible, so that it can be thoroughly examined and verified. This dissertation is an experiment showing how small the TCB for an isolation service is based on software fault isolation (SFI) for small multitasking embedded systems. The TCB achieved by this dissertation includes just the formal definitions of isolation properties, instruction semantics, program logic, and a proof assistant, besides hardware. There is not a compiler, an assembler, a verifier, a rewriter, or an operating system in the TCB. To the best of my knowledge, this is the smallest TCB that has ever been shown for guaranteeing nontrivial properties of real binary programs on real hardware. This is accomplished by combining SFI techniques and high-confidence formal verification. An SFI implementation inserts dynamic checks before dangerous operations, and these checks provide necessary invariants needed by the formal verification to prove theorems about the isolation properties of ARM binary programs. The high-confidence assurance of the formal verification comes from two facts. First, the verification is based on an existing realistic semantics of the ARM ISA that is independently developed by Cambridge researchers. Second, the verification is conducted in a higher-order proof assistant-the HOL theorem prover, which mechanically checks every verification step by rigorous logic. In addition, the entire verification process, including both specification generation and verification, is automatic. To support proof automation, a novel program logic has been designed, and an automatic reasoning framework for verifying shallow safety properties has been developed. The program logic integrates Hoare-style reasoning and Floyd's inductive assertion reasoning together in a small set of definitions, which overcomes shortcomings of Hoare logic and facilitates proof automation. All inference rules of the logic are proven based on the instruction semantics and the logic definitions. The framework leverages abstract interpretation to automatically find function specifications required by the program logic. The results of the abstract interpretation are used to construct the function specifications automatically, and the specifications are proven without human interaction by utilizing intermediate theorems generated during the abstract interpretation. All these work in concert to create the very small TCB

    Lightweight Modular Staging and Embedded Compilers:Abstraction without Regret for High-Level High-Performance Programming

    Get PDF
    Programs expressed in a high-level programming language need to be translated to a low-level machine dialect for execution. This translation is usually accomplished by a compiler, which is able to translate any legal program to equivalent low-level code. But for individual source programs, automatic translation does not always deliver good results: Software engineering practice demands generalization and abstraction, whereas high performance demands specialization and concretization. These goals are at odds, and compilers can only rarely translate expressive high-level programs tomodern hardware platforms in a way that makes best use of the available resources. Explicit program generation is a promising alternative to fully automatic translation. Instead of writing down the program and relying on a compiler for translation, developers write a program generator, which produces a specialized, efficient, low-level program as its output. However, developing high-quality program generators requires a very large effort that is often hard to amortize. In this thesis, we propose a hybrid design: Integrate compilers into programs so that programs can take control of the translation process, but rely on libraries of common compiler functionality for help. We present Lightweight Modular Staging (LMS), a generative programming approach that lowers the development effort significantly. LMS combines program generator logic with the generated code in a single program, using only types to distinguish the two stages of execution. Through extensive use of component technology, LMS makes a reusable and extensible compiler framework available at the library level, allowing programmers to tightly integrate domain-specific abstractions and optimizations into the generation process, with common generic optimizations provided by the framework. Compared to previous work on programgeneration, a key aspect of our design is the use of staging not only as a front-end, but also as a way to implement internal compiler passes and optimizations, many of which can be combined into powerful joint simplification passes. LMS is well suited to develop embedded domain specific languages (DSLs) and has been used to develop powerful performance-oriented DSLs for demanding domains such as machine learning, with code generation for heterogeneous platforms including GPUs. LMS has also been used to generate SQL for embedded database queries and JavaScript for web applications

    SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator

    Get PDF
    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (\eg multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.</p
    corecore