
    Application-Specific Memory Subsystems

    The disparity in performance between processors and main memories has led computer architects to incorporate large cache hierarchies in modern computers. These cache hierarchies are designed to be general-purpose in that they strive to provide the best possible performance across a wide range of applications. However, such a memory subsystem does not necessarily provide the best possible performance for a particular application. General-purpose memory subsystems are desirable when the workload is unknown and the memory subsystem must remain fixed; when this is not the case, however, a custom memory subsystem may be beneficial. For example, in an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) designed to run a particular application, a custom memory subsystem optimized for that application would be desirable. In addition, when there are tunable parameters in the memory subsystem, it may make sense to change these parameters depending on the application being run. Such a situation arises today with FPGAs and, to a lesser extent, GPUs, and it is plausible that general-purpose computers will begin to support greater flexibility in the memory subsystem in the future. In this dissertation, we first show that it is possible to create application-specific memory subsystems that provide much better performance than a general-purpose memory subsystem. In addition, we show a way to discover such memory subsystems automatically using a superoptimization technique on memory address traces gathered from applications. This allows one to generate a custom memory subsystem with little effort. We next show that our memory subsystem superoptimization technique can be used to optimize for objectives other than performance. As an example, we show that it is possible to reduce the number of writes to the main memory, which can be useful for main memories with limited write durability, such as flash or Phase-Change Memory (PCM). Finally, we show how to superoptimize memory subsystems for streaming applications, which are a class of parallel applications. In particular, we show that, through the use of ScalaPipe, we can author and deploy streaming applications targeting FPGAs with superoptimized memory subsystems. ScalaPipe is a domain-specific language (DSL) embedded in the Scala programming language for generating streaming applications that can be implemented on CPUs and FPGAs. Using the ScalaPipe implementation, we are able to demonstrate actual performance improvements using the superoptimized memory subsystem with applications implemented in hardware.
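
    The superoptimization described above searches over candidate memory-subsystem configurations and scores each one against an address trace gathered from the application. The snippet below is a minimal sketch of that idea, assuming a simple LRU set-associative cache model, a miss-count cost function, and a hill-climbing search over hypothetical parameter ranges; none of these details are taken from the dissertation itself.

```python
import random
from collections import OrderedDict

def simulate_cache(trace, line_size, num_sets, ways):
    """Count misses for an LRU set-associative cache on an address trace."""
    sets = [OrderedDict() for _ in range(num_sets)]    # per-set LRU state
    misses = 0
    for addr in trace:
        block = addr // line_size
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)                       # hit: refresh LRU position
        else:
            misses += 1
            s[block] = True
            if len(s) > ways:
                s.popitem(last=False)                  # evict least recently used
    return misses

def mutate(cfg):
    """Randomly perturb one cache parameter (powers of two keep indexing simple)."""
    line_size, num_sets, ways = cfg
    which = random.choice(["line", "sets", "ways"])
    if which == "line":
        line_size = random.choice([16, 32, 64, 128])
    elif which == "sets":
        num_sets = random.choice([16, 32, 64, 128, 256])
    else:
        ways = random.choice([1, 2, 4, 8])
    return (line_size, num_sets, ways)

def superoptimize(trace, steps=200):
    """Hill-climb over configurations, keeping the one with the fewest misses."""
    best = (64, 64, 4)
    best_cost = simulate_cache(trace, *best)
    for _ in range(steps):
        cand = mutate(best)
        cost = simulate_cache(trace, *cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

if __name__ == "__main__":
    # Synthetic strided trace standing in for a real application address trace.
    trace = [i * 48 % 65536 for i in range(20000)]
    print(superoptimize(trace))
```

    The same loop generalizes to other objectives mentioned in the abstract, such as counting writes that reach main memory instead of counting misses.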

    Superoptimization of WebAssembly Bytecode

    Motivated by the fast adoption of WebAssembly, we propose the first functional pipeline to support the superoptimization of WebAssembly bytecode. Our pipeline works over LLVM and Souper. We evaluate our superoptimization pipeline with 12 programs from the Rosetta Code project. Our pipeline improves the code section size of 8 out of 12 programs. We discuss the challenges faced in superoptimization of WebAssembly with two case studies. Comment: 4 pages, 3 figures. Proceedings of MoreVMs: Workshop on Modern Language Runtimes, Ecosystems, and VMs (2020).
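
    The reported benefit is a reduction in the size of the WebAssembly code section. A small helper like the one below can measure that: it walks the section headers of a .wasm binary (magic and version, then one-byte section ids followed by LEB128 sizes) and reports the code section's payload size for a file before and after optimization. The file names are placeholders and the script is an illustration, not part of the authors' pipeline.

```python
import sys

def read_leb128_u32(buf, pos):
    """Decode an unsigned LEB128 integer, returning (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7

def code_section_size(path):
    """Return the payload size of the code section (id 10) of a .wasm file."""
    data = open(path, "rb").read()
    assert data[:4] == b"\x00asm", "not a wasm binary"
    pos = 8                                  # skip magic and version
    while pos < len(data):
        section_id = data[pos]
        size, pos = read_leb128_u32(data, pos + 1)
        if section_id == 10:
            return size
        pos += size                          # skip this section's payload
    return 0

if __name__ == "__main__":
    before, after = sys.argv[1], sys.argv[2]     # e.g. original.wasm optimized.wasm
    print(code_section_size(before), "->", code_section_size(after), "bytes")
```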

    Unbounded Superoptimization

    Our aim is to enable software to take full advantage of the capabilities of emerging microprocessor designs without modifying the compiler. Towards this end, we propose a new approach to code generation and optimization. Our approach uses an SMT solver in a novel way to generate efficient code for modern architectures and guarantee that the generated code correctly implements the source code. The distinguishing characteristic of our approach is that the size of the constraints does not depend on the candidate sequence of instructions. To study the feasibility of our approach, we implemented a preliminary prototype, which takes as input LLVM IR code and uses the Z3 SMT solver to generate ARMv7-A assembly. The prototype handles arbitrary loop-free code (not only basic blocks) as input and output. We applied it to small but tricky examples used as standard benchmarks for other superoptimization and synthesis tools. We are encouraged to see that Z3 successfully solved complex constraints that arise from our approach. This work paves the way to employing recent advances in SMT solvers and has the potential to advance SMT solvers further by providing a new category of challenging benchmarks that come from an industrial application domain.
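
    The essential ingredient of such an approach is letting an SMT solver certify that a candidate instruction sequence implements the source semantics for every input. The snippet below sketches only that verification step, using Z3's Python bindings rather than the C API and an arbitrary branch-free absolute-value identity as the specification; it is not the prototype described above, which emits ARMv7-A assembly from LLVM IR.

```python
from z3 import BitVec, If, Solver, unsat

x = BitVec("x", 32)

# Specification: absolute value written with an explicit branch.
spec = If(x < 0, -x, x)

# Candidate branch-free sequence a superoptimizer might propose:
# mask = x >> 31 (arithmetic shift), result = (x ^ mask) - mask.
mask = x >> 31
candidate = (x ^ mask) - mask

# The candidate is correct iff no input distinguishes it from the spec.
s = Solver()
s.add(spec != candidate)
if s.check() == unsat:
    print("candidate is equivalent to the specification")
else:
    print("counterexample:", s.model())
```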

    Adaptive Neural Compilation

    This paper proposes an adaptive neural-compilation framework to address the problem of efficient program learning. Traditional code optimisation strategies used in compilers are based on applying a pre-specified set of transformations that make the code faster to execute without changing its semantics. In contrast, our work involves adapting programs to make them more efficient while considering correctness only on a target input distribution. Our approach is inspired by recent work on differentiable representations of programs. We show that it is possible to compile programs written in a low-level language to a differentiable representation. We also show how programs in this representation can be optimised to make them efficient on a target distribution of inputs. Experimental results demonstrate that our approach enables learning specifically-tuned algorithms for given data distributions with a high success rate. Comment: Submitted to NIPS 2016, code and supplementary materials will be available on the author's page.
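
    A differentiable representation of programs typically replaces the discrete choice of which instruction to execute with a probability distribution over instructions, so that the machine's output becomes a smooth function of the program parameters and can be adjusted by gradient descent. The toy example below illustrates only that idea for a two-register machine with a softmax over four arithmetic operations; it is a conceptual sketch using NumPy, not the compilation framework proposed in the paper.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def soft_step(regs, op_logits):
    """Execute one 'soft' instruction: blend the results of all candidate ops."""
    a, b = regs
    candidates = np.array([a + b, a - b, a * b, a])   # ADD, SUB, MUL, NOP
    weights = softmax(op_logits)
    return np.array([weights @ candidates, b])        # write blended result to r0

# A 'program' is just the logits of its instructions; here, a single instruction.
regs = np.array([3.0, 4.0])
print(soft_step(regs, np.zeros(4)))                   # uniform blend of all ops

# Sharpening the distribution toward ADD recovers near-discrete execution.
print(soft_step(regs, np.array([10.0, 0.0, 0.0, 0.0])))   # close to [7., 4.]
```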