545 research outputs found

    Packet Transactions: High-level Programming for Line-Rate Switches

    Many algorithms for congestion control, scheduling, network measurement, active queue management, security, and load balancing require custom processing of packets as they traverse the data plane of a network switch. To run at line rate, these data-plane algorithms must be implemented in hardware. With today's switch hardware, algorithms cannot be changed, nor new algorithms installed, after a switch has been built. This paper shows how to program data-plane algorithms in a high-level language and compile those programs into low-level microcode that can run on emerging programmable line-rate switching chipsets. The key challenge is that these algorithms create and modify algorithmic state. The key idea to achieve line-rate programmability for stateful algorithms is the notion of a packet transaction: a sequential code block that is atomic and isolated from other such code blocks. We have developed this idea in Domino, a C-like imperative language to express data-plane algorithms. We show with many examples that Domino provides a convenient and natural way to express sophisticated data-plane algorithms, and show that these algorithms can be run at line rate with modest estimated die-area overhead. Comment: 16 pages
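
    The packet-transaction idea in the abstract above (a sequential block that is atomic and isolated from other such blocks) can be illustrated with a small sketch. This is not Domino syntax and not the paper's compiler output; it is a software model in Python in which a lock stands in for the atomicity that the hardware would provide. The class name `SamplingTransaction`, the `period` parameter, and the `sampled` packet field are hypothetical names chosen for illustration.

```python
import threading


class SamplingTransaction:
    """Hypothetical stateful per-packet algorithm: mark every Nth packet.

    Illustrative only: a lock models the atomic, isolated execution
    that a packet transaction is described as having.
    """

    def __init__(self, period):
        self.period = period          # mark one packet out of every `period`
        self.count = 0                # state that persists across packets
        self._lock = threading.Lock()

    def process(self, packet):
        # The whole body runs as one unit: no other packet can observe
        # or modify `count` between the increment and the reset.
        with self._lock:
            self.count += 1
            if self.count == self.period:
                self.count = 0
                packet["sampled"] = True
            else:
                packet["sampled"] = False
        return packet


tx = SamplingTransaction(period=3)
marks = [tx.process({})["sampled"] for _ in range(6)]
```

    The point of the sketch is that the programmer writes the body as plain sequential code over the state; guaranteeing that it behaves atomically at line rate is the job the abstract assigns to the compiler and the hardware.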

    Better Loop Fusion for LMS

    This is my master's thesis, done at PPL in Stanford under the supervision of Prof. Kunle Olukotun. It improved LMS, a framework for embedding DSLs (domain-specific languages) into Scala which features many general optimizations that can be used by any DSL for free. I implemented a more powerful and cleaner version of the loop fusion optimization from the compiler world. Loop fusion is an important performance optimization for all languages that feature list comprehensions and translate their high-level operations into loop-based representations. It can decrease runtime, memory footprint and code size through two different fusion cases. The simpler one is called horizontal or side-by-side fusion and fuses adjacent loops iterating over the same range, enabling further optimizations. The second one is vertical or pipeline fusion, where a producer and a consumer of data are fused, removing the need for the intermediate data structure.
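
    The two fusion cases named in the abstract above can be shown by hand on small loops. The sketch below is written in Python rather than in LMS or Scala, and it applies the transformations manually; it does not reproduce the thesis's implementation. All function names are hypothetical.

```python
def sum_and_max_unfused(xs):
    # Two adjacent loops over the same range (assumes xs is non-empty).
    total = 0
    for x in xs:
        total += x
    largest = xs[0]
    for x in xs:
        if x > largest:
            largest = x
    return total, largest


def sum_and_max_horizontally_fused(xs):
    # Horizontal (side-by-side) fusion: one traversal does both jobs.
    total = 0
    largest = xs[0]
    for x in xs:
        total += x
        if x > largest:
            largest = x
    return total, largest


def sum_of_squares_unfused(xs):
    # Producer loop materialises an intermediate list...
    squares = []
    for x in xs:
        squares.append(x * x)
    # ...which the consumer loop then reads.
    total = 0
    for s in squares:
        total += s
    return total


def sum_of_squares_vertically_fused(xs):
    # Vertical (pipeline) fusion: the producer's body is inlined into
    # the consumer, so the intermediate list is never built.
    total = 0
    for x in xs:
        total += x * x
    return total
```

    In both pairs the fused version computes the same result with fewer traversals, and in the vertical case without the intermediate list, which is where the runtime and memory savings mentioned in the abstract come from.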

    Compiler for statically scheduled message passing in parallel programs

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 95-96). Performance improvement in future microprocessors will rely more on the exploitation of parallelism than on increases in clock frequency, leading to more multi-core and tiled processor architectures. Despite continuing research into parallelizing compilers, programming multiple instruction stream architectures remains difficult. This document describes C-Flow, a compiler system enabling statically-scheduled message passing between programs running on separate processors. When combined with statically-scheduled, low-latency networks like those in the MIT Raw processor, C-Flow provides the programmer with a simple but comprehensive messaging interface that can be used from high-level languages like C. The use of statically-scheduled messaging allows for fine-grained (single-word) messages that would be quite inefficient in the more traditional message passing systems used in cluster computers. Such fine-grained parallelism is possible because, as in systolic array machines, the network provides all of the necessary synchronization between tiles. On the Raw processor, C-Flow reduces development complexity by allowing the programmer to schedule static messages from a high-level language instead of using assembly code. C-Flow programs have been developed for arrays with 64 or more processor tiles and have demonstrated performance within twenty percent of hand-optimized assembly. By Patrick Griffin. M.Eng.

    VFC: The Vienna Fortran Compiler


    PERFORMANCE OPTIMIZATION OF A STRUCTURED CFD CODE - GHOST ON COMMODITY CLUSTER ARCHITECTURES

    This thesis focuses on optimizing the performance of an in-house, structured, 2D CFD code – GHOST, on commodity cluster architectures. The basic philosophy of the work is to optimize the cache usage of the code by implementing efficient coding techniques without changing the underlying numerical algorithm. Various optimization techniques that were implemented and the resulting changes in performance are presented. Two techniques, external and internal blocking, that were implemented earlier to tune the performance of this code are reviewed. What follows is a further tuning effort to circumvent the problems associated with using the blocking techniques. Later, to establish the universality of the optimization techniques, testing was done on a more complicated test case. All the techniques presented in this thesis have been tested on steady, laminar test cases. It has been shown that optimized versions of the code achieve better performance on a variety of commodity cluster architectures chosen in this study.

    Scaling non-regular shared-memory codes by reusing custom loop schedules

    In this paper we explore the idea of customizing and reusing loop schedules to improve the scalability of non-regular numerical codes in shared-memory architectures with non-uniform memory access latency. The main objective is to implicitly set up affinity links between threads and data, by devising loop schedules that achieve balanced work distribution within irregular data spaces and reusing them as much as possible during the execution of the program for better memory access locality. This transformation provides a great deal of flexibility in optimizing locality, without compromising the simplicity of the shared-memory programming paradigm. In particular, the programmer does not need to explicitly distribute data between processors. The paper presents practical examples from real applications and experiments showing the efficiency of the approach. Peer Reviewed. Postprint (author's final draft)
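
    As a rough illustration of the approach described in the abstract above, the sketch below builds a cost-balanced loop schedule once and reuses it for every sweep, so each thread would keep touching the same slice of the data. It is an assumption-laden model, not the paper's method: the per-iteration `costs`, the contiguous-chunk policy, and the names `build_schedule` and `run_sweeps` are all hypothetical, and the chunks are executed sequentially here rather than on real threads.

```python
from bisect import bisect_left
from itertools import accumulate


def build_schedule(costs, num_threads):
    """Split iterations into contiguous chunks of roughly equal total cost.

    Assumes `costs` is a non-empty list of positive per-iteration costs.
    Returns one (start, stop) half-open range per thread.
    """
    prefix = list(accumulate(costs))
    total = prefix[-1]
    bounds = [0]
    for t in range(1, num_threads):
        target = total * t / num_threads
        # Cut just after the first iteration whose running cost reaches the target.
        cut = bisect_left(prefix, target) + 1
        bounds.append(max(min(cut, len(costs)), bounds[-1]))
    bounds.append(len(costs))
    return [(bounds[i], bounds[i + 1]) for i in range(num_threads)]


def run_sweeps(data, costs, num_threads, sweeps):
    # The schedule is computed once and reused by every sweep, so the
    # same (hypothetical) thread always updates the same elements.
    schedule = build_schedule(costs, num_threads)
    for _ in range(sweeps):
        for start, stop in schedule:   # each chunk would run on its own thread
            for i in range(start, stop):
                data[i] += 1.0
    return schedule


values = [0.0] * 6
used_schedule = run_sweeps(values, [4, 1, 1, 1, 1, 4], num_threads=2, sweeps=3)
```

    On a machine with non-uniform memory access, keeping the iteration-to-thread mapping fixed across sweeps is what lets data stay close to the thread that uses it, without the programmer distributing data explicitly.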

    Abstraction Raising in General-Purpose Compilers


    An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor

    Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing, which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations, with 10 implemented on an SoC-FPGA and the remainder on a cycle-accurate simulator. Across the 12 OpenCL benchmarks, results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores) and, in one case, super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature.