Packet Transactions: High-level Programming for Line-Rate Switches

Authors: Alizadeh Mohammad; Balakrishnan Hari; Budiu Mihai; Cheung Alvin; Kim Changhoon; Licking Steve; McKeown Nick; Sivaraman Anirudh; Varghese George
Publication date: 29/01/2016

Many algorithms for congestion control, scheduling, network measurement, active queue management, security, and load balancing require custom processing of packets as they traverse the data plane of a network switch. To run at line rate, these data-plane algorithms must be implemented in hardware. With today's switch hardware, algorithms cannot be changed, nor new algorithms installed, after a switch has been built. This paper shows how to program data-plane algorithms in a high-level language and compile those programs into low-level microcode that can run on emerging programmable line-rate switching chipsets. The key challenge is that these algorithms create and modify algorithmic state. The key idea to achieve line-rate programmability for stateful algorithms is the notion of a packet transaction: a sequential code block that is atomic and isolated from other such code blocks. We have developed this idea in Domino, a C-like imperative language to express data-plane algorithms. We show with many examples that Domino provides a convenient and natural way to express sophisticated data-plane algorithms, and show that these algorithms can be run at line rate with modest estimated die-area overhead. Comment: 16 pages

arXiv.org e-Print Archive

DSpace@MIT

Better Loop Fusion for LMS

Author: Salvisberg Véra
Publication date: 12/06/2015

This is my master's thesis, done at PPL at Stanford under the supervision of Prof. Kunle Olukotun. It improved LMS, a framework for embedding DSLs (domain-specific languages) into Scala, which features many general optimizations that any DSL can use for free. I implemented a more powerful and cleaner version of the loop fusion optimization from the compiler world. Loop fusion is an important performance optimization for all languages that feature list comprehensions and translate their high-level operations into loop-based representations. It can decrease runtime, memory footprint and code size through two different fusion cases. The simpler one, called horizontal or side-by-side fusion, fuses adjacent loops iterating over the same range, enabling further optimizations. The second one is vertical or pipeline fusion, where a producer and a consumer of data are fused, removing the need for the intermediate data structure.
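The two fusion cases named in this abstract can be illustrated with a small hand-written sketch. It is in Python rather than Scala/LMS, and the function names `unfused` and `fused` are hypothetical, chosen only for illustration; it shows the effect of the transformation, not how LMS implements it.

```python
def unfused(xs):
    # Loop 1: producer that builds an intermediate list.
    squares = []
    for x in xs:
        squares.append(x * x)

    # Loop 2: adjacent loop over the same range as loop 1
    # (a candidate for horizontal, side-by-side fusion).
    doubles = []
    for x in xs:
        doubles.append(2 * x)

    # Loop 3: consumer of `squares`
    # (a candidate for vertical, pipeline fusion with loop 1).
    total = 0
    for s in squares:
        total += s

    return doubles, total


def fused(xs):
    # A single loop after applying both fusions by hand.
    doubles = []
    total = 0
    for x in xs:
        doubles.append(2 * x)  # horizontal: loops 1 and 2 share one traversal
        total += x * x         # vertical: producer inlined, no `squares` list
    return doubles, total
```

Both versions compute the same result, but the fused one traverses the input once and never allocates the intermediate `squares` list.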

Infoscience - École polytechnique fédérale de Lausanne

Compiler for statically scheduled message passing in parallel programs

Author: Griffin Patrick (Patrick Robert)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2005

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 95-96). Performance improvement in future microprocessors will rely more on the exploitation of parallelism than on increases in clock frequency, leading to more multi-core and tiled processor architectures. Despite continuing research into parallelizing compilers, programming multiple instruction stream architectures remains difficult. This document describes C-Flow, a compiler system enabling statically-scheduled message passing between programs running on separate processors. When combined with statically-scheduled, low-latency networks like those in the MIT Raw processor, C-Flow provides the programmer with a simple but comprehensive messaging interface that can be used from high-level languages like C. The use of statically-scheduled messaging allows for fine-grained (single-word) messages that would be quite inefficient in the more traditional message passing systems used in cluster computers. Such fine-grained parallelism is possible because, as in systolic array machines, the network provides all of the necessary synchronization between tiles. On the Raw processor, C-Flow reduces development complexity by allowing the programmer to schedule static messages from a high-level language instead of using assembly code. C-Flow programs have been developed for arrays with 64 or more processor tiles and have demonstrated performance within twenty percent of hand-optimized assembly. By Patrick Griffin. M.Eng.

DSpace@MIT

VFC: The Vienna Fortran Compiler

Publication venue: Hindawi Limited
Publication date: 01/01/1999

Crossref

PERFORMANCE OPTIMIZATION OF A STRUCTURED CFD CODE - GHOST ON COMMODITY CLUSTER ARCHITECTURES

Author: Kristipati Pavan K.
Publication venue: UKnowledge
Publication date: 01/01/2008

This thesis focuses on optimizing the performance of an in-house, structured, 2D CFD code, GHOST, on commodity cluster architectures. The basic philosophy of the work is to optimize the cache usage of the code by implementing efficient coding techniques without changing the underlying numerical algorithm. The optimization techniques that were implemented and the resulting changes in performance are presented. Two techniques, external and internal blocking, that were implemented earlier to tune the performance of this code are reviewed. What follows is a further tuning effort to circumvent the problems associated with using the blocking techniques. Later, to establish the universality of the optimization techniques, testing is done on a more complicated test case. All the techniques presented in this thesis have been tested on steady, laminar test cases. It is shown that optimized versions of the code achieve better performance on the variety of commodity cluster architectures chosen in this study.

University of Kentucky

Scaling non-regular shared-memory codes by reusing custom loop schedules

Authors: Artiaga Amouroux Ernest; Ayguadé Parra Eduard; Labarta Mancho Jesús José; Nikolopoulos Dimitrios
Publication venue: Hindawi Limited
Publication date: 01/06/2003

In this paper we explore the idea of customizing and reusing loop schedules to improve the scalability of non-regular numerical codes in shared-memory architectures with non-uniform memory access latency. The main objective is to implicitly set up affinity links between threads and data, by devising loop schedules that achieve balanced work distribution within irregular data spaces and reusing them as much as possible along the execution of the program for better memory access locality. This transformation provides a great deal of flexibility in optimizing locality, without compromising the simplicity of the shared-memory programming paradigm. In particular, the programmer does not need to explicitly distribute data between processors. The paper presents practical examples from real applications and experiments showing the efficiency of the approach. Peer Reviewed. Postprint (author's final draft)

UPCommons. Portal del coneixement obert de la UPC

Directory of Open Access Journals

Abstraction Raising in General-Purpose Compilers

Author: Chelini Lorenzo
Publication venue: Eindhoven University of Technology
Publication date: 31/08/2021

Pure OAI Repository

An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor

Authors: Samuel J. Parker (7203041); Vassilios Chouliaras (1251600)
Publication date: 01/01/2016

Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing, which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations, with 10 implemented on an SoC-FPGA and the remainder on a cycle-accurate simulator. Across the 12 OpenCL benchmarks, results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores) and, in one case, super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature.

Loughborough University Institutional Repository

A design representation model for high-level synthesis

Authors: Gajski Daniel D.; Rundensteiner Elke A.
Publication venue: eScholarship, University of California
Publication date: 01/01/1990

Design tools share and exchange various types of information pertaining to the design. The identification of a uniform design representation to capture this information is essential for the development of a successful design environment. We have done an extensive study on the representation needs of existing database tools in the UCI CADLAB; examples of which are graph compilers for high-level hardware specifications, state schedulers, hardware allocators, and microarchitecture optimizers. The result of this study is the development of a design representation model that will serve as a common internal representation (DDM) for all system and behavioral synthesis tools. DDM thus builds the foundation for a CAD Framework in which design tools can communicate by operating on this common representation. The design information is composed of three separate graph models: the conceptual model, the behavioral model and the structural model. The conceptual model (represented by a Design Entity Graph) captures the overall organization of the design information, such as versions and configurations. The behavioral model (represented by an Augmented Control/Data Flow Graph) describes the design behavior. The structural model (represented by an Annotated Component Graph) captures the hierarchical data path structure and its geometric information. In this paper, we define the last two graph models. They both capture the actual design data of the application domain. Since VHDL has gained increasing popularity as a hardware description language for synthesis, we give numerous examples throughout this report that show how the proposed design representation model can be used to represent VHDL specifications.

eScholarship - University of California