Packet Transactions: High-level Programming for Line-Rate Switches
Many algorithms for congestion control, scheduling, network measurement,
active queue management, security, and load balancing require custom processing
of packets as they traverse the data plane of a network switch. To run at line
rate, these data-plane algorithms must be in hardware. With today's switch
hardware, algorithms cannot be changed, nor new algorithms installed, after a
switch has been built.
This paper shows how to program data-plane algorithms in a high-level
language and compile those programs into low-level microcode that can run on
emerging programmable line-rate switching chipsets. The key challenge is that
these algorithms create and modify algorithmic state. The key idea to achieve
line-rate programmability for stateful algorithms is the notion of a packet
transaction: a sequential code block that is atomic and isolated from other
such code blocks. We have developed this idea in Domino, a C-like imperative
language to express data-plane algorithms. We show with many examples that
Domino provides a convenient and natural way to express sophisticated
data-plane algorithms, and show that these algorithms can be run at line rate
with modest estimated die-area overhead.
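
The packet-transaction idea described above can be illustrated with a small sketch. This is plain Python, not Domino syntax; the flowlet-style load balancer, the `Packet` fields, and the hop-selection rule are all assumptions chosen for illustration. The only point is that the body of `transaction` reads and updates per-flow state as one sequential block, and each packet's block finishes before the next one begins.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    flow_id: int
    arrival: int          # arrival time in arbitrary ticks
    next_hop: int = -1    # filled in by the transaction


class FlowletBalancer:
    """Hypothetical stateful data-plane algorithm (illustration only)."""

    def __init__(self, num_hops: int, gap: int):
        self.num_hops = num_hops
        self.gap = gap            # idle time that starts a new flowlet
        self.last_time = {}       # per-flow state: last arrival seen
        self.saved_hop = {}       # per-flow state: hop of the current flowlet

    def transaction(self, pkt: Packet) -> Packet:
        # Everything in this method is one sequential block. Running it to
        # completion for one packet before starting the next models the
        # atomicity and isolation of a packet transaction.
        last = self.last_time.get(pkt.flow_id)
        if last is None or pkt.arrival - last > self.gap:
            # Assumed hop-selection rule; a real switch would use a hash unit.
            self.saved_hop[pkt.flow_id] = (pkt.flow_id + pkt.arrival) % self.num_hops
        self.last_time[pkt.flow_id] = pkt.arrival
        pkt.next_hop = self.saved_hop[pkt.flow_id]
        return pkt


def run(packets, balancer):
    # Packets are processed strictly one at a time, in arrival order.
    return [balancer.transaction(p).next_hop for p in packets]
```

Packets of the same flow that arrive within `gap` ticks of each other keep the same next hop; a longer pause lets the transaction pick a new one.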
Better Loop Fusion for LMS
This is my master's thesis, done at PPL at Stanford under the supervision of Prof. Kunle Olukotun. It improves LMS, a framework for embedding DSLs (domain-specific languages) into Scala that provides many general optimizations which any DSL can use for free. I implemented a more powerful and cleaner version of the loop fusion optimization from the compiler world. Loop fusion is an important performance optimization for all languages that feature list comprehensions and translate their high-level operations into loop-based representations. It can decrease runtime, memory footprint, and code size through two different fusion cases. The simpler one, horizontal or side-by-side fusion, fuses adjacent loops that iterate over the same range, enabling further optimizations. The second, vertical or pipeline fusion, fuses a producer and a consumer of data, removing the need for the intermediate data structure.
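
The two fusion cases named in this abstract can be shown on a toy example. The sketch below is hand-written Python, not LMS or Scala output, and the function names are hypothetical; it only contrasts the loop structure before and after fusion.

```python
def unfused(xs):
    # Two adjacent loops over the same range: candidates for horizontal fusion.
    total = 0
    for x in xs:
        total += x
    largest = xs[0]
    for x in xs:
        largest = max(largest, x)

    # A producer loop followed by a consumer loop: candidates for vertical fusion.
    squares = [x * x for x in xs]   # intermediate data structure
    sum_sq = 0
    for s in squares:
        sum_sq += s
    return total, largest, sum_sq


def fused(xs):
    total = 0
    largest = xs[0]
    sum_sq = 0
    for x in xs:
        # Horizontal fusion: both reductions share a single traversal.
        total += x
        largest = max(largest, x)
        # Vertical fusion: the producer (x * x) is inlined into the consumer,
        # so the intermediate list is never built.
        sum_sq += x * x
    return total, largest, sum_sq
```

Both versions return the same values; the fused one makes one pass over `xs` instead of four and allocates no temporary list.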
Compiler for statically scheduled message passing in parallel programs
Thesis (M. Eng.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005, by Patrick Griffin. Includes bibliographical references (p. 95-96).
Performance improvement in future microprocessors will rely more on the exploitation of parallelism than on increases in clock frequency, leading to more multi-core and tiled processor architectures. Despite continuing research into parallelizing compilers, programming multiple-instruction-stream architectures remains difficult. This document describes C-Flow, a compiler system enabling statically scheduled message passing between programs running on separate processors. When combined with statically scheduled, low-latency networks like those in the MIT Raw processor, C-Flow provides the programmer with a simple but comprehensive messaging interface that can be used from high-level languages like C. The use of statically scheduled messaging allows for fine-grained (single-word) messages that would be quite inefficient in the more traditional message passing systems used in cluster computers. Such fine-grained parallelism is possible because, as in systolic array machines, the network provides all of the necessary synchronization between tiles. On the Raw processor, C-Flow reduces development complexity by allowing the programmer to schedule static messages from a high-level language instead of using assembly code. C-Flow programs have been developed for arrays with 64 or more processor tiles and have demonstrated performance within twenty percent of hand-optimized assembly.
PERFORMANCE OPTIMIZATION OF A STRUCTURED CFD CODE - GHOST ON COMMODITY CLUSTER ARCHITECTURES
This thesis focuses on optimizing the performance of GHOST, an in-house, structured, 2D CFD code, on commodity cluster architectures. The basic philosophy of the work is to optimize the cache usage of the code by implementing efficient coding techniques without changing the underlying numerical algorithm. The optimization techniques that were implemented and the resulting changes in performance are presented. Two techniques, external and internal blocking, that were implemented earlier to tune the performance of this code are reviewed. What follows is a further tuning effort to circumvent the problems associated with using the blocking techniques. Later, to establish the universality of the optimization techniques, testing is done on a more complicated test case. All the techniques presented in this thesis have been tested on steady, laminar test cases. The optimized versions of the code are shown to achieve better performance on the variety of commodity cluster architectures chosen in this study.
Scaling non-regular shared-memory codes by reusing custom loop schedules
In this paper we explore the idea of customizing and reusing loop schedules to improve the scalability of non-regular numerical codes on shared-memory architectures with non-uniform memory access latency. The main objective is to implicitly set up affinity links between threads and data by devising loop schedules that achieve balanced work distribution within irregular data spaces, and by reusing them as much as possible during the execution of the program for better memory access locality. This transformation provides a great deal of flexibility in optimizing locality without compromising the simplicity of the shared-memory programming paradigm. In particular, the programmer does not need to explicitly distribute data between processors. The paper presents practical examples from real applications and experiments showing the efficiency of the approach.
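
A minimal sketch of the schedule-reuse idea from this abstract: build a balanced assignment of irregular iterations to threads once, then apply the same assignment on every later pass so each thread keeps touching the same data. The greedy heaviest-first heuristic and the names `build_schedule` and `sweep` are assumptions made for illustration, not the paper's method, and the "threads" here run one after another.

```python
import heapq


def build_schedule(weights, num_threads):
    """Assign iterations to threads so per-thread work is roughly balanced."""
    heap = [(0, t) for t in range(num_threads)]   # (current load, thread id)
    heapq.heapify(heap)
    schedule = [[] for _ in range(num_threads)]
    # Heaviest iterations first, each to the currently least-loaded thread.
    for i in sorted(range(len(weights)), key=lambda i: -weights[i]):
        load, t = heapq.heappop(heap)
        schedule[t].append(i)
        heapq.heappush(heap, (load + weights[i], t))
    return schedule


def sweep(data, schedule):
    # Each inner list stands for the iterations one thread would execute.
    # Reusing the same schedule means the same thread revisits the same
    # elements on every sweep, which is what creates thread-data affinity.
    for thread_iters in schedule:
        for i in thread_iters:
            data[i] += 1
```

The schedule is computed once and passed to every `sweep` call; only when the iteration costs change would it need to be rebuilt.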
An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor
Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing, which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and the target hardware architecture, and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations, with 10 implemented on an SoC-FPGA and the remainder on a cycle-accurate simulator. Across the 12 OpenCL benchmarks, results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores), and in one case super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature.
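
Work-item coalescing, mentioned in this abstract, is commonly understood as wrapping a kernel body in a loop so that one core executes all work-items of a work-group in sequence. The Python sketch below shows that general shape only; the kernel, the launcher, and the assumption that the global size is a multiple of the local size are illustrative, barriers are ignored, and nothing here is taken from the LE1 framework itself.

```python
def vec_add_kernel(global_id, a, b, out):
    # Body of a hypothetical OpenCL-style kernel: one work-item, one element.
    out[global_id] = a[global_id] + b[global_id]


def run_work_group_coalesced(kernel, group_id, local_size, *args):
    # Coalescing: the per-work-item body is wrapped in a loop so that a
    # single core executes every work-item of the group back to back.
    for local_id in range(local_size):
        kernel(group_id * local_size + local_id, *args)


def launch(kernel, global_size, local_size, *args):
    # Assumes global_size is a multiple of local_size. Each work-group could
    # be dispatched to a different core; here they run one after another.
    for group_id in range(global_size // local_size):
        run_work_group_coalesced(kernel, group_id, local_size, *args)
```

With the loop made explicit, a compiler for a wide-issue core can unroll and schedule several work-items' instructions together, which is one plausible reason the technique suits ILP-oriented cores.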
A design representation model for high-level synthesis
Design tools share and exchange various types of information pertaining to the design. The identification of a uniform design representation to capture this information is essential for the development of a successful design environment. We have done an extensive study of the representation needs of existing database tools in the UCI CADLAB, examples of which are graph compilers for high-level hardware specifications, state schedulers, hardware allocators, and microarchitecture optimizers. The result of this study is the development of a design representation model that will serve as a common internal representation (DDM) for all system and behavioral synthesis tools. DDM thus builds the foundation for a CAD framework in which design tools can communicate by operating on this common representation. The design information is composed of three separate graph models: the conceptual model, the behavioral model, and the structural model. The conceptual model (represented by a Design Entity Graph) captures the overall organization of the design information, such as versions and configurations. The behavioral model (represented by an Augmented Control/Data Flow Graph) describes the design behavior. The structural model (represented by an Annotated Component Graph) captures the hierarchical data path structure and its geometric information. In this paper, we define the last two graph models. They both capture the actual design data of the application domain. Since VHDL has gained increasing popularity as a hardware description language for synthesis, we give numerous examples throughout this report that show how the proposed design representation model can be used to represent VHDL specifications.