14 research outputs found

    Linear analysis and optimization of stream programs

    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (p. 123-127).
    As more complex DSP algorithms are realized in practice, there is an increasing need for high-level stream abstractions that can be compiled without sacrificing efficiency. Toward this end, we present a set of aggressive optimizations that target linear sections of a stream program. Our input language is StreamIt, which represents programs as a hierarchical graph of autonomous filters. A filter is linear if each of its outputs can be represented as an affine combination of its inputs. Linearity is common in DSP components; examples include FIR filters, expanders, compressors, FFTs and DCTs. We demonstrate that several algorithmic transformations, traditionally hand-tuned by DSP experts, can be completely automated by the compiler. First, we present a linear extraction analysis that automatically detects linear filters from the C-like code in their work function. Then, we give a procedure for combining adjacent linear filters into a single filter, a specialized caching strategy to remove redundant computations, and a method for translating a linear filter to operate in the frequency domain. We also present an optimization selection algorithm, which finds the sequence of combination and frequency transformations that yields the maximal benefit. We have completed a fully-automatic implementation of the above techniques as part of the StreamIt compiler. Using a suite of benchmarks, we show that our optimizations remove, on average, 86% of the floating point instructions required. In addition, we demonstrate an average execution time decrease of 450% and an 800% decrease in the best case.
    by Andrew Allinson Lamb. M.Eng.
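    The linearity property described in this abstract, and the procedure for combining adjacent linear filters, can be pictured with a small sketch. The Python snippet below is an illustrative assumption rather than code from the thesis: it models a filter's steady-state work as an affine map y = Ax + b and collapses a two-filter pipeline by composing the matrices, ignoring the peek/pop rate matching that the real analysis must handle.

```python
# Hypothetical sketch (not from the thesis): a linear filter as an affine map
# y = A @ x + b, and the combination of two adjacent linear filters into one.
import numpy as np

def make_fir(weights):
    """Represent a simple FIR filter that consumes len(weights) inputs and
    pushes one weighted sum, encoded as an (A, b) pair."""
    A = np.array([weights], dtype=float)   # 1 x N matrix of taps
    b = np.zeros(1)                        # FIR filters have no constant term
    return A, b

def combine(first, second):
    """Collapse two adjacent linear filters into a single affine map.
    If y = A1 x + b1 and z = A2 y + b2, then z = (A2 A1) x + (A2 b1 + b2)."""
    A1, b1 = first
    A2, b2 = second
    return A2 @ A1, A2 @ b1 + b2

# Example: a 3-tap averaging stage followed by a gain stage.
fir = make_fir([1/3, 1/3, 1/3])
gain = (np.array([[2.0]]), np.zeros(1))
A, b = combine(fir, gain)
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(A @ x + b, 2.0 * x.mean())
```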

    Mapping Stream Programs into the Compressed Domain

    Due to the high data rates involved in audio, video, and signal processing applications, it is imperative to compress the data to decrease the amount of storage used. Unfortunately, this implies that any program operating on the data needs to be wrapped by a decompression and re-compression stage. Re-compression can incur significant computational overhead, while decompression swamps the application with the original volume of data. In this paper, we present a program transformation that greatly accelerates the processing of compressible data. Given a program that operates on uncompressed data, we output an equivalent program that operates directly on the compressed format. Our transformation applies to stream programs, a restricted but useful class of applications with regular communication and computation patterns. Our formulation is based on LZ77, a lossless compression algorithm that is utilized by ZIP and fully encapsulates common formats such as Apple Animation, Microsoft RLE, and Targa. We implemented a simple subset of our techniques in the StreamIt compiler, which emits executable plugins for two popular video editing tools: MEncoder and Blender. For common operations such as color adjustment and video compositing, mapping into the compressed domain offers a speedup roughly proportional to the overall compression ratio. For our benchmark suite of 12 videos in Apple Animation format, speedups range from 1.1x to 471x, with a median of 15x.
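    The key idea of operating directly on the compressed format can be sketched for the element-wise case. The following Python fragment is a hedged illustration, not the paper's implementation: LZ77 tokens are either literal values or (distance, length) copies, so an element-wise operation such as a brightness adjustment only needs to touch the literals, while copy tokens pass through untouched because they repeat data that has already been transformed.

```python
# Illustrative sketch (not the paper's implementation): applying an
# element-wise operation directly to an LZ77 token stream. Literal tokens are
# transformed; copy tokens (distance, length) are passed through unchanged.
from typing import List, Tuple, Union

Literal = int                      # a raw byte / pixel value
Copy = Tuple[int, int]             # (distance back into the output, length)
Token = Union[Literal, Copy]

def map_compressed(tokens: List[Token], f) -> List[Token]:
    """Apply f to every decoded element without decompressing."""
    return [f(t) if isinstance(t, int) else t for t in tokens]

def decode(tokens: List[Token]) -> List[int]:
    """Reference LZ77 decoder used only to check the result."""
    out: List[int] = []
    for t in tokens:
        if isinstance(t, int):
            out.append(t)
        else:
            dist, length = t
            for _ in range(length):
                out.append(out[-dist])
    return out

brighten = lambda v: min(v + 10, 255)
stream = [100, 100, (2, 4), 50]    # decodes to [100, 100, 100, 100, 100, 100, 50]
assert decode(map_compressed(stream, brighten)) == [brighten(v) for v in decode(stream)]
```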

    System Support for Implicitly Parallel Programming

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory.

    Cache optimizations for stream programs

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (leaves 73-75).
    As processor speeds continue to increase, the memory bottleneck remains a primary impediment to attaining performance. Effective use of the memory hierarchy can result in significant performance gains. This thesis focuses on a set of transformations that either reduce cache-miss rate or reduce the number of memory accesses for the class of streaming applications, which are becoming increasingly prevalent in embedded, desktop and high-performance processing. A fully automated optimization algorithm is presented that reduces the memory bottleneck for stream applications developed in the high-level stream programming language StreamIt. This thesis presents four memory optimizations: 1) cache aware fusion, which combines adjacent program components while respecting instruction and data cache constraints, 2) execution scaling, which judiciously repeats execution of program components to improve instruction and state locality, 3) scalar replacement, which converts certain data buffers into a sequence of scalar variables that can be register allocated, and 4) optimized buffer management, which reduces the overall number of memory accesses issued by the program. Cache aware fusion and execution scaling reduce the instruction and data cache-miss rates and are founded upon a simple and intuitive cache model that quantifies the temporal locality for a sequence of actor executions. Scalar replacement and optimized buffer management reduce the number of memory accesses. An experimental evaluation of the memory optimizations is presented for three different architectures: StrongARM 1110, Pentium 3 and Itanium 2. Compared to unoptimized StreamIt code, the memory optimizations presented in this thesis yield a 257% speedup on the StrongARM, a 154% speedup on the Pentium 3, and a 152% speedup on Itanium 2. These numbers represent averages over our streaming benchmark suite. The most impressive speedups are demonstrated on the embedded StrongARM processor, which has only a single data cache and a single instruction cache, thus increasing the overall cost of memory operations and cache misses.
    by Jānis Sermuliņš. M.Eng.
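    The effect of cache aware fusion and scalar replacement on a two-actor pipeline can be shown with a minimal sketch; the code below is an assumed example, not from the thesis. Fusing the actors eliminates the intermediate buffer between them, and the buffer slots become local scalars that a compiler could keep in registers.

```python
# Hypothetical illustration of fusion plus scalar replacement (not from the
# thesis). Two pipelined actors: scale each element, then sum adjacent pairs.

def unfused(inputs):
    # Actor A writes a full intermediate buffer; actor B reads it back later,
    # costing extra memory traffic and possibly cache misses.
    buf = [2.0 * x for x in inputs]                                   # actor A
    return [buf[i] + buf[i + 1] for i in range(0, len(buf) - 1, 2)]   # actor B

def fused(inputs):
    # Fused loop: the intermediate values become local scalars (register
    # candidates); the buffer between the actors disappears.
    out = []
    for i in range(0, len(inputs) - 1, 2):
        a = 2.0 * inputs[i]          # scalar replacing buf[i]
        b = 2.0 * inputs[i + 1]      # scalar replacing buf[i + 1]
        out.append(a + b)
    return out

assert unfused([1, 2, 3, 4]) == fused([1, 2, 3, 4])
```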

    Linear state-space analysis and optimization of StreamIt programs

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 95-97).
    The following thesis entails the construction, testing, modification, and analysis of two systems that couple sample ion introduction methods with a Differential Mobility Spectrometer (DMS). The sample ionization methods used with a custom designed interface for the DMS were Electrospray Ionization (ESI) and Atmospheric Pressure Matrix Assisted Laser Desorption Ionization (AP-MALDI). In addition to system development, Fourier transform and decision tree analyses were explored as alternatives to lead-cluster mapping and genetic algorithms for analyzing and classifying data produced by the systems for large biomolecules. Findings from testing and experiments using the prototype system have led to a second generation design of the interface. Results from data analysis have also provided new insights into different methods for classifying data whose form changes drastically for different sample introduction methods.
    by Sitij Agrawal. M.Eng.

    A streaming computation framework for the Cell Processor

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. Includes bibliographical references (p. 81-82).
    The Cell architecture is a heterogeneous, distributed-memory multicore architecture that features a novel on-chip communication network. Stream programs are particularly well-suited for execution on Cell. This thesis implements a runtime library on Cell specifically designed to support streaming applications and streaming language compilers. The runtime library abstracts the details of Cell's communication network and provides facilities that simplify the task of scheduling stream actors. The library is designed in the context of the StreamIt programming language. This library is used to implement a dynamic scheduling framework. The programmability of high-level schedulers with and without the library is analyzed. We observe that the library does not add significant overhead, provides a number of useful features for schedulers, and greatly simplifies the code required to implement schedulers.
    by Xin David Zhang. M.Eng.

    A hybrid static/dynamic approach to scheduling stream programs

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 95-97).
    Streaming languages such as StreamIt are often utilized to write stream programs that execute on multicore processors. Stream programs consist of actors that operate on streams of data. To execute on multiple cores, actors are scheduled for parallel execution while satisfying data dependencies between actors. In StreamIt, the compiler analyzes data dependencies between actors at compile time and generates a static schedule that determines where and when actors are executed on the available cores. Statically scheduling actors onto cores results in no scheduling overhead at runtime and allows for sophisticated compile-time scheduling optimizations. Unfortunately, static scheduling has a number of severe limitations. The generated static schedule is inflexible and cannot be adapted to run-time conditions, such as cores that are unexpectedly unavailable. Static scheduling may also incorrectly load-balance cores due to inaccurate static work estimates. This thesis contributes a hybrid static/dynamic scheduling approach that attempts to address the limitations of static scheduling. Dynamic load-balancing is utilized to adjust the static schedule to run-time conditions and to correct load imbalances that might exist after static scheduling. Dynamic load-balancing is designed to add very little run-time overhead.
    by Ceryen Tan. M.Eng.
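    A rough sketch of the hybrid approach, under assumptions of my own (a greedy partition and a single rebalancing step, neither taken from the thesis): a static schedule is built from compile-time work estimates, and a lightweight dynamic step moves an actor off the busiest core when measured run times diverge from those estimates.

```python
# Hedged sketch (assumed example, not the thesis implementation) of the hybrid
# idea: build a static assignment from compile-time work estimates, then adjust
# it at run time using measured execution times.

def static_schedule(estimates, num_cores):
    """Greedy static partition: place each actor on the least-loaded core."""
    cores = [[] for _ in range(num_cores)]
    loads = [0.0] * num_cores
    for actor, work in sorted(estimates.items(), key=lambda kv: -kv[1]):
        c = loads.index(min(loads))
        cores[c].append(actor)
        loads[c] += work
    return cores

def rebalance(cores, measured):
    """One dynamic step: if measured loads diverge from the estimates, move an
    actor from the busiest core to the idlest one when that narrows the gap."""
    loads = [sum(measured[a] for a in c) for c in cores]
    busy, idle = loads.index(max(loads)), loads.index(min(loads))
    gap = loads[busy] - loads[idle]
    movable = [a for a in cores[busy] if measured[a] < gap]
    if movable:
        a = min(movable, key=lambda a: abs(measured[a] - gap / 2))
        cores[busy].remove(a)
        cores[idle].append(a)
    return cores

# Static estimates say "fft" is cheap; run-time measurements say otherwise.
cores = static_schedule({"src": 1, "fir": 4, "fft": 2, "sink": 1}, num_cores=2)
cores = rebalance(cores, measured={"src": 1, "fir": 3, "fft": 8, "sink": 1})
```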

    Compiler techniques for scalable performance of stream programs on multicore architectures

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 211-222).
    Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also to concern themselves with issues of granularity, load-balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can effectively and automatically manage the parallelism. Our compiler assumes responsibility for low-level architectural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance for a given multicore target. Stream programming is characterized by regular processing of sequences of data, and it is a natural expression of algorithms in the areas of audio, video, digital signal processing, networking, and encryption. Streaming computation is represented as a graph of independent computation nodes that communicate explicitly over data channels. Our techniques operate on contiguous regions of the stream graph where the input and output rates of the nodes are statically determinable. Within a static region, the compiler first automatically adjusts the granularity and then exploits data, task, and pipeline parallelism in a holistic fashion. We introduce techniques that data-parallelize nodes that operate on overlapping sliding windows of their input, translating serializing state into minimal and parametrized inter-core communication. Finally, for nodes that cannot be data-parallelized due to state, we are the first to automatically apply software-pipelining techniques at a coarse granularity to exploit pipeline parallelism between stateful nodes. Our framework is evaluated in the context of the StreamIt programming language. StreamIt is a high-level stream programming language that has been shown to improve programmer productivity in implementing streaming algorithms. We employ the StreamIt Core benchmark suite of 12 real-world applications to demonstrate the effectiveness of our techniques for varying multicore architectures. For a 16-core distributed memory multicore, we achieve a 14.9x mean speedup. For benchmarks that include sliding-window computation, our sliding-window data-parallelization techniques are required to enable scalable performance for a 16-core SMP multicore (14x mean speedup) and a 64-core distributed shared memory multicore (52x mean speedup).
    by Michael I. Gordon. Ph.D.
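    The sliding-window data parallelization can be illustrated with a small sketch; the Python below is an assumed example, not the compiler's output. Each core is handed a contiguous slice of the input plus an overlap of window-size minus one elements, so the cores produce disjoint slices of the output independently.

```python
# Hypothetical sketch (not from the thesis) of data-parallelizing a
# sliding-window (peeking) node: each core receives its slice of the input
# plus an overlap of window-1 elements, so the output slices are disjoint
# and can be computed independently.

def fir(xs, taps):
    """Sliding-window node: one output per window of len(taps) inputs."""
    n = len(taps)
    return [sum(taps[j] * xs[i + j] for j in range(n))
            for i in range(len(xs) - n + 1)]

def parallel_fir(xs, taps, num_cores):
    n = len(taps)
    total = len(xs) - n + 1                      # number of outputs
    per_core = (total + num_cores - 1) // num_cores
    out = []
    for c in range(num_cores):
        lo = c * per_core
        hi = min(total, lo + per_core)
        if lo >= hi:
            continue
        # Core c reads inputs [lo, hi + n - 1): its own outputs plus the overlap.
        out.extend(fir(xs[lo:hi + n - 1], taps))
    return out

xs = list(range(20))
taps = [0.25, 0.5, 0.25]
assert parallel_fir(xs, taps, num_cores=4) == fir(xs, taps)
```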

    A shared-memory multiprocessor system using the raw tiled architecture

    Thesis (M. Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 203-208).
    Recent trends in microprocessor design have moved away from dedicated hardware mechanisms to exposed architectures in which basic functionality is implemented in software. To demonstrate the flexibility of this scheme, I implement a shared memory system on Raw, a tiled multiprocessor. A traditional directory-based cache coherence system is used. The directories are fully resident on several tiles, and the remaining tiles act as users, with cache maintenance routines accessed via interrupt. Previous implementations of shared-memory systems have mostly relied on a combination of dedicated hardware and user-enabled software hooks, with one notable exception, Alewife, combining basic hardware with software support for corner cases. Here, the focus is moved to placing as much support for basic cases into software as possible. The system is designed to minimise custom hardware and user responsibilities. I prove the feasibility of such a design on an exposed architecture such as Raw.
    by Levente Jakab. M.Eng. and S.B.
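    The directory-based scheme can be sketched in a few lines; the code below is a hypothetical illustration, not taken from the thesis. A directory entry tracks which tiles hold a copy of a cache line, and a write request must invalidate every other copy before ownership is granted; on Raw, those invalidations would reach user tiles via interrupts.

```python
# Minimal sketch (assumed example, not code from the thesis) of the state a
# directory tile keeps per cache line in a directory-based coherence protocol:
# the set of sharers, plus handlers for read and write requests.

class DirectoryEntry:
    def __init__(self):
        self.sharers = set()       # tiles holding a read-only copy
        self.owner = None          # tile holding the line exclusively, if any

    def handle_read(self, tile):
        # A read downgrades any exclusive owner to a sharer, then adds the
        # requester; a real system would also trigger a write-back here.
        if self.owner is not None:
            self.sharers.add(self.owner)
            self.owner = None
        self.sharers.add(tile)
        return "shared copy granted"

    def handle_write(self, tile):
        # A write must invalidate every other copy before granting ownership;
        # on Raw these invalidations would arrive as interrupts at user tiles.
        others = self.sharers | ({self.owner} if self.owner is not None else set())
        invalidate = others - {tile}
        self.sharers.clear()
        self.owner = tile
        return [f"invalidate tile {t}" for t in sorted(invalidate)]

entry = DirectoryEntry()
entry.handle_read(1)
entry.handle_read(2)
print(entry.handle_write(2))       # ['invalidate tile 1']
```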