508 research outputs found

    A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines

    Full text link
    Parallel architectures are essential in order to take advantage of the parallelism inherent in streaming applications. One particular branch of these employ hardware SIMD pipelines. In this paper, we analyse several scheduling techniques, namely ad hoc overlapped execution, modulo scheduling and modulo scheduling with unrolling, all of which aim to efficiently utilize the special architecture design. Our investigation focuses on improving throughput while analysing other metrics that are important for streaming applications, such as register pressure, buffer sizes and code size. Through experiments conducted on several media benchmarks, we present and discuss trade-offs involved when selecting any one of these scheduling techniques.Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241

    Skalabilna implementacija dekodera po normi MPEG korištenjem tokovnog programskog jezika

    Get PDF
    In this paper, we describe a scalable and portable parallelized implementation of a MPEG decoder using a streaming computation paradigm, tailored to new generations of multi--core systems. A novel, hybrid approach towards parallelization of both new and legacy applications is described, where only data--intensive and performance--critical parts are implemented in the streaming domain. An architecture--independent \u27StreamIt\u27 language is used for design, optimization and implementation of parallelized segments, while the developed \u27StreamGate\u27 interface provides a communication mechanism between the implementation domains. The proposed hybrid approach was employed in re--factoring of a reference MPEG video decoder implementation; identifying the most performance--critical segments and re-implementing them in \u27StreamIt\u27 language, with \u27StreamGate\u27 interface as a communication mechanism between the host and streaming kernel. We evaluated the scalability of the decoder with respect to the number of cores, video frame formats, sizes and decomposition. Decoder performance was examined in the presence of different processor load configurations and with respect to the number of simultaneously processed frames.U ovom radu opisujemo skalabilnu i prenosivu implementaciju dekodera po normi MPEG ostvarenu korištenjem paradigme tokovnog raÄŤunarstva, prilagoÄ‘enu novim generacijama višejezgrenih raÄŤunala. Opisan je novi, hibridni pristup paralelizaciji novih ili postojećih aplikacija, gdje se samo podatkovno intenzivni i raÄŤunski zahtjevni dijelovi implementiraju u tokovnoj domeni. Arhitekturno neovisni jezik StreamIt koristi se za oblikovanje, optimiranje i izvedbu paraleliziranih segmenata aplikacije, dok razvijeno suÄŤelje \u27StreamGate\u27 omogućava komunikaciju izmeÄ‘u domena implementacije. PredloĹľeni hibridni pristup razvoju paraleliziranih aplikacija iskorišten je u preoblikovanju referentnog dekodera video zapisa po normi MPEG; identificirani su raÄŤunski zahtjevni segmenti aplikacije i ponovno implementirani u jeziku StreamIt, sa suÄŤeljem \u27StreamGate\u27 kao poveznicom izmeÄ‘u slijedne i tokovne domene. Ispitivana su svojstva skalabilnosti s obzirom na ciljani broj jezgri, format video zapisa i veliÄŤinu okvira te dekompoziciju ulaznih podataka. Svojstva dekodera  su praćena u prisustvu razliÄŤitih opterećenja ispitnog raÄŤunala, i s obzirom na broj istovremeno obraÄ‘ivanih okvira

    Compiler techniques for scalable performance of stream programs on multicore architectures

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.Cataloged from PDF version of thesis.Includes bibliographical references (p. 211-222).Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also concern themselves with issues of granularity, load-balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can effectively and automatically manage the parallelism. Our compiler assumes responsibility for low-level architectural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance for a given multicore target. Stream programming is characterized by regular processing of sequences of data, and it is a natural expression of algorithms in the areas of audio, video, digital signal processing, networking, and encryption. Streaming computation is represented as a graph of independent computation nodes that communicate explicitly over data channels. Our techniques operate on contiguous regions of the stream graph where the input and output rates of the nodes are statically determinable. Within a static region, the compiler first automatically adjusts the granularity and then exploits data, task, and pipeline parallelism in a holistic fashion. We introduce techniques that data-parallelize nodes that operate on overlapping sliding windows of their input, translating serializing state into minimal and parametrized inter-core communication. Finally, for nodes that cannot be data-parallelized due to state, we are the first to automatically apply software-pipelining techniques at a coarse granularity to exploit pipeline parallelism between stateful nodes. Our framework is evaluated in the context of the StreamIt programming language. StreamIt is a high-level stream programming language that has been shown to improve programmer productivity in implementing streaming algorithms. We employ the StreamIt Core benchmark suite of 12 real-world applications to demonstrate the effectiveness of our techniques for varying multicore architectures. For a 16-core distributed memory multicore, we achieve a 14.9x mean speedup. For benchmarks that include sliding-window computation, our sliding-window data-parallelization techniques are required to enable scalable performance for a 16-core SMP multicore (14x mean speedup) and a 64-core distributed shared memory multicore (52x mean speedup).by Michael I. Gordon.Ph.D

    Parallelization of dynamic programming recurrences in computational biology

    Get PDF
    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays: FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130x faster than a modern dual core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3 GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms

    Streamroller : A Unified Compilation and Synthesis System for Streaming Applications.

    Full text link
    The growing complexity of applications has increased the need for higher processing power. In the embedded domain, the convergence of audio, video, and networking on a handheld device has prompted the need for low cost, low power,and high performance implementations of these applications in the form of custom hardware. In a more mainstream domain like gaming consoles, the move towards more realism in physics simulations and graphics has forced the industry towards multicore systems. Many of the applications in these domains are streaming in nature. The key challenge is to get efficient implementations of custom hardware from these applications and map these applications efficiently onto multicore architectures. This dissertation presents a unified methodology, referred to as Streamroller, that can be applied for the problem of scheduling stream programs to multicore architectures and to the problem of automatic synthesis of custom hardware for stream applications. Firstly, a method called stream-graph modulo scheduling is presented, which maps stream programs effectively onto a multicore architecture. Many aspects of a real system, like limited memory and explicit DMAs are modeled in the scheduler. The scheduler is evaluated for a set of stream programs on IBM's Cell processor. Secondly, an automated high-level synthesis system for creating custom hardware for stream applications is presented. The template for the custom hardware is a pipeline of accelerators. The synthesis involves designing loop accelerators for individual kernels, instantiating buffers to store data passed between kernels, and linking these building blocks to form a pipeline. A unique aspect of this system is the use of multifunction accelerators, which improves cost by efficiently sharing hardware between multiple kernels. Finally, a method to improve the integer linear program formulations used in the schedulers that exploits symmetry in the solution space is presented. Symmetry-breaking constraints are added to the formulation, and the performance of the solver is evaluated.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61662/1/kvman_1.pd

    A hybrid static/dynamic approach to scheduling stream programs

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.Cataloged from PDF version of thesis.Includes bibliographical references (p. 95-97).Streaming languages such as Streamlt are often utilized to write stream programs that execute on multicore processors. Stream programs consist of actors that operate on streams of data. To execute on multiple cores, actors are scheduled for parallel execution while satisfying data dependencies between actors. In StreamIt, the compiler analyzes data dependencies between actors at compile-time and generates a static schedule that determines where and when actors are executed on the available cores. Statically scheduling actors onto cores results in no scheduling overhead at runtime and allows for sophisticated compile-time scheduling optimizations. Unfortunately, static scheduling has a number of severe limitations. The generated static schedule is inflexible and cannot be adapted to run-time conditions, such as cores that are unexpectedly unavailable. Static scheduling may also incorrectly load-balance cores due to inaccurate static work estimates. This thesis contributes a hybrid static/dynamic scheduling approach that attempts to address the limitations of static scheduling. Dynamic load-balancing is utilized to adjust the static schedule to run-time conditions and to correct load imbalances that might exist after static scheduling. Dynamic load-balancing is designed to add very little run-time overhead.by Ceryen Tan.M.Eng

    Modeling and Mapping of Optimized Schedules for Embedded Signal Processing Systems

    Get PDF
    The demand for Digital Signal Processing (DSP) in embedded systems has been increasing rapidly due to the proliferation of multimedia- and communication-intensive devices such as pervasive tablets and smart phones. Efficient implementation of embedded DSP systems requires integration of diverse hardware and software components, as well as dynamic workload distribution across heterogeneous computational resources. The former implies increased complexity of application modeling and analysis, but also brings enhanced potential for achieving improved energy consumption, cost or performance. The latter results from the increased use of dynamic behavior in embedded DSP applications. Furthermore, parallel programming is highly relevant in many embedded DSP areas due to the development and use of Multiprocessor System-On-Chip (MPSoC) technology. The need for efficient cooperation among different devices supporting diverse parallel embedded computations motivates high-level modeling that expresses dynamic signal processing behaviors and supports efficient task scheduling and hardware mapping. Starting with dynamic modeling, this thesis develops a systematic design methodology that supports functional simulation and hardware mapping of dynamic reconfiguration based on Parameterized Synchronous Dataflow (PSDF) graphs. By building on the DIF (Dataflow Interchange Format), which is a design language and associated software package for developing and experimenting with dataflow-based design techniques for signal processing systems, we have developed a novel tool for functional simulation of PSDF specifications. This simulation tool allows designers to model applications in PSDF and simulate their functionality, including use of the dynamic parameter reconfiguration capabilities offered by PSDF. With the help of this simulation tool, our design methodology helps to map PSDF specifications into efficient implementations on field programmable gate arrays (FPGAs). Furthermore, valid schedules can be derived from the PSDF models at runtime to adapt hardware configurations based on changing data characteristics or operational requirements. Under certain conditions, efficient quasi-static schedules can be applied to reduce overhead and enhance predictability in the scheduling process. Motivated by the fact that scheduling is critical to performance and to efficient use of dynamic reconfiguration, we have focused on a methodology for schedule design, which complements the emphasis on automated schedule construction in the existing literature on dataflow-based design and implementation. In particular, we have proposed a dataflow-based schedule design framework called the dataflow schedule graph (DSG), which provides a graphical framework for schedule construction based on dataflow semantics, and can also be used as an intermediate representation target for automated schedule generation. Our approach to applying the DSG in this thesis emphasizes schedule construction as a design process rather than an outcome of the synthesis process. Our approach employs dataflow graphs for representing both application models and schedules that are derived from them. By providing a dataflow-integrated framework for unambiguously representing, analyzing, manipulating, and interchanging schedules, the DSG facilitates effective codesign of dataflow-based application models and schedules for execution of these models. As multicore processors are deployed in an increasing variety of embedded image processing systems, effective utilization of resources such as multiprocessor systemon-chip (MPSoC) devices, and effective handling of implementation concerns such as memory management and I/O become critical to developing efficient embedded implementations. However, the diversity and complexity of applications and architectures in embedded image processing systems make the mapping of applications onto MPSoCs difficult. We help to address this challenge through a structured design methodology that is built upon the DSG modeling framework. We refer to this methodology as the DEIPS methodology (DSG-based design and implementation of Embedded Image Processing Systems). The DEIPS methodology provides a unified framework for joint consideration of DSG structures and the application graphs from which they are derived, which allows designers to integrate considerations of parallelization and resource constraints together with the application modeling process. We demonstrate the DEIPS methodology through cases studies on practical embedded image processing systems

    ACOTES project: Advanced compiler technologies for embedded streaming

    Get PDF
    Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

    Performance Aspects of Synthesizable Computing Systems

    Get PDF
    • …
    corecore