7 research outputs found

    Towards automatic parallelization of stream processing applications

    Parallelizing and optimizing code for recent multi-/many-core processors is recognized as a complex task. For this reason, strategies that automatically transform sequential code into parallel code and discover optimization opportunities are crucial to relieve the burden on developers. In this paper, we present a compile-time framework to (semi-)automatically find parallel patterns (Pipeline and Farm) and transform sequential streaming applications into parallel ones using GrPPI, a generic parallel pattern interface. This framework uses a novel pipeline stage-balancing technique which provides the code generator module with the necessary information to produce balanced pipelines. The evaluation, using a synthetic video benchmark and a real-world computer vision application, demonstrates that the presented framework is capable of producing parallel and optimized versions of the application. A comparison study under several thread-core oversubscribed conditions reveals that the framework delivers performance comparable to the Intel TBB programming framework. This work was supported in part by the Spanish Ministerio de Economía y Competitividad through the Project Toward Unification of HPC and Big Data Paradigms under Grant TIN2016-79637-P and in part by the EU Project RePhrase: REfactoring Parallel Heterogeneous Resource-Aware Applications under Grant ICT 644235.

    Compiler techniques for scalable performance of stream programs on multicore architectures

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 211-222).
    Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also concern themselves with issues of granularity, load-balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can effectively and automatically manage the parallelism. Our compiler assumes responsibility for low-level architectural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance for a given multicore target. Stream programming is characterized by regular processing of sequences of data, and it is a natural expression of algorithms in the areas of audio, video, digital signal processing, networking, and encryption. Streaming computation is represented as a graph of independent computation nodes that communicate explicitly over data channels. Our techniques operate on contiguous regions of the stream graph where the input and output rates of the nodes are statically determinable. Within a static region, the compiler first automatically adjusts the granularity and then exploits data, task, and pipeline parallelism in a holistic fashion. We introduce techniques that data-parallelize nodes that operate on overlapping sliding windows of their input, translating serializing state into minimal and parametrized inter-core communication. Finally, for nodes that cannot be data-parallelized due to state, we are the first to automatically apply software-pipelining techniques at a coarse granularity to exploit pipeline parallelism between stateful nodes.
    Our framework is evaluated in the context of the StreamIt programming language. StreamIt is a high-level stream programming language that has been shown to improve programmer productivity in implementing streaming algorithms. We employ the StreamIt Core benchmark suite of 12 real-world applications to demonstrate the effectiveness of our techniques for varying multicore architectures. For a 16-core distributed memory multicore, we achieve a 14.9x mean speedup. For benchmarks that include sliding-window computation, our sliding-window data-parallelization techniques are required to enable scalable performance for a 16-core SMP multicore (14x mean speedup) and a 64-core distributed shared memory multicore (52x mean speedup).
    by Michael I. Gordon. Ph.D.
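    The sliding-window idea mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's compiler technique and does not use StreamIt; the function names `moving_sum` and `parallel_moving_sum` are hypothetical. It only shows, under those assumptions, why a filter that reads overlapping windows can still be split across workers: each worker receives its share of the input plus `window - 1` extra samples shared with the next chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def moving_sum(samples, window):
    """Sequential reference: one output per full window of `window` samples."""
    return [sum(samples[i:i + window]) for i in range(len(samples) - window + 1)]


def parallel_moving_sum(samples, window, workers):
    """Hypothetical helper: split the output range into contiguous chunks.

    Assumes workers >= 1. Each chunk reads `window - 1` samples beyond its
    own range, which is the only data shared between neighbouring workers.
    """
    n_out = len(samples) - window + 1
    if n_out <= 0:
        return []
    chunk = -(-n_out // workers)  # ceiling division

    def run(start):
        stop = min(start + chunk, n_out)
        # Slice overlaps the next chunk by window - 1 samples.
        return moving_sum(samples[start:stop + window - 1], window)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(run, range(0, n_out, chunk))  # results keep input order
    return [y for part in parts for y in part]
```

    Threads are used here only to show the partitioning; in CPython this sketch is not expected to yield a speedup for pure-Python arithmetic.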

    Exploiting non-traditional parallelization for application performance and energy efficiency in parallel systems

    Multicore processors have become ubiquitous in today's computing platforms, extending from smartphones to data centers. However, exploiting the parallelism that they offer remains difficult, especially for legacy applications and applications with large serial components. Even many parallel applications fail to leverage the ample hardware parallelism and run into scalability limits. This creates a gap between the available hardware parallelism and the effective software parallelism. This scenario, known as the parallelization wall, impedes the performance growth that every processor generation used to bring. The challenge, then, is to develop techniques that allow multiple cores to work in concert to accelerate a single thread. This dissertation proposes three such techniques -- software data spreading, inter-core prefetching, and load-balanced pipeline parallelism -- and evaluates them on state-of-the-art real systems. These techniques are software-only and exploit application-level information to best utilize the underlying hardware. Software data spreading migrates a thread intelligently to spread the working set over the aggregate space of different private caches. This reduces expensive cache misses and dramatically improves performance along with energy efficiency when the working set fits in the aggregate cache space. Inter-core prefetching uses one or more helper threads to prefetch data in advance and uses thread migrations to access that data locally. This dissertation extends inter-core prefetching further and introduces two more techniques -- underclocked software prefetching and coalition threading. The former exploits the decoupled execution model of inter-core prefetching to save power. It applies dynamic frequency scaling to the helper thread to leverage its insensitivity to frequency, allowing low-frequency helper threads to bring the same performance benefits as high-frequency helper threads.
    The latter technique, coalition threading, explores the potential of applying inter-core prefetching on top of traditional parallelism to improve the scalability of parallel applications. Finally, this dissertation discusses load-balanced pipeline parallelism, showing analytically how to exploit loop-level pipelining to its maximum potential.