17 research outputs found

    Dynamic Trace-Based Data Dependency Analysis for Parallelization of C Programs

    Get PDF
    Writing parallel code is traditionally considered a difficult task, even when it is tackled from the beginning of a project. In this paper, we demonstrate an innovative toolset that faces this challenge directly. It provides the software developers with profile data and directs them to possible top-level, pipeline-style parallelization opportunities for an arbitrary sequential C program. This approach is complementary to the methods based on static code analysis and automatic code rewriting and does not impose restrictions on the structure of the sequential code or the parallelization style, even though it is mostly aimed at coarse-grained task-level parallelization. The proposed toolset has been utilized to define parallel code organizations for a number of real-world representative applications and is based on and is provided as free source

    Factoring out ordered sections to expose thread-level parallelism

    Get PDF
    With the rise of multi-core processors, researchers are taking a new look at extending the applicability auto-parallelization techniques. In this paper, we identify a dependence pattern on which autoparallelization currently fails. This dependence pattern occurs for ordered sections, i.e. code fragments in a loop that must be executed atomically and in original program order. We discuss why these ordered sections prohibit current auto-parallelizers from working and we present a technique to deal with them. We experimentally demonstrate the efficacy of the technique, yielding significant overall program speedups

    Interactive Trace-Based Analysis Toolset for Manual Parallelization of C Programs

    Get PDF
    Massive amounts of legacy sequential code need to be parallelized to make better use of modern multiprocessor architectures. Nevertheless, writing parallel programs is still a difficult task. Automated parallelization methods can be effective both at the statement and loop levels and, recently, at the task level, but they are still restricted to specific source code constructs or application domains. We present in this article an innovative toolset that supports developers when performing manual code analysis and parallelization decisions. It automatically collects and represents the program profile and data dependencies in an interactive graphical format that facilitates the analysis and discovery of manual parallelization opportunities. The toolset can be used for arbitrary sequential C programs and parallelization patterns. Also, its program-scope data dependency tracing at runtime can complement the tools based on static code analysis and can also benefit from it at the same time. We also tested the effectiveness of the toolset in terms of time to reach parallelization decisions and of their quality. We measured a significant improvement for several real-world representative applications

    Helium: lifting high-performance stencil kernels from stripped x86 binaries to halide DSL code

    Get PDF
    Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide’s state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. We abstract from a forest of concrete dependency trees which contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75% performance improvement, four kernels from IrfanView, leading to 4.97× performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25× improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop’s filters with our lifted implementations, giving 1.12× speedup without affecting the user experience.United States. Dept. of Energy (Award DE-SC0005288)United States. Dept. of Energy (Award DE-SC0008923)United States. Defense Advanced Research Projects Agency (Agreement FA8759-14-2-0009)MIT Energy Initiative (Fellowship

    On-the-fly pipeline parallelism

    Get PDF
    Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with T[subscript 1] work and T[subscript ∞] span (critical-path length), Piper executes the computation on P processors in T[subscript P]≤ T[subscript 1]/P + O(T[subscript ∞] + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.National Science Foundation (U.S.) (Grant CNS-1017058)National Science Foundation (U.S.) (Grant CCF-1162148)National Science Foundation (U.S.). Graduate Research Fellowshi

    A practical approach to exploiting coarse-grained pipeline parallelism in C programs

    No full text
    The emergence of multicore processors has heightened the need for effective parallel programming practices. In addition to writing new parallel programs, the next generation of programmers will be faced with the overwhelming task of migrating decades ’ worth of legacy C code into a parallel representation. Addressing this problem requires a toolset of parallel programming primitives that can broadly apply to both new and existing programs. While tools such as threads and OpenMP allow programmers to express task and data parallelism, support for pipeline parallelism is distinctly lacking. In this paper, we offer a new and pragmatic approach to leveraging coarse-grained pipeline parallelism in C programs. We target the domain of streaming applications, such as audio, video, and digital signal processing, which exhibit regular flows of data. To exploit pipeline parallelism, we equip the programmer with a simple set of annotations (indicating pipeline boundaries) and a dynamic analysis that tracks all communication across those boundaries. Our analysis outputs a stream graph of the application as well as a set of macros for parallelizing the program and communicating the data needed. We apply our methodology to six case studies, including MPEG-2 decoding, MP3 decoding, GMTI radar processing, and three SPEC benchmarks. Our analysis extracts a useful block diagram for each application, and the parallelized versions offer a 2.78x mean speedup on a 4-core machine. 1

    Mechanism for speculative execution of applications parallelized by DOPIPE techniques using stage replication

    Get PDF
    Orientador: Guido Costa Souza de AraújoDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: A utilização máxima dos núcleos de arquiteturas multi-processadas é fundamental para permitir uma utilização completa do paralelismo disponível em processadores modernos. A fim de obter desempenho escalável, técnicas de paralelização requerem um ajuste cuidadoso de: (a) mecanismo arquitetural para especulação; (b) ambiente de execução; e (c) transformações baseadas em software. Mecanismos de hardware e software já foram propostos para tratar esse problema. Estes mecanismos, ou requerem alterações profundas (e arriscadas) nos protocolos de coerência de cache, ou exibem uma baixa escalabilidade de desempenho para uma gama de aplicações. Trabalhos recentes em técnicas de paralelização baseadas em DOPIPE (como DSWP) sugerem que a combinação de versionamento de dados baseado em paginação com especulação em software pode resultar em bons ganhos de desempenho. Embora uma solução apenas em software pareça atrativa do ponto de vista da indústria, essa não utiliza todo o potencial da microarquitetura para detectar e explorar paralelismo. A adição de tags às caches para habilitar o versionamento de dados, conforme recentemente anunciado pela indústria, pode permitir uma melhor exploração de paralelismo no nível da microarquitetura. Neste trabalho, é apresentado um modelo de execução que permite tanto a especulação baseada em DOPIPE, como as técnicas de paralelização especulativas tradicionais. Este modelo é baseado em uma simples abordagem com tags de cache para o versionamento de dados, que interage naturalmente com protocolos de coerência de cache tradicionais, não necessitando que estes sejam alterados. Resultados experimentais, utilizando benchmarks SPEC e PARSEC, revelam um ganho de desempenho geométrico médio de 21.6× para nove programas sequenciais em uma máquina simulada de 24 núcleos, demonstrando uma melhora na escalabilidade quando comparada a uma abordagem apenas em softwareAbstract: Maximal utilization of cores in multicore architectures is key to realize the potential performance available from modern microprocessors. In order to achieve scalable performance, parallelization techniques rely on carefully tunning speculative architecture support, runtime environment and software-based transformations. Hardware and software mechanisms have already been proposed to address this problem. They either require deep (and risky) changes on the existing hardware and cache coherence protocols, or exhibit poor performance scalability for a range of applications. Recent work on DOPIPE-based parallelization techniques (e.g. DSWP) has suggested that the combination of page-based data versioning with software speculation can result in good speed-ups. Although a softwareonly solution seems very attractive from an industry point-of-view, it does not enable the whole potential of the microarchitecture in detecting and exploiting parallelism. The addition of cache tags as an enabler for data versioning, as recently announced in the industry, could allow a better exploitation of parallelism at the microarchitecture level. In this paper we present an execution model that supports both DOPIPE-based speculation and traditional speculative parallelization techniques. It is based on a simple cache tagging approach for data versioning, which integrates smoothly with typical cache coherence protocols, and does not require any changes to them. Experimental results, using SPEC and PARSEC benchmarks, reveal a geometric mean speedup of 21.6x for nine sequential programs in a 24-core simulated CMP, while demonstrate improved scalability when compared to a software-only approachMestradoCiência da ComputaçãoMestre em Ciência da Computaçã
    corecore