268 research outputs found

    Feedback Driven Annotation and Refactoring of Parallel Programs

    Get PDF

    Mitosis based speculative multithreaded architectures

    Get PDF
    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on both these approaches, the proposed techniques so far have shown marginal performance improvements. In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application.Postprint (published version

    Mitosis based speculative multithreaded architectures

    Get PDF
    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on both these approaches, the proposed techniques so far have shown marginal performance improvements. In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application

    Improving cache locality for thread-level speculation

    Full text link

    HELIX-RC

    Get PDF
    Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85x performance speedup for six SPEC CINT2000 benchmarksThis work was possible thanks to the sponsorship of the Royal Academy of Engineering, EPSRC and the National Science Foundation (award number IIS-0926148).This is the accepted manuscript. The final version is available from IEEE and ACM at http://dl.acm.org/citation.cfm?doid=2678373.2665705

    Runtime-adaptive generalized task parallelism

    Get PDF
    Multi core systems are ubiquitous nowadays and their number is ever increasing. And while, limited by physical constraints, the computational power of the individual cores has been stagnating or even declining for years, a solution to effectively utilize the computational power that comes with the additional cores is yet to be found. Existing approaches to automatic parallelization are often highly specialized to exploit the parallelism of specific program patterns, and thus to parallelize a small subset of programs only. In addition, frequently used invasive runtime systems prohibit the combination of different approaches, which impedes the practicality of automatic parallelization. In the following thesis, we show that specializing to narrowly defined program patterns is not necessary to efficiently parallelize applications coming from different domains. We develop a generalizing approach to parallelization, which, driven by an underlying mathematical optimization problem, is able to make qualified parallelization decisions taking into account the involved runtime overhead. In combination with a specializing, adaptive runtime system the approach is able to match and even exceed the performance results achieved by specialized approaches.Mehrkernsysteme sind heutzutage allgegenwärtig und finden täglich weitere Verbreitung. Und während, limitiert durch die Grenzen des physikalisch Machbaren, die Rechenkraft der einzelnen Kerne bereits seit Jahren stagniert oder gar sinkt, existiert bis heute keine zufriedenstellende Lösung zur effektiven Ausnutzung der gebotenen Rechenkraft, die mit der steigenden Anzahl an Kernen einhergeht. Existierende Ansätze der automatischen Parallelisierung sind häufig hoch spezialisiert auf die Ausnutzung bestimmter Programm-Muster, und somit auf die Parallelisierung weniger Programmteile. Hinzu kommt, dass häufig verwendete invasive Laufzeitsysteme die Kombination mehrerer Parallelisierungs-Ansätze verhindern, was der Praxistauglichkeit und Reichweite automatischer Ansätze im Wege steht. In der Ihnen vorliegenden Arbeit zeigen wir, dass die Spezialisierung auf eng definierte Programmuster nicht notwendig ist, um Parallelität in Programmen verschiedener Domänen effizient auszunutzen. Wir entwickeln einen generalisierenden Ansatz der Parallelisierung, der, getrieben von einem mathematischen Optimierungsproblem, in der Lage ist, fundierte Parallelisierungsentscheidungen unter Berücksichtigung relevanter Kosten zu treffen. In Kombination mit einem spezialisierenden und adaptiven Laufzeitsystem ist der entwickelte Ansatz in der Lage, mit den Ergebnissen spezialisierter Ansätze mitzuhalten, oder diese gar zu übertreffen.Part of the work presented in this thesis was performed in the context of the SoftwareCluster project EMERGENT (http://www.software-cluster.org). It was funded by the German Federal Ministry of Education and Research (BMBF) under grant no. “01IC10S01”. Later work has been supported, also by the German Federal Ministry of Education and Research (BMBF), through funding for the Center for IT-Security, Privacy and Accountability (CISPA) under grant no. “16KIS0344”

    Compile-time support for thread-level speculation

    Get PDF
    Una de las principales preocupaciones de las ciencias de la computación es el estudio de las capacidades paralelas tanto de programas como de los procesadores que los ejecutan. Existen varias razones que hacen muy deseable el desarrollo de técnicas que paralelicen automáticamente el código. Entre ellas se encuentran el inmenso número de programas secuenciales existentes ya escritos, la complejidad de los lenguajes de programación paralelos, y los conocimientos que se requieren para paralelizar un código. Sin embargo, los actuales mecanismos de paralelización automática implementados en los compiladores comerciales no son capaces de paralelizar la mayoría de los bucles en un código [1], debido a la dependencias de datos que existen entre ellos [2]. Por lo tanto, se hace necesaria la búsqueda de nuevas técnicas, como la paralelización especulativa [3-5], que saquen beneficio de las potenciales capacidades paralelas del hardware y arquitecturas multiprocesador actuales. Sin embargo, ésta y otras técnicas requieren la intervención manual de programadores experimentados. Antes de ofrecer soluciones alternativas, se han evaluado las capacidades de paralelización de los compiladores comerciales, exponiendo las limitaciones de los mecanismos de paralelización automática que implementan. El estudio revela que estos mecanismos de paralelización automática sólo alcanzan un 19% de speedup en promedio para los benchmarks del SPEC CPU2006 [6], siendo este un resultado significativamente inferior al obtenido por técnicas de paralelización especulativa [7]. Sin embargo, la paralelización especulativa requiere una extensa modificación manual del código por parte de programadores. Esta Tesis aborda este problema definiendo una nueva cláusula OpenMP [8], llamada ¿speculative¿, que permite señalar qué variables pueden llevar a una violación de dependencia. Además, esta Tesis también propone un sistema en tiempo de compilación que, usando la información sobre los accesos a las variables que proporcionan las cláusulas OpenMP, añade automáticamente todo el código necesario para gestionar la ejecución especulativa de un programa. Esto libera al programador de modificar el código manualmente, evitando posibles errores y una tediosa tarea. El código generado por nuestro sistema enlaza con la librería de ejecución especulativamente paralela desarrollada por Estebanez, García-Yagüez, Llanos y Gonzalez-Escribano [9,10].Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos

    Parallelizing Julia with a Non-Invasive DSL

    Get PDF
    Computational scientists often prototype software using productivity languages that offer high-level programming abstractions. When higher performance is needed, they are obliged to rewrite their code in a lower-level efficiency language. Different solutions have been proposed to address this trade-off between productivity and efficiency. One promising approach is to create embedded domain-specific languages that sacrifice generality for productivity and performance, but practical experience with DSLs points to some road blocks preventing widespread adoption. This paper proposes a non-invasive domain-specific language that makes as few visible changes to the host programming model as possible. We present ParallelAccelerator, a library and compiler for high-level, high-performance scientific computing in Julia. ParallelAccelerator\u27s programming model is aligned with existing Julia programming idioms. Our compiler exposes the implicit parallelism in high-level array-style programs and compiles them to fast, parallel native code. Programs can also run in "library-only" mode, letting users benefit from the full Julia environment and libraries. Our results show encouraging performance improvements with very few changes to source code required. In particular, few to no additional type annotations are necessary
    corecore