45 research outputs found

    Mitosis based speculative multithreaded architectures

    Get PDF
    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on both these approaches, the proposed techniques so far have shown marginal performance improvements. In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application.Postprint (published version

    Mitosis based speculative multithreaded architectures

    Get PDF
    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on both these approaches, the proposed techniques so far have shown marginal performance improvements. In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application

    Hybrid Designs for Caches and Cores.

    Full text link
    Processor power constraints have come to the forefront over the last decade, heralded by the stagnation of clock frequency scaling. High-performance core and cache designs often utilize power-hungry techniques to increase parallelism. Conversely, the most energy-efficient designs opt for a serial execution to avoid unnecessary overheads. While both of these extremes constitute one-size-fits-all approaches, a judicious mix of parallel and serial execution has the potential to achieve the best of both high-performing and energy-efficient designs. This dissertation examines such hybrid designs for cores and caches. Firstly, we introduce a novel, hybrid out-of-order/in-order core microarchitecture. Instructions that are steered towards in-order execution skip register allocation, reordering and dynamic scheduling. At the same time, these instructions can interleave on an instruction-by-instruction basis with instructions that continue to benefit from these conventional out-of-order mechanisms. Secondly, this dissertation revisits a hybrid technique introduced for L1 caches, way-prediction, in the context of last-level caches that are larger, have higher associativity, and experience less locality.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113484/1/sleimanf_1.pd

    Compiler-Driven Software Speculation for Thread-Level Parallelism

    Get PDF
    Current parallelizing compilers can tackle applications exercising regular access patterns on arrays or affine indices, where data dependencies can be expressed in a linear form. Unfortunately, there are cases that independence between statements of code cannot be guaranteed and thus the compiler conservatively produces sequential code. Programs that involve extensive pointer use, irregular access patterns, and loops with unknown number of iterations are examples of such cases. This limits the extraction of parallelism in cases where dependencies are rarely or never triggered at runtime. Speculative parallelism refers to methods employed during program execution that aim to produce a valid parallel execution schedule for programs immune to static parallelization. The motivation for this article is to review recent developments in the area of compiler-driven software speculation for thread-level parallelism and how they came about. The article is divided into two parts. In the first part the fundamentals of speculative parallelization for thread-level parallelism are explained along with a design choice categorization for implementing such systems. Design choices include the ways speculative data is handled, how data dependence violations are detected and resolved, how the correct data are made visible to other threads, or how speculative threads are scheduled. The second part is structured around those design choices providing the advances and trends in the literature with reference to key developments in the area. Although the focus of the article is in software speculative parallelization, a section is dedicated for providing the interested reader with pointers and references for exploring similar topics such as hardware thread-level speculation, transactional memory, and automatic parallelization

    CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions

    Get PDF
    Efficient Total Store Order (TSO) implementations allow loads to execute speculatively out-of-order. To detect order violations, the load queue (LQ) holds all the in-flight loads and is searched on every invalidation and cache eviction. Moreover, in a simultaneous multithreading processor (SMT), stores also search the LQ when writing to cache. LQ searches entail considerable energy consumption. Furthermore, the processor stalls upon encountering the LQ full or when its ports are busy. Hence, the LQ is a critical structure in terms of both energy and performance. In this work, we observe that the use of the LQ could be dramatically optimized under the guarantees of the datarace-free (DRF) property imposed by modern programming languages. To leverage this observation, we propose CELLO, a software-hardware co-design in which the compiler detects memory operations in DRF regions and the hardware optimizes their execution by safely skipping LQ searches without violating the TSO consistency model. Furthermore, CELLO allows removing DRF loads from the LQ earlier, as they do not need to be searched to detect consistency violations. With minimal hardware overhead, we show that an 8-core 2- way SMT processor with CELLO avoids almost all conservative searches to the LQ and significantly reduces its occupancy. CELLO allows i) to reduce the LQ energy expenditure by 33% on average (up to 53%) while performing 2.8% better on average (up to 18.6%) than the baseline system, and ii) to shrink the LQ size from 192 to only 80 entries, reducing the LQ energy expenditure as much as 69% while performing on par with a mainstream LQ implementation

    Microarchitectural Techniques to Exploit Repetitive Computations and Values

    Get PDF
    La dependencia de datos es una de las principales razones que limitan el rendimiento de los procesadores actuales. Algunos estudios han demostrado, que las aplicaciones no pueden alcanzar más de una decena de instrucciones por ciclo en un procesador ideal, con la simple limitación de las dependencias de datos. Esto sugiere que, desarrollar técnicas que eviten la serialización causada por ellas, son importantes para acelerar el paralelismo a nivel de instrucción y será crucial en los microprocesadores del futuro.Además, la innovación y las mejoras tecnológicas en el diseño de los procesadores de los últimos diez años han sobrepasado los avances en el diseño del sistema de memoria. Por lo tanto, la cada vez mas grande diferencia de velocidades de procesador y memoria, ha motivado que, los actuales procesadores de alto rendimiento se centren en las organizaciones cache para tolerar las altas latencias de memoria. Las memorias cache solventan en parte esta diferencia de velocidades, pero a cambio introducen un aumento de área del procesador, un incremento del consumo energético y una mayor demanda de ancho de banda de memoria, de manera que pueden llegar a limitar el rendimiento del procesador.En esta tesis se proponen diversas técnicas microarquitectónicas que pueden aplicarse en diversas partes del procesador, tanto para mejorar el sistema de memoria, como para acelerar la ejecución de instrucciones. Algunas de ellas intentan suavizar la diferencia de velocidades entre el procesador y el sistema de memoria, mientras que otras intentan aliviar la serialización causada por las dependencias de datos. La idea fundamental, tras todas las técnicas propuestas, consiste en aprovechar el alto porcentaje de repetición de los programas convencionales.Las instrucciones ejecutadas por los programas de hoy en día, tienden a ser repetitivas, en el sentido que, muchos de los datos consumidos y producidos por ellas son frecuentemente los mismos. Esta tesis denomina la repetición de cualquier valor fuente y destino como Repetición de Valores, mientras que la repetición de valores fuente y operación de la instrucción se distingue como Repetición de Computaciones. De manera particular, las técnicas propuestas para mejorar el sistema de memoria se basan en explotar la repetición de valores producida por las instrucciones de almacenamiento, mientras que las técnicas propuestas para acelerar la ejecución de instrucciones, aprovechan la repetición de computaciones producida por todas las instrucciones.Data dependences are some of the most important hurdles that limit the performance of current microprocessors. Some studies have shown that some applications cannot achieve more than a few tens of instructions per cycle in an ideal processor with the sole limitation of data dependences. This suggests that techniques for avoiding the serialization caused by them are important for boosting the instruction-level parallelism and will be crucial for future microprocessors. Moreover, innovation and technological improvements in processor design have outpaced advances in memory design in the last ten years. Therefore, the increasing gap between processor and memory speeds has motivated that current high performance processors focus on cache memory organizations to tolerate growing memory latencies. Caches attempt to bridge this gap but do so at the expense of large amounts of die area, increment of the energy consumption and higher demand of memory bandwidth that can be progressively a greater limit to high performance.We propose several microarchitectural techniques that can be applied to various parts of current microprocessor designs to improve the memory system and to boost the execution of instructions. Some techniques attempt to ease the gap between processor and memory speeds, while the others attempt to alleviate the serialization caused by data dependences. The underlying aim behind all the proposed microarchitectural techniques is to exploit the repetitive behaviour in conventional programs. Instructions executed by real-world programs tend to be repetitious, in the sense that most of the data consumed and produced by several dynamic instructions are often the same. We refer to the repetition of any source or result value as Value Repetition and the repetition of source values and operation as Computation Repetition. In particular, the techniques proposed for improving the memory system are based on exploiting the value repetition produced by store instructions, while the techniques proposed for boosting the execution of instructions are based on exploiting the computation repetition produced by all the instructions

    A compiler cost model for speculative multithreading chip-multiprocessor architectures

    Get PDF
    corecore