179 research outputs found

    Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors

    Full text link
    Los procesadores superescalares actuales utilizan un reorder buffer (ROB) para contabilizar las instrucciones en vuelo. El ROB se implementa como una cola FIFO first in first out en la que las instrucciones se insertan en orden de programa después de ser decodificadas, y de la que se extraen también en orden de programa en la etapa commit. El uso de esta estructura proporciona un soporte simple para la especulación, las excepciones precisas y la reclamación de registros. Sin embargo, el hecho de retirar instrucciones en orden puede degradar las prestaciones si una operación de alta latencia está bloqueando la cabecera del ROB. Varias propuestas se han publicado atacando este problema. La mayoría utiliza retirada de instrucciones fuera de orden de forma especulativa, requiriendo almacenar puntos de recuperación (checkpoints) para restaurar un estado válido del procesador ante un fallo de especulación. Normalmente, los checkpoints necesitan implementarse con estructuras hardware costosas, y además requieren un crecimiento de otras estructuras del procesador, lo cual a su vez puede impactar en el tiempo de ciclo de reloj. Este problema afecta a muchos tipos de procesadores actuales, independientemente del número de hilos hardware (threads) y del número de núcleos de cómputo (cores) que incluyan. Esta tesis abarca el estudio de la retirada no especulativa de instrucciones fuera de orden en procesadores superescalares, multithread y multicore.Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8535Palanci

    Mitosis based speculative multithreaded architectures

    Get PDF
    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on both these approaches, the proposed techniques so far have shown marginal performance improvements. In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application

    Mitosis based speculative multithreaded architectures

    Get PDF
    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on both these approaches, the proposed techniques so far have shown marginal performance improvements. In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application.Postprint (published version

    Thread-spawning schemes for speculative multithreading

    Get PDF
    Speculative multithreading has been recently proposed to boost performance by means of exploiting thread-level parallelism in applications difficult to parallelize. The performance of these processors heavily depends on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily-detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism to divide programs into threads by searching for those parts of the code that have certain features that could benefit from potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor and results show large performance benefits. When the proposed spawning scheme is compared with traditional heuristics, we outperform them by almost 20%. When a realistic value predictor and a 8-cycle thread initialization penalty is considered, the performance difference between them is maintained. The speed-up over a single thread execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.Peer ReviewedPostprint (published version

    Performance Enhancement of Multicore Architecture

    Get PDF
    Multicore processors integrate several cores on a single chip. The fixed architecture of multicore platforms often fails to accommodate the inherent diverse requirements of different applications. The permanent need to enhance the performance of multicore architecture motivates the development of a dynamic architecture. To address this issue, this paper presents new algorithms for thread selection in fetch stage. Moreover, this paper presents three new fetch stage policies, EACH_LOOP_FETCH, INC-FETCH, and WZ-FETCH, based on Ordinary Least Square (OLS) regression statistic method. These new fetch policies differ on thread selection time which is represented by instructions’ count and window size. Furthermore, the simulation multicore tool, , is adapted to cope with multicore processor dynamic design by adding a dynamic feature in the policy of thread selection in fetch stage. SPLASH2, parallel scientific workloads, has been used to validate the proposed adaptation for multi2sim. Intensive simulated experiments have been conducted and the obtained results show that remarkable performance enhancements have been achieved in terms of execution time and number of instructions per second produces less broadcast operations compared to the typical algorithm

    Multithreading opportunities for program optimizations

    Get PDF
    The introduction of Multiprocessor On Chip (CMP) led to a substantial reformulation of the Moore law stating that the number of cores in a single chip doubles every one year and a half. The tech boom related to CMP gave a strong impulse to parallel program design diminishing its ``gap'' with parallel architectures. Nowadays a leading trend related to high performance products is represented by CMP with multithreading CPU nodes. Basically the CPU multithreading feature tries to overcome the underutilization of superscalar processors, due to the lack of exploitable instruction level parallelism (ILP), allowing the simultaneous processing of different programs during the same time slot. In multithreading architectures a thread is a concurrent computational entity supported directly at firmware level (these threads are usually called hardware threads). Multithreading technology opens a broad range of possible optimizations that can be applied to improve the performance of sequential and parallel applications. This thesis treat four possible optimization targeted for multithreading architectures: Speculative Precomputation, Threaded Multipath Execution, Speculative Multithreading and Communication threads. L'introduzione dei Multiprocessor On Chip (CMP) ha portato ad una sostanziale riformulazione della legge di Moore la quale afferma che il numero di cores in un singolo chip raddoppia ogni anno e mezzo. Il boom tecnologico relativo ai CMP ha dato un grande impulso al design relativo alla programmazione parallela diminuendo il gap con le architetture parallele. Allo stato attuale delle cose, un trend prominente relativo ai prodotti di high performance computing è rappresentato da CMP con nodi caratterizzati da hardware multithreading. Questa tecnologia prova a risolvere il sottoutilizzo di processori superscalari, dovuto alla mancanza di ILP (instruction level parallelism), permettendo la computazione simultanea di diversi programmi durante lo stesso time slot La tecnologia multithreading ha aperto un ampio spettro di possibili ottimizzazioni che possono essere utilizzate al fine di migliorare le performance di applicazioni sequenziali e parallele. Questa tesi tratta quattro possibili ottimizzazioni indirizzate per architetture multithreading: Speculative Precomputation (Helper Thread), Threaded Multipath Execution, Speculative Multithreading and Communication Threads

    Compatible phase co-scheduling on a CMP of multi-threaded processors

    Full text link

    Complementing user-level coarse-grain parallelism with implicit speculative parallelism

    Get PDF
    Multi-core and many-core systems are the norm in contemporary processor technology and are expected to remain so for the foreseeable future. Parallel programming is, thus, here to stay and programmers have to endorse it if they are to exploit such systems for their applications. Programs using parallel programming primitives like PThreads or OpenMP often exploit coarse-grain parallelism, because it offers a good trade-off between programming effort versus performance gain. Some parallel applications show limited or no scaling beyond a number of cores. Given the abundant number of cores expected in future many-cores, several cores would remain idle in such cases while execution performance stagnates. This thesis proposes using cores that do not contribute to performance improvement for running implicit fine-grain speculative threads. In particular, we present a many-core architecture and protocols that allow applications with coarse-grain explicit parallelism to further exploit implicit speculative parallelism within each thread. We show that complementing parallel programs with implicit speculative mechanisms offers significant performance improvements for a large and diverse set of parallel benchmarks. Implicit speculative parallelism frees the programmer from the additional effort to explicitly partition the work into finer and properly synchronized tasks. Our results show that, for a many-core comprising 128 cores supporting implicit speculative parallelism in clusters of 2 or 4 cores, performance improves on top of the highest scalability point by 44% on average for the 4-core cluster and by 31% on average for the 2-core cluster. We also show that this approach often leads to better performance and energy efficiency compared to existing alternatives such as Core Fusion and Turbo Boost. Moreover, we present a dynamic mechanism to choose the number of explicit and implicit threads, which performs within 6% of the static oracle selection of threads. To improve energy efficiency processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on-the-fly. We evaluate the amenability of the proposed explicit plus implicit threads scheme to traditional power management techniques for multithreaded applications and identify room for improvement. We thus augment prior schemes and introduce a novel multithreaded power management scheme that accounts for implicit threads and aims to minimize the Energy Delay2 product (ED2). Our scheme comprises two components: a “local” component that tries to adapt to the different program phases on a per explicit thread basis, taking into account implicit thread behavior, and a “global” component that augments the local components with information regarding inter-thread synchronization. Experimental results show a reduction of ED2 of 8% compared to having no power management, with an average reduction in power of 15% that comes at a minimal loss of performance of less than 3% on average
    corecore