Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors
Current superscalar processors use a reorder buffer (ROB) to track in-flight instructions. The ROB is implemented as a first-in, first-out (FIFO) queue: instructions are inserted in program order after being decoded and are removed, also in program order, at the commit stage. This structure provides simple support for speculation, precise exceptions, and register reclamation. However, retiring instructions in order can degrade performance when a long-latency operation blocks the head of the ROB. Several proposals have been published to address this problem. Most of them retire instructions out of order speculatively, which requires storing checkpoints in order to restore a valid processor state after a misspeculation. Checkpoints usually have to be implemented with costly hardware structures and also require enlarging other processor structures, which in turn can affect the clock cycle time. This problem affects many kinds of current processors, regardless of the number of hardware threads and the number of cores they include. This thesis studies the non-speculative out-of-order retirement of instructions in superscalar, multithreaded, and multicore processors.
Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8535
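The head-of-ROB blocking described in this abstract can be illustrated with a minimal sketch. It is not the thesis's simulator or mechanism: the function name, the `completion_cycles` input, and the `commit_width` parameter are hypothetical, and the model assumes an unbounded ROB that already holds every instruction.

```python
from collections import deque

def simulate_in_order_commit(completion_cycles, commit_width=1):
    """Return the cycle at which each instruction retires under in-order commit.

    completion_cycles[i] is the (assumed) cycle at which instruction i
    finishes execution. Instructions sit in a FIFO in program order.
    """
    rob = deque(enumerate(completion_cycles))  # FIFO: oldest instruction at the head
    retire_cycle = {}
    cycle = 0
    while rob:
        cycle += 1
        committed = 0
        # Only the head may retire; an unfinished head blocks all younger entries.
        while rob and committed < commit_width and rob[0][1] <= cycle:
            index, _ = rob.popleft()
            retire_cycle[index] = cycle
            committed += 1
    return [retire_cycle[i] for i in range(len(completion_cycles))]

# Instruction 0 is a long-latency operation; instructions 1 and 2 finish at cycle 1
# but cannot retire until the head does.
print(simulate_in_order_commit([10, 1, 1]))
```

In this toy model the two short instructions finish at cycle 1 yet retire only after cycle 10, which is the performance loss that out-of-order retirement aims to avoid.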
Mitosis based speculative multithreaded architectures
In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream,
with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread
applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law.
Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main
directions have been followed in the research community to take advantage of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the latter partitions the instructions among the cores based on data dependences but avoids a large degree of speculation. Despite the large amount of research on
both these approaches, the proposed techniques so far have shown marginal performance improvements.
In this thesis we propose novel schemes to speed up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices).
Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique fine-grain thread decomposition algorithm that adapts to the available parallelism in the application.
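The constraint that the abstract attributes to Amdahl's law can be made concrete with a short sketch. The function name and the example fractions are illustrative assumptions, not figures from the thesis.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Even with 90% of the work parallelized, 16 cores give well under 16x.
print(round(amdahl_speedup(0.9, 16), 2))
# With half of the work serial, the speedup stays below 2x for any core count.
print(amdahl_speedup(0.5, 1_000_000) < 2.0)
```

The serial fraction bounds the achievable speedup regardless of core count, which is why the abstract focuses on accelerating single-thread sections.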
Thread-spawning schemes for speculative multithreading
Speculative multithreading has recently been proposed to boost performance by exploiting thread-level parallelism in applications that are difficult to parallelize. The performance of these processors depends heavily on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism that divides programs into threads by searching for those parts of the code that have certain features that could benefit from potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor, and results show large performance benefits. Compared with traditional heuristics, the proposed spawning scheme outperforms them by almost 20%. When a realistic value predictor and an 8-cycle thread initialization penalty are considered, the performance difference between them is maintained. The speed-up over single-thread execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.
The dynamic simultaneous multithreaded processor
This dissertation investigates diverse techniques to support multithreading in modern high-performance processors. The mechanisms studied expand the architecture of a high-performance superscalar processor to control efficiently the interaction between software-controlled and hardware-controlled multithreading. Additionally, dynamic speculative mechanisms are proposed to exploit thread-level parallelism (TLP) and instruction-level parallelism (ILP) on a Simultaneous Multithreading (SMT) architecture. First, the hybrid multithreaded execution model is discussed. This model combines software-controlled multithreading with hardware support for efficient context switching and thread scheduling. A thread scheduling technique called set scheduling is introduced and its impact on overall performance is described. An analytical model of the hybrid multithreaded execution is developed and validated by simulation. Through stochastic simulation, we find that applying the hybrid multithreaded execution model results in higher processor utilization than traditional software-controlled multithreading. Next, in the main part of this dissertation, a new architecture is proposed: the Dynamic Simultaneous Multithreading (DSMT) processor. In this architecture, multiple threads are identified and created speculatively at runtime without compiler help. Subsequently, an SMT processor core executes those threads. The performance of a DSMT processor was evaluated with a new execution-driven simulator developed specifically for the purpose. Our simulation-based experimental results show that the DSMT architecture has very good potential to improve an SMT processor's performance when only a single task is available for execution.
Performance Enhancement of Multicore Architecture
Multicore processors integrate several cores on a single chip. The fixed architecture of multicore platforms often fails to accommodate the inherently diverse requirements of different applications. The permanent need to enhance the performance of multicore architectures motivates the development of a dynamic architecture. To address this issue, this paper presents new algorithms for thread selection in the fetch stage. Moreover, this paper presents three new fetch-stage policies, EACH_LOOP_FETCH, INC-FETCH, and WZ-FETCH, based on the Ordinary Least Squares (OLS) regression method. These new fetch policies differ in thread selection time, which is expressed in terms of instruction count and window size. Furthermore, the multicore simulation tool, multi2sim, is adapted to cope with dynamic multicore processor design by adding a dynamic feature to the thread selection policy in the fetch stage. SPLASH2, a suite of parallel scientific workloads, was used to validate the proposed adaptation of multi2sim. Intensive simulation experiments were conducted, and the results show remarkable performance enhancements in terms of execution time and number of instructions per second, with fewer broadcast operations than the typical algorithm.
Multithreading opportunities for program optimizations
The introduction of Multiprocessor On Chip (CMP) led to a substantial reformulation of Moore's law, which now states that the number of cores on a single chip doubles every year and a half.
The tech boom related to CMP gave a strong impulse to parallel program design, narrowing its "gap" with parallel architectures.
Nowadays a leading trend in high-performance products is represented by CMPs with multithreading CPU nodes.
Basically, the CPU multithreading feature tries to overcome the underutilization of superscalar processors, which is due to the lack of exploitable instruction-level parallelism (ILP), by allowing the simultaneous processing of different programs during the same time slot.
In multithreading architectures a thread is a concurrent computational entity supported directly at firmware level (these threads are usually called hardware threads).
Multithreading technology opens a broad range of possible optimizations that can be applied to improve the performance of sequential and parallel applications.
This thesis treats four possible optimizations targeted at multithreading architectures: Speculative Precomputation (Helper Thread), Threaded Multipath Execution, Speculative Multithreading, and Communication Threads.
Complementing user-level coarse-grain parallelism with implicit speculative parallelism
Multi-core and many-core systems are the norm in contemporary processor technology
and are expected to remain so for the foreseeable future. Parallel programming
is, thus, here to stay and programmers have to embrace it if they are to exploit such
systems for their applications. Programs using parallel programming primitives like
PThreads or OpenMP often exploit coarse-grain parallelism, because it offers a good
trade-off between programming effort and performance gain. Some parallel applications
show limited or no scaling beyond a number of cores. Given the abundant
number of cores expected in future many-cores, several cores would remain idle in such
cases while execution performance stagnates. This thesis proposes using cores that do
not contribute to performance improvement for running implicit fine-grain speculative
threads. In particular, we present a many-core architecture and protocols that allow
applications with coarse-grain explicit parallelism to further exploit implicit speculative
parallelism within each thread. We show that complementing parallel programs
with implicit speculative mechanisms offers significant performance improvements for
a large and diverse set of parallel benchmarks. Implicit speculative parallelism frees
the programmer from the additional effort to explicitly partition the work into finer
and properly synchronized tasks. Our results show that, for a many-core comprising
128 cores supporting implicit speculative parallelism in clusters of 2 or 4 cores, performance
improves on top of the highest scalability point by 44% on average for the
4-core cluster and by 31% on average for the 2-core cluster. We also show that this
approach often leads to better performance and energy efficiency compared to existing
alternatives such as Core Fusion and Turbo Boost. Moreover, we present a dynamic
mechanism to choose the number of explicit and implicit threads, which performs
within 6% of the static oracle selection of threads.
To improve energy efficiency, processors allow for Dynamic Voltage and Frequency
Scaling (DVFS), which enables changing their performance and power consumption
on-the-fly. We evaluate the amenability of the proposed explicit plus implicit threads
scheme to traditional power management techniques for multithreaded applications
and identify room for improvement. We thus augment prior schemes and introduce
a novel multithreaded power management scheme that accounts for implicit threads
and aims to minimize the Energy-Delay² product (ED²). Our scheme comprises two
components: a “local” component that tries to adapt to the different program phases
on a per explicit thread basis, taking into account implicit thread behavior, and a
“global” component that augments the local components with information regarding
inter-thread synchronization. Experimental results show a reduction of ED² of 8%
compared to having no power management, with an average reduction in power of
15% that comes at a minimal loss of performance of less than 3% on average.
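As a rough plausibility check of how these figures relate, the Energy-Delay² product can be expanded using energy = power × delay, so ED² = P × D³. The sketch below is an illustration under stated assumptions and not the thesis's methodology: the helper name is hypothetical, the performance loss is treated as a 3% longer execution time, and averages reported across benchmarks do not necessarily compose this way.

```python
def ed2_ratio(power_ratio, delay_ratio):
    """Relative ED^2 after a change, given relative power and relative delay.

    Uses E = P * D, hence E * D^2 = P * D^3.
    """
    return power_ratio * delay_ratio ** 3

# Assumed inputs: 15% lower power, 3% longer execution time (upper bound from the abstract).
ratio = ed2_ratio(0.85, 1.03)
print(f"ED^2 changes by {100 * (ratio - 1):.1f}%")
```

Because delay enters cubed, even a small slowdown offsets a noticeable share of the power savings, which is consistent in magnitude with the reported single-digit ED² reduction.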