293 research outputs found
Efficient resources assignment schemes for clustered multithreaded processors
New feature sizes provide larger number of transistors per chip that architects could use in order to further exploit instruction level parallelism. However, these technologies bring also new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction level parallelism is leading us to diminishing returns and therefore exploiting other sources of parallelism like thread level parallelism is needed in order to keep raising performance with a reasonable hardware complexity. On the other hand, clustering architectures have been widely studied in order to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand the reasons why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that gets and average speed up of 17.6% versus Icount improving fairness in 24%.Peer ReviewedPostprint (published version
Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors
Los procesadores superescalares actuales utilizan un reorder buffer (ROB) para contabilizar las instrucciones en vuelo. El ROB se implementa como una cola FIFO first in first out en la que las instrucciones se insertan en orden de programa después de ser decodificadas, y de la que se extraen también en orden de programa en la etapa commit. El uso de esta estructura proporciona un soporte simple para la especulación, las excepciones precisas y la reclamación de registros. Sin embargo, el hecho de retirar instrucciones en orden puede degradar las prestaciones si una operación de alta latencia está bloqueando la cabecera del ROB. Varias propuestas se han publicado atacando este problema. La mayoría utiliza retirada de instrucciones fuera de orden de forma especulativa, requiriendo almacenar puntos de recuperación (checkpoints) para restaurar un estado válido del procesador ante un fallo de especulación. Normalmente, los checkpoints necesitan implementarse con estructuras hardware costosas, y además requieren un crecimiento de otras estructuras del procesador, lo cual a su vez puede impactar en el tiempo de ciclo de reloj. Este problema afecta a muchos tipos de procesadores actuales, independientemente del número de hilos hardware (threads) y del número de núcleos de cómputo (cores) que incluyan. Esta tesis abarca el estudio de la retirada no especulativa de instrucciones fuera de orden en procesadores superescalares, multithread y multicore.Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8535Palanci
Improving processor efficiency by exploiting common-case behaviors of memory instructions
Processor efficiency can be described with the help of a number of desirable
effects or metrics, for example, performance, power, area, design
complexity and access latency.
These metrics serve as valuable tools used in designing new processors
and they also act as effective standards for comparing current processors.
Various factors impact the efficiency of modern out-of-order processors
and one important factor is the manner in which instructions are processed
through the processor pipeline.
In this dissertation research, we study the impact of load and store
instructions
(collectively known as memory instructions) on processor efficiency,
and show how to improve efficiency by exploiting common-case or
predictable patterns in the behavior of memory instructions.
The memory behavior patterns that we focus on in our research are
the predictability of memory dependences, the predictability in
data forwarding patterns,
predictability in instruction criticality and conservativeness
in resource allocation and
deallocation policies.
We first design a scalable and high-performance memory dependence
predictor and then apply
accurate memory dependence prediction to improve the efficiency of
the fetch engine of a simultaneous multi-threaded processor.
We then use predictable data forwarding patterns to eliminate power-hungry
hardware in the processor with no loss in performance. We then move to
studying instruction criticality to improve
processor efficiency. We study the behavior of critical load instructions
and propose applications that can be optimized using predictable,
load-criticality
information. Finally, we explore conventional techniques for
allocation and deallocation
of critical structures that process memory instructions and propose new
techniques to optimize the same. Our new designs have the potential to reduce
the power and the area required by processors significantly without losing
performance, which lead to efficient designs of processors.Ph.D.Committee Chair: Loh, Gabriel H.; Committee Member: Clark, Nathan; Committee Member: Jaleel, Aamer; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin S.; Committee Member: Prvulovic, Milo
Bandwidth-Aware On-Line Scheduling in SMT Multicores
© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.The memory hierarchy plays a critical role on the performance of current chip multiprocessors. Main memory is shared by all the running processes, which can cause important bandwidth contention. In addition, when the processor implements SMT cores, the L1 bandwidth becomes shared among the threads running on each core. In such a case, bandwidth-aware schedulers emerge as an interesting approach to mitigate the contention. This work investigates the performance degradation that the processes suffer due to memory bandwidth constraints. Experiments show that main memory and L1 bandwidth contention negatively impact the process performance; in both cases, performance degradation can grow up to 40 percent for some of applications. To deal with contention, we devise a scheduling algorithm that consists of two policies guided by the bandwidth consumption gathered at runtime. The process selection policy balances the number of memory requests over the execution time to address main memory bandwidth contention. The process allocation policy tackles L1 bandwidth contention by balancing the L1 accesses among the L1 caches. The proposal is evaluated on a Xeon E5645 platform using a wide set of multiprogrammed workloads, achieving performance benefits up to 6.7 percent with respect to the Linux scheduler.This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program Award.Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2016). Bandwidth-Aware On-Line Scheduling in SMT Multicores. IEEE Transactions on Computers. 65(2):422-434. https://doi.org/10.1109/TC.2015.2428694S42243465
A component-based framework for certification of components in a cloud of HPC services
HPC Shelfis a proposal of a cloud computing platform to provide component-oriented services for High Performance Computing (HPC) applications. This paper presents a Verification-as-a-Service (VaaS) framework for component certification onHPC Shelf. Certification is aimed at providing higher confidence that components of parallel computing systems ofHPC Shelfbehave as expected according to one or more requirements expressed in their contracts. To this end, new abstractions are introduced, starting with certifier components. They are designed to inspect other components and verify them for different types of functional, non-functional and behavioral requirements. The certification framework is naturally based on parallel computing techniques to speed up verification tasks.NORTE-01-0145- FEDER-000037
- …