1,174 research outputs found
Efficient resources assignment schemes for clustered multithreaded processors
New feature sizes provide larger number of transistors per chip that architects could use in order to further exploit instruction level parallelism. However, these technologies bring also new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction level parallelism is leading us to diminishing returns and therefore exploiting other sources of parallelism like thread level parallelism is needed in order to keep raising performance with a reasonable hardware complexity. On the other hand, clustering architectures have been widely studied in order to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand the reasons why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that gets and average speed up of 17.6% versus Icount improving fairness in 24%.Peer ReviewedPostprint (published version
On the problem of evaluating the performance of multiprogrammed workloads
Multithreaded architectures are becoming more and more popular. In order to evaluate their behavior, several methodologies and metrics have been proposed. A methodology defines when the measurements for a given workload execution are taken. A metric combines those measurements to obtain a final evaluation result. However, since current evaluation methodologies do not provide representative measurements for these metrics, the analysis and evaluation of novel ideas could be either unfair or misleading. Given the potential impact of multithreaded architectures on current and future processor designs, it is crucial to develop an accurate evaluation methodology for them. This paper presents FAME, a new evaluation methodology aimed to fairly measure the performance of multithreaded processors executing multiprogrammed workloads. FAME reexecutes all programs in the workload until all of them are fairly represented in the final measurements taken. We compare FAME with previously used methodologies showing that it provides more accurate measurements, becoming an ideal evaluation methodology to analyze proposals for multithreaded architectures.Peer ReviewedPostprint (published version
Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors
Los procesadores superescalares actuales utilizan un reorder buffer (ROB) para contabilizar las instrucciones en vuelo. El ROB se implementa como una cola FIFO first in first out en la que las instrucciones se insertan en orden de programa después de ser decodificadas, y de la que se extraen también en orden de programa en la etapa commit. El uso de esta estructura proporciona un soporte simple para la especulación, las excepciones precisas y la reclamación de registros. Sin embargo, el hecho de retirar instrucciones en orden puede degradar las prestaciones si una operación de alta latencia está bloqueando la cabecera del ROB. Varias propuestas se han publicado atacando este problema. La mayorÃa utiliza retirada de instrucciones fuera de orden de forma especulativa, requiriendo almacenar puntos de recuperación (checkpoints) para restaurar un estado válido del procesador ante un fallo de especulación. Normalmente, los checkpoints necesitan implementarse con estructuras hardware costosas, y además requieren un crecimiento de otras estructuras del procesador, lo cual a su vez puede impactar en el tiempo de ciclo de reloj. Este problema afecta a muchos tipos de procesadores actuales, independientemente del número de hilos hardware (threads) y del número de núcleos de cómputo (cores) que incluyan. Esta tesis abarca el estudio de la retirada no especulativa de instrucciones fuera de orden en procesadores superescalares, multithread y multicore.Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8535Palanci
Evaluation of OpenMP for the Cyclops multithreaded architecture
Multithreaded architectures have the potential of tolerating large memory and functional unit latencies and increase resource utilization. The Blue Gene/Cyclops architecture, being developed at the IBM T. J. Watson Research Center, is one such systems that offers massive intra-chip parallelism. Although the BG/C architecture was initially designed to execute specific applications, we believe that it can be effectively used on a broad range of parallel numerical applications. Programming such applications for this unconventional design requires a significant porting effort when using the basic built-in mechanisms for thread management and synchronization. In this paper, we describe the implementation of an OpenMP environment for parallelizing applications, currently under development at the CEPBA-IBM Research Institute, targeting BG/C. The environment is evaluated with a set of simple numerical kernels and a subset of the NAS OpenMP benchmarks. We identify issues that were not initially considered in the design of the BG/C architecture to support a programming model such as OpenMP. We also evaluate features currently offered by the BG/C architecture that should be considered in the implementation of an efficient OpenMP layer for massive intra-chip parallel architectures.Peer ReviewedPostprint (author's final draft
Large Scale Evolution of Convolutional Neural Networks Using Volunteer Computing
This work presents a new algorithm called evolutionary exploration of
augmenting convolutional topologies (EXACT), which is capable of evolving the
structure of convolutional neural networks (CNNs). EXACT is in part modeled
after the neuroevolution of augmenting topologies (NEAT) algorithm, with
notable exceptions to allow it to scale to large scale distributed computing
environments and evolve networks with convolutional filters. In addition to
multithreaded and MPI versions, EXACT has been implemented as part of a BOINC
volunteer computing project, allowing large scale evolution. During a period of
two months, over 4,500 volunteered computers on the Citizen Science Grid
trained over 120,000 CNNs and evolved networks reaching 98.32% test data
accuracy on the MNIST handwritten digits dataset. These results are even
stronger as the backpropagation strategy used to train the CNNs was fairly
rudimentary (ReLU units, L2 regularization and Nesterov momentum) and these
were initial test runs done without refinement of the backpropagation
hyperparameters. Further, the EXACT evolutionary strategy is independent of the
method used to train the CNNs, so they could be further improved by advanced
techniques like elastic distortions, pretraining and dropout. The evolved
networks are also quite interesting, showing "organic" structures and
significant differences from standard human designed architectures.Comment: 17 pages, 13 figures. Submitted to the 2017 Genetic and Evolutionary
Computation Conference (GECCO 2017
A Survey of Phase Classification Techniques for Characterizing Variable Application Behavior
Adaptable computing is an increasingly important paradigm that specializes
system resources to variable application requirements, environmental
conditions, or user requirements. Adapting computing resources to variable
application requirements (or application phases) is otherwise known as
phase-based optimization. Phase-based optimization takes advantage of
application phases, or execution intervals of an application, that behave
similarly, to enable effective and beneficial adaptability. In order for
phase-based optimization to be effective, the phases must first be classified
to determine when application phases begin and end, and ensure that system
resources are accurately specialized. In this paper, we present a survey of
phase classification techniques that have been proposed to exploit the
advantages of adaptable computing through phase-based optimization. We focus on
recent techniques and classify these techniques with respect to several factors
in order to highlight their similarities and differences. We divide the
techniques by their major defining characteristics---online/offline and
serial/parallel. In addition, we discuss other characteristics such as
prediction and detection techniques, the characteristics used for prediction,
interval type, etc. We also identify gaps in the state-of-the-art and discuss
future research directions to enable and fully exploit the benefits of
adaptable computing.Comment: To appear in IEEE Transactions on Parallel and Distributed Systems
(TPDS
Runahead threads to improve SMT performance
In this paper, we propose Runahead Threads (RaT) as a valuable solution for both reducing resource contention and exploiting memory-level parallelism in Simultaneous Multithreaded (SMT) processors. Our technique converts a resource intensive memory-bound thread to a speculative light thread under long-latency blocking memory operations. These speculative threads prefetch data and instructions with minimal resources, reducing critical resource conflicts between threads. We compare an SMT architecture using RaT to both state-of-the-art static fetch policies and dynamic resource control policies. In terms of throughput and fairness, our results show that RaT performs better than any other policy. The proposed mechanism improves average throughput by 37% regarding previous static fetch policies and by 28% compared to previous dynamic resource scheduling mechanisms. RaT also improves fairness by 36% and 30% respectively. In addition, the proposed mechanism permits register file size reduction of up to 60% in a SMT processor without performance degradation.Peer ReviewedPostprint (published version
- …